Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. To support this, we design a rigorous two-stage annotation pipeline and a curriculum learning strategy that enables effective training with limited supervision. Using this pipeline, we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, a benchmark with 8.6k QA examples over full scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks.
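The abstract does not spell out the exact reward formulation of EviGRPO, so the following Python sketch only illustrates one plausible shape of an evidence-aware reward of this kind. The function name, the `evidence_weight` parameter, the F1-style page-overlap term, and exact-match answer scoring are all illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch (not the authors' implementation) of an evidence-aware reward
# in the spirit of EviGRPO: the model first cites the pages it used as evidence,
# and the reward combines answer correctness with how well those cited pages
# match the annotated evidence pages (coarse retrieval, then fine answering).

def evidence_aware_reward(
    predicted_answer: str,
    gold_answer: str,
    predicted_pages: set[int],
    gold_pages: set[int],
    evidence_weight: float = 0.5,  # hypothetical weighting between the two terms
) -> float:
    """Return a scalar reward in [0, 1] for one rollout."""
    # Coarse step: did the model retrieve the right evidence pages?
    if predicted_pages or gold_pages:
        overlap = len(predicted_pages & gold_pages)
        evidence_f1 = (
            2 * overlap / (len(predicted_pages) + len(gold_pages)) if overlap else 0.0
        )
    else:
        evidence_f1 = 1.0  # no evidence required and none cited

    # Fine step: is the final answer correct (exact match as a stand-in metric)?
    answer_score = float(predicted_answer.strip().lower() == gold_answer.strip().lower())

    return evidence_weight * evidence_f1 + (1.0 - evidence_weight) * answer_score


# Example: the model cites pages {2, 5} while the annotation marks {2, 7}.
print(evidence_aware_reward("1958", "1958", {2, 5}, {2, 7}))  # 0.75
```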