Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments by following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite these strengths, a substantial gap in task-completion performance persists between LLM-based approaches and domain experts, because LLMs inherently struggle to comprehend real-world spatial correlations precisely; additionally, LLM inference can make the decision-making process considerably inefficient. To address these issues, we propose a novel dual-process thinking framework, dubbed $R^3$, that integrates LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a multimodal LLM as its backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning from the LLM. The Regulator monitors navigation progress and selects the appropriate thinking mode according to three criteria, integrating the Runner and Ruminator harmoniously. Experimental results show that $R^3$ significantly outperforms other state-of-the-art methods, exceeding them by 3.28% in SPL and 3.30% in RGSPL on the REVERIE benchmark, highlighting the effectiveness of our method in handling challenging VLN tasks.
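To make the dual-process control flow concrete, the sketch below shows how a Regulator might arbitrate between a fast Runner and a slow Ruminator at each navigation step. Everything here is an illustrative assumption for exposition: the class and method names, the `DummyEnv` stub, and the oscillation-based trigger are hypothetical stand-ins, since the abstract does not specify the paper's actual three criteria or module interfaces.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Observation:
    """Panoramic features and candidate waypoints at the current viewpoint."""
    features: List[float]
    candidates: List[int]

class Runner:
    """Fast System-1 expert: a lightweight trained VLN policy (stubbed here)."""
    def act(self, instruction: str, obs: Observation) -> int:
        # A real expert would score every candidate waypoint; we pick the first.
        return obs.candidates[0]

class Ruminator:
    """Slow System-2 reasoner: a multimodal LLM with CoT prompting (stubbed)."""
    def act(self, instruction: str, obs: Observation, history: List[dict]) -> int:
        # A real implementation would prompt the LLM with the instruction,
        # trajectory history, and candidate descriptions, then parse its answer.
        return obs.candidates[-1]

class Regulator:
    """Monitors progress and decides when deliberate reasoning is needed."""
    def needs_rumination(self, history: List[dict]) -> bool:
        # Hypothetical trigger standing in for the paper's three criteria:
        # fire when the agent revisits a node within its last four steps.
        recent = [step["node"] for step in history[-4:]]
        return len(recent) != len(set(recent))

class DummyEnv:
    """Trivial stand-in environment so the sketch runs end to end."""
    def __init__(self) -> None:
        self.t = 0
    def reset(self) -> Observation:
        self.t = 0
        return Observation(features=[], candidates=[0, 1])
    def step(self, action: int) -> Tuple[Observation, bool, str]:
        self.t += 1
        return Observation(features=[], candidates=[0, 1]), self.t >= 3, f"node-{self.t}"

def navigate(instruction: str, env: DummyEnv, max_steps: int = 15) -> None:
    runner, ruminator, regulator = Runner(), Ruminator(), Regulator()
    history: List[dict] = []
    obs = env.reset()
    for _ in range(max_steps):
        if regulator.needs_rumination(history):
            action = ruminator.act(instruction, obs, history)  # slow, reasoned
        else:
            action = runner.act(instruction, obs)              # fast default
        obs, done, node = env.step(action)
        history.append({"node": node, "action": action})
        if done:
            break

navigate("Go to the bedroom and clean the mirror above the sink.", DummyEnv())
```

The point of this structure is that the expensive LLM call sits behind a gate: the lightweight expert handles the common case, and the LLM is invoked only when the Regulator's criteria fire, which is how the framework reconciles accuracy with inference efficiency.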