Multimodal fusion of color fundus photography (CFP) and optical coherence tomography (OCT) B-scan images has demonstrated superior diagnostic potential for retinal diseases compared to single-modality approaches. However, existing fusion paradigms, whether based on naive concatenation or attention mechanisms, treat cross-modal interactions indiscriminately and lack adaptive modulation of modality-specific contributions under varying clinical scenarios. We propose an adaptive fusion framework that dynamically routes and refines multimodal signals to enhance disease recognition. The framework comprises two key components: 1) Dynamic Cross-Modal Expert Routing (CMER), which selectively activates convolutional neural network (CNN) experts from one modality based on contextual guidance from the other, ensuring that only the most relevant feature extractors contribute to fusion; and 2) Top-K Expert-Guided Wavelet Fusion (TEWF), which applies a discrete wavelet transform (DWT) to decompose the selected features into low- and high-frequency subbands. Cross-modal attention is then applied specifically to the high-frequency components, where lesion-specific microstructures reside, enabling frequency-aware fusion. Finally, an inverse DWT (IDWT) reconstructs the fused representation, weighted by CMER-derived importance scores to amplify informative modality cues while suppressing redundancy. Experimental validation on two multimodal retinal datasets demonstrates that our method achieves state-of-the-art performance, outperforming existing fusion strategies by significant margins in both disease classification accuracy and robustness.
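To make the two components concrete, the sketch below reimplements the described pipeline in generic PyTorch. It is a minimal illustration under our own assumptions, not the authors' implementation: the module names (CrossModalExpertRouting, wavelet_fusion), the single-level Haar wavelet, the choice of multi-head attention for the high-frequency cross-modal step, and hyperparameters such as num_experts=4, top_k=2, and 64-channel features are all hypothetical.

```python
import torch
import torch.nn as nn


def haar_dwt(x):
    """Single-level 2D Haar DWT of a (B, C, H, W) feature map (H, W even).
    Returns the low-frequency subband LL and high-frequency subbands (LH, HL, HH)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)


def haar_idwt(ll, highs):
    """Inverse of haar_dwt: reconstructs the full-resolution feature map."""
    lh, hl, hh = highs
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    out = torch.zeros(ll.size(0), ll.size(1), ll.size(2) * 2, ll.size(3) * 2,
                      device=ll.device, dtype=ll.dtype)
    out[:, :, 0::2, 0::2] = a
    out[:, :, 0::2, 1::2] = b
    out[:, :, 1::2, 0::2] = c
    out[:, :, 1::2, 1::2] = d
    return out


class CrossModalExpertRouting(nn.Module):
    """CMER-style routing: expert CNNs for one modality are gated by the
    pooled context of the *other* modality, and only the top-k are used."""

    def __init__(self, channels, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_experts)])
        self.gate = nn.Linear(channels, num_experts)   # cross-modal context -> expert logits
        self.top_k = top_k

    def forward(self, feat, context_feat):
        ctx = context_feat.mean(dim=(2, 3))                    # (B, C) global context
        probs = self.gate(ctx).softmax(dim=-1)                 # (B, E) expert probabilities
        topk_probs, idx = probs.topk(self.top_k, dim=-1)       # select top-k experts
        weights = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        outs = torch.stack([e(feat) for e in self.experts], dim=1)      # (B, E, C, H, W)
        picked = torch.gather(
            outs, 1, idx[:, :, None, None, None].expand(-1, -1, *outs.shape[2:]))
        fused = (picked * weights[:, :, None, None, None]).sum(dim=1)   # (B, C, H, W)
        importance = topk_probs.sum(dim=-1, keepdim=True)      # (B, 1) modality importance
        return fused, importance


def wavelet_fusion(feat_cfp, feat_oct, imp_cfp, imp_oct, attn):
    """TEWF-style fusion: DWT both modalities, cross-attend only the
    high-frequency subbands, then IDWT weighted by CMER importance scores."""
    ll_c, hi_c = haar_dwt(feat_cfp)
    ll_o, hi_o = haar_dwt(feat_oct)
    wc, wo = imp_cfp[:, :, None, None], imp_oct[:, :, None, None]
    ll = (wc * ll_c + wo * ll_o) / (wc + wo)        # importance-weighted low-frequency mix
    fused_hi = []
    for hc, ho in zip(hi_c, hi_o):
        b, c, h, w = hc.shape
        q = hc.flatten(2).transpose(1, 2)           # (B, HW, C): CFP high-freq queries
        kv = ho.flatten(2).transpose(1, 2)          # OCT high-freq keys/values
        out, _ = attn(q, kv, kv)                    # cross-modal attention on details
        fused_hi.append(out.transpose(1, 2).reshape(b, c, h, w))
    return haar_idwt(ll, fused_hi)                  # frequency-aware fused feature


# Illustrative usage with hypothetical 64-channel backbone features.
router_cfp = CrossModalExpertRouting(64)
router_oct = CrossModalExpertRouting(64)
attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)

f_cfp, f_oct = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
e_cfp, imp_cfp = router_cfp(f_cfp, context_feat=f_oct)        # OCT context guides CFP experts
e_oct, imp_oct = router_oct(f_oct, context_feat=f_cfp)        # CFP context guides OCT experts
fused = wavelet_fusion(e_cfp, e_oct, imp_cfp, imp_oct, attn)  # (2, 64, 32, 32)
```

In this reading, the routing weights realize the "adaptive modulation of modality-specific contributions," while restricting attention to the LH/HL/HH subbands is what makes the fusion frequency-aware; how the abstract's actual experts, gating network, and score weighting are parameterized is not specified here and would differ in the original work.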