The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context, and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful, hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA achieves new state-of-the-art BLEU@4 scores of 0.266 and 0.134, respectively, significantly outperforming prior state-of-the-art models.
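To make the disentanglement idea concrete, here is a minimal PyTorch sketch of one possible shared/private latent split with the two constraints the abstract describes. It is not the authors' implementation: the module name `DisentangledVLVAE`, the layer sizes, the MSE alignment loss, and the squared-cosine orthogonality surrogate are all illustrative assumptions (the paper's actual MoE encoders and constrained objective are not specified here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledVLVAE(nn.Module):
    """Hypothetical sketch: each modality is encoded into a shared latent
    (aligned across modalities) and a modality-specific latent (pushed to
    be orthogonal to the shared one). Dimensions are illustrative."""

    def __init__(self, img_dim=2048, txt_dim=768, z_dim=256):
        super().__init__()
        # Per-modality encoders emitting Gaussian parameters (mu, logvar)
        self.img_shared = nn.Linear(img_dim, 2 * z_dim)
        self.img_private = nn.Linear(img_dim, 2 * z_dim)
        self.txt_shared = nn.Linear(txt_dim, 2 * z_dim)
        self.txt_private = nn.Linear(txt_dim, 2 * z_dim)

    @staticmethod
    def reparameterize(stats):
        # Standard VAE reparameterization trick
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, img_feat, txt_feat):
        zs_img = self.reparameterize(self.img_shared(img_feat))
        zp_img = self.reparameterize(self.img_private(img_feat))
        zs_txt = self.reparameterize(self.txt_shared(txt_feat))
        zp_txt = self.reparameterize(self.txt_private(txt_feat))

        # Alignment: shared latents from both modalities should agree.
        align_loss = F.mse_loss(zs_img, zs_txt)

        # Orthogonality: shared and private latents of the same modality
        # should carry non-overlapping information (soft surrogate via
        # squared cosine similarity).
        ortho_loss = (F.cosine_similarity(zs_img, zp_img, dim=-1).pow(2).mean()
                      + F.cosine_similarity(zs_txt, zp_txt, dim=-1).pow(2).mean())

        return (zs_img, zp_img, zs_txt, zp_txt), align_loss, ortho_loss
```

Because the shared latents are trained to agree across modalities, a decoder conditioned on them can, in principle, fall back on the image-side shared latent alone when the clinical text is missing, which is one plausible mechanism for the missing-modality robustness the abstract claims.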
