Automated analysis of temporal changes in multimodal retinal images is critical for the prognostic assessment of ophthalmic diseases. Unlike traditional single-timepoint diagnosis, tracking longitudinal changes across multiple imaging modalities introduces significant data bias challenges: (1) imbalanced sample counts across modalities hinder knowledge integration for minority modalities; (2) heterogeneous visual patterns across modalities undermine the perception of disease-relevant biomarkers. To tackle these issues, we propose a Modality-Incremental Expert Aggregation Network (MoEA-Net), which unifies inter-modal integration and intra-modal perception for enhanced retinal prognostic prediction. Specifically, we employ a large language model (LLM) with modality-specific incremental LoRA layers to effectively integrate knowledge from imbalanced data. In addition, we introduce a Spatiotemporal-aware Expert (SAE) module to better capture both anatomical structures and longitudinal changes within each modality. By progressively combining the SAE module with incremental LoRA, MoEA-Net supports continual knowledge accumulation and improves reasoning accuracy. Experimental results show that MoEA-Net achieves state-of-the-art performance on subretinal fluid change and visual recovery classification tasks, validating its effectiveness. Our code will be open-sourced upon acceptance.
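The abstract does not spell out how the modality-specific incremental LoRA layers are realized, so the following is only a minimal, hypothetical PyTorch sketch of the general idea: a frozen base linear layer (standing in for an LLM projection) with one low-rank adapter per imaging modality, where new adapters are registered incrementally so earlier modalities' knowledge is left untouched. All names here (e.g., ModalityIncrementalLoRALinear, add_modality) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn as nn


class ModalityIncrementalLoRALinear(nn.Module):
    """Frozen base linear layer with one LoRA adapter per modality (sketch)."""

    def __init__(self, in_features: int, out_features: int,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # base LLM weights stay frozen
        self.base.bias.requires_grad_(False)
        self.rank = rank
        self.scale = alpha / rank
        self.lora_A = nn.ModuleDict()  # modality name -> down-projection
        self.lora_B = nn.ModuleDict()  # modality name -> up-projection

    def add_modality(self, name: str) -> None:
        """Register a fresh trainable adapter when a new modality arrives."""
        in_f, out_f = self.base.in_features, self.base.out_features
        self.lora_A[name] = nn.Linear(in_f, self.rank, bias=False)
        self.lora_B[name] = nn.Linear(self.rank, out_f, bias=False)
        nn.init.zeros_(self.lora_B[name].weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        out = self.base(x)
        if modality in self.lora_A:  # route through that modality's adapter only
            out = out + self.scale * self.lora_B[modality](self.lora_A[modality](x))
        return out


# Usage: adapters are added incrementally, e.g. OCT first, then fundus images.
layer = ModalityIncrementalLoRALinear(768, 768)
layer.add_modality("oct")
layer.add_modality("fundus")
x = torch.randn(4, 768)
y = layer(x, modality="oct")  # only the OCT adapter modifies the frozen base
```

Because each adapter is trained in isolation while the base weights stay frozen, adding a new modality cannot overwrite what was learned for earlier ones, which is the property that makes the incremental setup robust to imbalanced modality samples.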