The homogeneity and heterogeneity across modalities are critical factors that influence multimodal fusion. In Multimodal Sentiment Analysis (MSA), the textual information inherent in the audio modality induces cross-modality homogeneity with the text modality, whereas the mutual independence between the text and vision modalities results in cross-modal heterogeneity. Although existing disentanglement-based methods achieve notable performance gains by separating modality features into distinct subspaces, they overlook these characteristics of cross-modality heterogeneity and homogeneity among the different modalities. To this end, we propose a novel Modality-aware Disentangle and Fusion (MDF) framework to investigate the role of core modality features. Specifically, we first use text as the anchor to disentangle the audio modality and extract its unique modality-specific features, thereby establishing cross-modal heterogeneity among the text, audio, and vision modalities. We then introduce a Cross-Modality Heterogeneity Enhancement (CHE) module to refine these features and further reinforce their heterogeneous characteristics. Finally, a Modality Adaptive Weighting (MAW) module dynamically assigns weights to the text, audio, and vision modalities according to their potential contributions to sentiment prediction, yielding a more effective multimodal representation for MSA. Experiments on multiple benchmarks demonstrate the superiority of MDF, and extensive ablation studies confirm its effectiveness.
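To make the modality-adaptive-weighting idea concrete, below is a minimal PyTorch sketch of how per-modality contribution weights could be computed and used for fusion. This is an illustrative assumption, not the authors' implementation: the class name `ModalityAdaptiveWeighting`, the gating-network layout, and the shared feature dimension `dim` are all hypothetical choices.

```python
# Hypothetical sketch of modality-adaptive weighting (not the authors' code).
# Each modality representation is scored by a small gating network; the
# softmax-normalized scores weight the modalities before fusion.
import torch
import torch.nn as nn


class ModalityAdaptiveWeighting(nn.Module):
    """Assumed interface: takes per-modality feature vectors (text, audio,
    vision) of the same dimension and returns a weighted fused vector."""

    def __init__(self, dim: int):
        super().__init__()
        # One scalar "contribution" score per modality representation.
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.Tanh(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, text: torch.Tensor, audio: torch.Tensor,
                vision: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([text, audio, vision], dim=1)  # (B, 3, dim)
        scores = self.scorer(feats)                        # (B, 3, 1)
        weights = torch.softmax(scores, dim=1)             # per-sample weights
        return (weights * feats).sum(dim=1)                # fused (B, dim)


if __name__ == "__main__":
    maw = ModalityAdaptiveWeighting(dim=128)
    t, a, v = (torch.randn(4, 128) for _ in range(3))
    fused = maw(t, a, v)
    print(fused.shape)  # torch.Size([4, 128])
```

The softmax over the three scores is one simple way to let the network emphasize whichever modality appears most informative for a given sample; the actual MAW module may use a different weighting scheme.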