The rapid advancement of large language models (LLMs) has revitalised research in Emotion Recognition in Conversation (ERC). However, existing LLM-based ERC approaches operate solely on textual input, whereas multimodal LLM (MLLM)-based emotion recognition methods in non-conversational scenarios typically perform only basic multimodal fusion and fail to consider speaker-sensitive contextual dependencies, which limits their performance on ERC tasks. To integrate multimodal cues effectively and to address these limitations in handling contextual dependencies, we propose a novel LLM-based framework, Causal-ERC, which captures context representations within each modality and incorporates them into the LLM. Moreover, our experimental results show that LLMs perform poorly on long conversations. To improve LLMs' ability to model long conversations, we adjust the corresponding causal prompts according to the causal type of each utterance. Experiments on two benchmark multimodal ERC (MERC) datasets demonstrate that our Causal-ERC framework consistently outperforms existing state-of-the-art approaches and improves LLMs' performance in long-context scenarios.