Recent self-supervised pre-training methods for object detection often rely on generic object proposals for localization and on semantic feature learning for classification, but they yield limited improvements when applied to Detection Transformers (DETR) because of a lack of architectural alignment. We therefore propose an elegant and versatile self-supervised framework tailored to DETR-like models, called Distance-aware Multi-view Contrastive Learning (DisCo DETR). DisCo DETR enhances localization and semantic features through two core components. (i) Distance-aware Multi-view Object Query Fusion explicitly guides object queries to focus on spatially close objects across views, stabilizing training and improving localization accuracy. (ii) Contrastive Learning for DETR uses DETR's native bipartite matching to identify positive output pairs across views and pull them closer, enhancing semantic feature discrimination at no extra matching cost. DisCo DETR integrates seamlessly into DETR-like models and achieves state-of-the-art transfer performance on the PASCAL VOC and COCO benchmarks across multiple variants.
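To make component (ii) concrete, here is a minimal, hedged sketch (not the authors' implementation) of cross-view contrastive learning over object-query embeddings. The assumption is that each view's bipartite (Hungarian) matching already yields, for every query, the index of the ground-truth object it was assigned to (or -1 if unmatched); queries in the two views assigned to the same object form positive pairs, and an InfoNCE-style loss pulls them together against all other cross-view queries. The function name, argument layout, and the temperature value are illustrative.

```python
import numpy as np

def info_nce_across_views(q1, q2, match1, match2, tau=0.1):
    """Illustrative cross-view InfoNCE over matched object queries.

    q1, q2       : (num_queries, dim) query embeddings from view 1 and view 2
    match1, match2: int arrays mapping each query index to a ground-truth
                    object id from bipartite matching (-1 = unmatched)
    tau          : temperature for the similarity logits (assumed value)
    """
    # L2-normalize embeddings so dot products are cosine similarities
    q1 = q1 / np.linalg.norm(q1, axis=1, keepdims=True)
    q2 = q2 / np.linalg.norm(q2, axis=1, keepdims=True)
    sim = q1 @ q2.T / tau  # (N, N) cross-view similarity logits

    losses = []
    for i, obj in enumerate(match1):
        if obj < 0:
            continue  # query i was not matched to any object in view 1
        j = np.where(match2 == obj)[0]
        if len(j) == 0:
            continue  # the same object was not matched in view 2
        j = int(j[0])
        # Cross-entropy with the query matched to the same object as positive
        log_prob = sim[i, j] - np.log(np.exp(sim[i]).sum())
        losses.append(-log_prob)
    return float(np.mean(losses)) if losses else 0.0
```

With identical embeddings in both views and consistent matchings, the positive pair is also the most similar pair, so the loss is close to zero; pulling matched pairs together drives the loss down, which is the discrimination effect the abstract describes.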