Most existing multi-modal trackers adopt uniform fusion strategies and propagate temporal information through mixed tokens, failing to account for modality-specific differences and producing entangled temporal representations. To address these limitations, we propose MDTrack, a multi-modal object tracker with Modality-aware fusion and Decoupled temporal propagation. Specifically, for modality-aware fusion, we allocate a dedicated expert to each modality (Infrared, Event, Depth, and RGB) to process its representation. The gating mechanism within the mixture of experts (MoE) then dynamically selects the optimal experts based on the input features, enabling adaptive, modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model (SSM) structures that independently store and update the hidden states $h$ of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we apply cross-attention between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone via cross-attention, enhancing MDTrack's ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of MDTrack: both MDTrack-S (Modality-Specific Training) and MDTrack-U (Unified-Modality Training) achieve state-of-the-art performance on five multi-modal tracking benchmarks.
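The modality-aware fusion described above rests on a standard mixture-of-experts gate: per-token scores select a few experts whose outputs are combined with softmax weights. The sketch below is a minimal toy illustration of that gating pattern, not the authors' implementation; the dimensions, the linear experts, and names such as `moe_fuse`, `n_experts`, and `top_k` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # token feature dimension (toy value, not from the paper)
n_experts = 4    # one expert per modality: Infrared, Event, Depth, RGB
top_k = 2        # number of experts the gate activates per token

# Each expert is a simple linear map; the gate scores all experts per token.
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts)) / np.sqrt(d)

def moe_fuse(x):
    """x: (n_tokens, d) input features -> (n_tokens, d) fused output."""
    logits = x @ W_gate                           # (n_tokens, n_experts)
    idx = np.argsort(logits, axis=1)[:, -top_k:]  # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, idx[t]]
        w = np.exp(sel - sel.max())               # softmax over selected
        w /= w.sum()                              # experts only
        for weight, e in zip(w, idx[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out

tokens = rng.normal(size=(5, d))
fused = moe_fuse(tokens)
```

Because only the top-k gate weights are renormalized, each token's output is a convex combination of a small subset of expert outputs, which is what lets the gate route features from different modalities to different experts.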
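The decoupled temporal propagation can likewise be reduced to a toy recurrence: two SSMs, each keeping its own hidden state $h$, with an exchange between the streams' inputs before each update. The sketch below uses diagonal state matrices and collapses the paper's cross-attention to a single-query gated mix; all parameter values and the names `cross_mix` and `propagate` are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # feature dimension (toy value)

# Separate diagonal SSM parameters for the RGB and X-modal streams,
# so each stream stores and updates its own hidden state h.
A_rgb, A_x = 0.9 * np.ones(d), 0.8 * np.ones(d)   # state decay per stream
B_rgb, B_x = 0.5 * np.ones(d), 0.5 * np.ones(d)   # input gain per stream

def cross_mix(q, kv):
    """One-token stand-in for cross-attention: the query feature
    attends to the other stream's feature via a sigmoid-gated sum."""
    score = float(q @ kv) / np.sqrt(d)
    gate = 1.0 / (1.0 + np.exp(-score))
    return q + gate * kv

def propagate(frames_rgb, frames_x):
    """Run both SSM recurrences over a frame sequence; the hidden
    states stay decoupled while the inputs exchange information."""
    h_rgb = np.zeros(d)
    h_x = np.zeros(d)
    states = []
    for x_rgb, x_x in zip(frames_rgb, frames_x):
        u_rgb = cross_mix(x_rgb, x_x)           # implicit RGB <- X exchange
        u_x = cross_mix(x_x, x_rgb)             # implicit X <- RGB exchange
        h_rgb = A_rgb * h_rgb + B_rgb * u_rgb   # decoupled state updates
        h_x = A_x * h_x + B_x * u_x
        states.append((h_rgb.copy(), h_x.copy()))
    return states

T = 4
states = propagate(rng.normal(size=(T, d)), rng.normal(size=(T, d)))
```

Keeping `h_rgb` and `h_x` as separate recurrences is the point of the decoupling: neither stream's temporal state is overwritten by the other, while the input-side mixing still lets the two representations stay in sync.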