Vision-language object tracking overcomes the limitations of relying solely on visual features by leveraging language descriptions of the target to provide cross-modal semantic information, thereby enhancing model robustness in complex scenarios. However, most existing high-performance vision-language trackers are trained jointly on pure visual data and vision-language multimodal data. Because language annotations are relatively sparse in such data, these trackers tend to prioritize the localization role of visual features and pay less attention to language information. To mitigate this issue, we propose a novel vision-language tracker, ADTrack (Aware Distillation for Robust Vision-Language Tracking under Linguistic Sparsity). We introduce a knowledge distillation framework that pairs a knowledge-rich teacher model with a lightweight student model to establish correlations between the vision and language modalities, enabling efficient cross-modal modeling of visual information and language descriptions. Specifically, our lightweight student module distills language encoding capabilities from large language models through teacher-guided learning on the input language description, while performing target-aware perception on template images using that description to generate more effective template features for subsequent visual extraction. Furthermore, to ensure perceptual robustness in linguistically sparse scenarios, we simulate language-deficient conditions during training and employ contrastive learning to enhance model adaptability. Extensive experiments demonstrate that ADTrack reduces parameters by over 50% while achieving state-of-the-art (SOTA) performance and speed on vision-language tracking benchmarks, including LaSOT, LaSOText, TNL2K, OTB-Lang, and MGIT.
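The two training objectives the abstract mentions — teacher-guided distillation of language encodings and contrastive alignment under simulated language dropout — can be sketched as follows. This is a minimal illustrative example, not ADTrack's actual implementation: the MSE distillation loss, the InfoNCE-style contrastive loss, the embedding dimensionality, and the `language_available` flag (standing in for the simulated language-deficient condition) are all assumptions made here for clarity.

```python
import math

def mse_distill_loss(teacher_emb, student_emb):
    # Teacher-guided learning (assumed form): pull the lightweight student's
    # language embedding toward the frozen teacher's embedding.
    return sum((t - s) ** 2 for t, s in zip(teacher_emb, student_emb)) / len(teacher_emb)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, tau=0.07):
    # InfoNCE-style contrastive loss (assumed form): align the anchor with
    # its positive while pushing it away from the negatives.
    logits = [cosine(anchor, positive) / tau] + [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

def training_loss(vis_emb, teacher_emb, student_emb, negatives, language_available):
    # Always distill the teacher's language encoding into the student.
    loss = mse_distill_loss(teacher_emb, student_emb)
    if language_available:
        # Language present: contrastively align vision and student language embeddings.
        loss += info_nce(vis_emb, student_emb, negatives)
    # When language is dropped (simulated sparsity), the contrastive term is
    # skipped so the model learns to remain robust from visual cues alone.
    return loss
```

With identical teacher and student embeddings the distillation term vanishes, and a matching vision-language pair yields a near-zero contrastive term, so the combined loss rewards both objectives at once.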