Singapore

Current Zero-Shot Temporal Action Localization (ZSTAL) methods, whether training-based or training-free, still predominantly rely on a single, unified query to localize an entire action. This unified representation is fundamentally ill-suited to complex real-world activities, as it fails to capture their internal compositional structure or to adapt to dynamic, multi-stage variations across videos. To address this, we regard ZSTAL as a compositional reasoning task and introduce CASCADE, a Context-Aware Staged Action DEcomposition framework. Inspired by the human cognitive process of perceiving context, decomposing events, and reconstructing instances, CASCADE follows a training-free pipeline. It first perceives the video's context by leveraging a Multimodal Large Language Model (MLLM) to filter out irrelevant actions and then generate a rich, video-specific caption for each action present in the video. An LLM then decomposes this caption into multiple, temporally ordered stages, which serve as fine-grained queries to guide the MLLM in estimating frame-level confidence scores. Because this decomposition can fragment a single action, a novel hierarchical merging logic then reconstructs complete instances by fusing these preliminary temporal segments based on their semantic progression and coherence. Extensive experiments and ablation studies on THUMOS14 and ActivityNet-1.3 show that CASCADE not only sets a new state of the art among training-free methods but, most notably, significantly outperforms all prior training-based approaches on ActivityNet-1.3.
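The merging step can be pictured with a small sketch. The abstract does not state the actual merging rule, so everything below is an assumption made for illustration: each preliminary segment carries the index of the stage that produced it, and two neighbouring segments are fused when the stage index does not move backwards and the temporal gap stays under a hypothetical threshold `max_gap`. The names `Segment` and `merge_stage_segments` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    stage: int    # index of the decomposed stage that matched this segment


def merge_stage_segments(segments, max_gap=1.0):
    """Fuse stage-level segments into action instances.

    Assumed rule (not taken from the paper): a segment joins the current
    instance if its stage index is not lower than the previous segment's
    stage and it starts at most `max_gap` seconds after the instance ends.
    """
    instances = []
    current = None      # [start, end] of the instance being built
    last_stage = None   # stage index of the most recent segment
    for seg in sorted(segments, key=lambda s: s.start):
        continues = (
            current is not None
            and seg.stage >= last_stage
            and seg.start - current[1] <= max_gap
        )
        if continues:
            current[1] = max(current[1], seg.end)
        else:
            if current is not None:
                instances.append(tuple(current))
            current = [seg.start, seg.end]
        last_stage = seg.stage
    if current is not None:
        instances.append(tuple(current))
    return instances
```

With this rule, a run of segments whose stages progress 0, 1, 2 collapses into one instance, while a return to stage 0 starts a new instance even if the gap is small.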

AAAI 2026

Decompose and Conquer: Compositional Reasoning for Zero-Shot Temporal Action Localization

video understanding & activity analysis


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (e.g., eyelid shape or facial structure) can degrade model performance.
Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples.
Efficient fine-tuning is critical in performing this domain adaptation: data and computation resources can be limited, especially for on-device customization.
While popular parameter-efficient fine-tuning (PEFT) methods address adaptation costs by updating only a small set of weights, they may not be taking full advantage of structures encoded in pre-trained filters.
To more effectively leverage existing structures learned during pre-training, we reframe personalization as a process to reweight existing features rather than learning entirely new ones.

We present Attentive Low-Rank Filter Adaptation (Alfa) to adapt gaze models by reweighting semantic patterns in pre-trained filters.
With Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users.
Via an attention mechanism, we need only a few unlabeled samples to adjust and reweight pre-trained structures, selectively amplifying those relevant to a target user.
Alfa achieves the lowest average gaze errors across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants.
We also show that Alfa's attentive low-rank methods can be applied beyond vision, for example to diffusion-based language models.
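The idea of reweighting existing filter components rather than learning new ones can be illustrated with NumPy. This is only a sketch under assumptions, not Alfa's implementation: the abstract does not say how the filters are reshaped for SVD or how the attention scores are computed from unlabeled samples, so here the filter bank is flattened to a matrix and the attention logits are simply passed in. The function name `reweight_filters` is hypothetical.

```python
import numpy as np


def reweight_filters(weight, logits):
    """Rescale the top-r singular components of a conv filter bank.

    weight: array of shape (out_channels, in_channels, k, k).
    logits: r attention logits, one per dominant component (assumed to
            come from some attention module; not specified here).
    """
    logits = np.asarray(logits, dtype=float)
    mat = weight.reshape(weight.shape[0], -1)
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    r = logits.shape[0]
    if r > s.shape[0]:
        raise ValueError("more logits than singular components")
    # Softmax scaled by r, so uniform logits leave every component at 1.0.
    z = np.exp(logits - logits.max())
    attn = r * z / z.sum()
    scale = np.ones_like(s)
    scale[:r] = attn
    # Equivalent to u @ diag(s * scale) @ vt.
    adapted = (u * (s * scale)) @ vt
    return adapted.reshape(weight.shape)
```

Only `r` scalars change per layer in this sketch, and uniform logits reproduce the pre-trained filters, which is one way to read "reweighting" as opposed to learning new features.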

Alfa: Attentive Low-Rank Filter Adaptation for Structure-Aware Cross-Domain Personalized Gaze Estimation

Projection methods aim to reduce the dimensionality of the optimization instance, thereby improving the scalability of high-dimensional problems. Recently, Sakaue and Oki (2024) proposed a data-driven approach for linear programs (LPs), where the projection matrix is learned from observed problem instances drawn from an application-specific distribution of problems. We analyze the generalization guarantee for data-driven projection matrix learning for convex quadratic programs (QPs). Unlike in LPs, the optimal solutions of convex QPs are not confined to the vertices of the feasible polyhedron, and this complicates the analysis of the optimal value function. To overcome this challenge, we demonstrate that the solutions of convex QPs can be localized within a feasible region corresponding to a special active set, utilizing Carathéodory's theorem. Building on this observation, we propose the unrolled active set method, which models the computation of the optimal value as a Goldberg-Jerrum (GJ) algorithm with bounded complexities, thereby establishing learning guarantees. We then further extend our analysis to the input-aware setting, where we learn a mapping from QP problem instances to projection matrices.
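To make the projection idea concrete, the sketch below assumes the common inequality form of a convex QP, minimize 0.5 x^T Q x + c^T x subject to A x <= b, and substitutes x = P y with a tall projection matrix P. The abstract does not state the exact problem form, so this formulation and the helper names `project_qp` and `objective` are assumptions for illustration.

```python
import numpy as np


def project_qp(Q, c, A, b, P):
    """Build the reduced QP obtained by substituting x = P @ y.

    Original: min 0.5 x^T Q x + c^T x  s.t.  A x <= b, x in R^n.
    Reduced:  min 0.5 y^T (P^T Q P) y + (P^T c)^T y  s.t.  (A P) y <= b,
              y in R^k with k < n.
    """
    return P.T @ Q @ P, P.T @ c, A @ P, b


def objective(Q, c, x):
    """Quadratic objective 0.5 x^T Q x + c^T x."""
    return 0.5 * x @ Q @ x + c @ x
```

Any y that is feasible for the reduced problem lifts to a feasible x = P y with the same objective value, so the reduced optimum can never be better than the original one. How close it gets depends on P, which is why learning P from past instances matters.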

Provably Data-Driven Projection Method for Quadratic Programming

Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose guiding the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.
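For readers unfamiliar with the reported metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so a minority class counts as much as the majority class. The following is a minimal reference computation in plain Python; the function name `macro_f1` and the toy labels are illustrative and unrelated to C-MIND.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For `y_true = [1, 1, 1, 0]` and `y_pred = [1, 1, 0, 0]`, the per-class F1 scores are 2/3 for class 0 and 4/5 for class 1, giving a Macro-F1 of 11/15.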

Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings, where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, in which we construct BhashaKritika, a large-scale synthetic dataset comprising 540B tokens generated using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas and topics. We analyze how language choice, both in the prompt instructions and in document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora. This work contributes practical insights for advanced pretraining recipes in low-resource and script-diverse settings, particularly in the Indian context.
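One of the listed checks, n-gram repetition analysis, is simple enough to sketch. The abstract gives no definition or threshold, so the version below is an assumption: it tokenizes on whitespace and reports the fraction of n-gram positions whose n-gram occurs more than once in the document. The function name `ngram_repetition_ratio` is hypothetical.

```python
from collections import Counter


def ngram_repetition_ratio(text, n=3):
    """Fraction of n-gram positions whose n-gram appears more than once.

    Whitespace tokenization is an assumption; a real pipeline may use a
    language-specific tokenizer.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A degenerate generation that loops on the same phrase scores close to 1.0, while ordinary prose scores near 0.0, so a filter could drop documents above some chosen cutoff.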

BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Large language models (LLMs) are seeing growing adoption in multi-agent systems. In these systems, efficient failure attribution is critical for ensuring robustness and interpretability. Current LLM-based attribution methods often struggle with lengthy logs and a lack of expert knowledge. Drawing inspiration from human debugging strategies, we propose an automated failure attribution framework, Scope Delineation Before Localization, which operates in two key stages: (1) identifying the failure scope and (2) pinpointing the failure step. By decoupling failure attribution into these two stages, our approach alleviates the reasoning workload of LLMs, enabling more precise failure attribution. To support scope delineation, we further introduce two strategies: Stepwise Scope Delineation and Expertise-Assisted Scope Delineation. Experiments on the Who&When dataset validate the efficacy of our two-stage framework, demonstrating substantial improvements over prior methods (up to 24.27% on step-level accuracy).
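The two-stage decoupling can be sketched as a small control flow. This is not the authors' algorithm: the abstract does not describe how scopes are formed or judged, so the fixed-size windows and the two judge callables below are assumptions. In practice the judges would be LLM calls; here they are plain functions so the sketch runs on its own.

```python
from typing import Callable, Optional, Sequence


def attribute_failure(
    log: Sequence[str],
    scope_contains_failure: Callable[[Sequence[str]], bool],
    step_is_failure: Callable[[str], bool],
    window: int = 4,
) -> Optional[int]:
    """Return the index of the first flagged step, or None if none is found.

    Stage 1 (scope): scan the log in consecutive windows and ask the scope
    judge whether the failure lies inside each window.
    Stage 2 (localization): inspect only the steps of a flagged window.
    """
    for start in range(0, len(log), window):
        chunk = log[start:start + window]
        if not scope_contains_failure(chunk):
            continue
        for offset, step in enumerate(chunk):
            if step_is_failure(step):
                return start + offset
    return None
```

The step-level judge only ever sees a short window rather than the whole log, which is the workload reduction the two-stage split is meant to provide.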

Scope Delineation Before Localization: A Two-Stage Framework for Enhancing Failure Attribution in Multi-Agent Systems

As the pretraining-finetuning paradigm becomes dominant, it exposes new vulnerabilities in the model supply chain, particularly through sophisticated backdoor attacks. Prevailing research has largely focused on backdoors embedded during pretraining, viewing subsequent finetuning merely as a potential defense. This perspective overlooks the possibility of weaponizing the finetuning process itself, leaving a critical security blind spot. While emerging studies have explored finetuning-activated backdoors, their efficacy critically depends on white-box access to the downstream task's data distribution. This reliance on unobtainable prior knowledge severely limits their real-world feasibility. In this work, we propose the Dormant Backdoor, a novel backdoor attack robust across unknown downstream tasks by weaponizing the finetuning process itself. The key innovation is to shift the trigger from static data features to the universal dynamics of gradient-based optimization. We engineer the backdoor to be dormant and stealthy in the pretrained model, making it indistinguishable from a benign one. During finetuning, however, the very gradient updates intended for task adaptation are hijacked to progressively awaken and amplify the malicious functionality, turning the learning process against itself. Our comprehensive evaluations across multiple downstream datasets, finetuning techniques and backdoor detection schemes demonstrate that Dormant Backdoor persists reliably, revealing a new and dangerous class of process-as-trigger vulnerabilities inherent in the modern AI ecosystem.

Dormant Backdoor: Weaponizing Model Finetuning for Feasible Backdoor Attacks Against Pretrained Models

While REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise, its behavior under noisy contexts remains poorly understood. In this work, we propose a noise synthesis framework and robustness metrics to assess REAL-MT under noisy contexts. We evaluate REAL-MT systems based on Qwen series models on idiomatic translation tasks across diverse languages and resource levels under noisy contexts. Our results reveal that LLMs exhibit severe degradation in translation quality, frequently generating nonsensical translations. Although large reasoning models (LRMs) possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise. By analyzing attention patterns, we find that the model shifts its focus from essential idiomatic components to noisy contextual content, leading to erroneous translations. We investigate training-free and training-based strategies that enhance robustness but slightly degrade performance in clean contexts. These results highlight the limitations of current approaches and underscore the need for more effective methods that strike a balance between noise resistance and knowledge integration.

Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation

While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose ALEAHallu, an adversarial parametric editing framework for hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarial prefixes optimized via prompt tuning to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://anonymous.4open.science/r/knowledge_editing-C890/

Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs

Swarm UAV autonomous flight for Long-Horizon (LH) tasks is crucial for advancing the low-altitude economy. However, existing methods focus only on specific basic tasks due to dataset limitations, failing in real-world deployment for LH tasks. LH tasks are not mere concatenations of basic tasks, requiring handling long-term dependencies, maintaining persistent states, and adapting to dynamic goal shifts. This paper presents U2UData-2, the first large-scale swarm UAV autonomous flight dataset for LH tasks and the first scalable swarm UAV data online collection and algorithm closed-loop verification platform. The dataset is captured by 15 UAVs in autonomous collaborative flights for LH tasks, comprising 12 scenes, 720 traces, 120 hours, 600 seconds per trajectory, 4.32M LiDAR frames, and 12.96M RGB frames. This dataset also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. The platform supports the customization of simulators, UAVs, sensors, flight algorithms, formation modes, and LH tasks. Through a visual control window, this platform allows users to collect customized datasets through one-click deployment online and to verify algorithms by closed-loop simulation. U2UData-2 also introduces an LH task for wildlife conservation and provides comprehensive benchmarks with 9 SOTA models.

U2UData+: A Scalable Swarm UAVs Autonomous Flight Dataset for Embodied Long-horizon Tasks

Split inference (SI) enables users to access deep learning (DL) services without directly transmitting raw data. However, recent studies reveal that data reconstruction attacks (DRAs) can recover the original inputs from the smashed data sent from the client to the server, leading to significant privacy leakage. While various defenses have been proposed, they often result in substantial utility degradation, particularly when the client-side model is shallow. We identify a key cause of this trade-off: existing defenses apply excessive perturbation to redundant information in the smashed data. To address this issue in computer vision tasks, we propose InfoDecom, a defense framework that first decomposes and removes redundant information and then injects noise calibrated to provide theoretically guaranteed privacy. Experiments demonstrate that InfoDecom achieves a superior utility-privacy trade-off compared to existing baselines.
