Customized text-to-video generation (CTVG) has recently made significant progress in generating tailored videos from user-specific text. However, existing CTVG methods unrealistically assume that personalized concepts remain static and never expand over time. They also struggle with catastrophic forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To address these challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model that continuously learns new concepts for video generation across diverse text-to-video tasks while tackling both catastrophic forgetting and concept neglect. Specifically, to address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy: the former captures the unique characteristics and identities of old concepts during training, while the latter combines the subject and motion adapters of all old concepts according to their relevance to the current task during testing (a minimal sketch follows below). Furthermore, to tackle concept neglect, we develop a controllable conditional synthesis mechanism that enhances regional features and aligns video contexts with user conditions by incorporating layer-specific region attention and attention-guided noise estimation. Experimental comparisons demonstrate that our CCVD model outperforms existing CTVG models.
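To make the task-aware concept aggregation strategy concrete, the sketch below shows one plausible reading of it in PyTorch: each old concept keeps its own low-rank adapter, and at test time the adapters' outputs are mixed with weights given by each concept's relevance to the current prompt. All names here (ConceptAdapter, cosine-similarity relevance, softmax mixing) are illustrative assumptions under this reading, not the paper's exact formulation.

# A minimal sketch of test-time adapter aggregation, assuming low-rank
# (LoRA-style) adapters and cosine-similarity relevance scores; these
# design choices are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F


class ConceptAdapter(torch.nn.Module):
    """Low-rank adapter for one learned concept (hypothetical structure)."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = torch.nn.Linear(dim, rank, bias=False)
        self.up = torch.nn.Linear(rank, dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(h))


def aggregate_adapters(
    hidden: torch.Tensor,            # (batch, seq, dim) base-model features
    prompt_emb: torch.Tensor,        # (dim,) embedding of the test prompt
    concept_embs: torch.Tensor,      # (num_concepts, dim) one per old concept
    adapters: list[ConceptAdapter],  # one adapter per old concept
) -> torch.Tensor:
    # Relevance of each stored concept to the current prompt.
    scores = F.cosine_similarity(concept_embs, prompt_emb.unsqueeze(0), dim=-1)
    weights = F.softmax(scores, dim=0)  # normalize into mixing weights
    # Relevance-weighted sum of adapter outputs, added residually.
    delta = sum(w * adapter(hidden) for w, adapter in zip(weights, adapters))
    return hidden + delta


# Usage: mix three previously learned concept adapters for a new prompt.
dim = 64
adapters = [ConceptAdapter(dim) for _ in range(3)]
hidden = torch.randn(1, 16, dim)
prompt_emb = torch.randn(dim)
concept_embs = torch.randn(3, dim)
out = aggregate_adapters(hidden, prompt_emb, concept_embs, adapters)
print(out.shape)  # torch.Size([1, 16, 64])

In this reading, the softmax keeps the combination a convex mixture, so adapters for old concepts unrelated to the current prompt contribute little to the generated video.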