Singapore

Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. Video introduces additional redundancy, abrupt variations, and a domain gap between language and vision tokens during generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by the observation in VQ-VAE-2, we propose HiTVideo, a novel approach for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information with higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70% compared to previous tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of highly compressed semantic tokens in text-to-video tasks. HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks, striving for higher compression ratios, improved token quality, and simpler LLM modeling under language guidance, offering a scalable and promising framework for advancing text-to-video generation.
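To make the bits-per-pixel comparison concrete, here is a minimal sketch of how bpp can be computed for a discrete tokenizer: the bits spent on tokens (token count times log2 of the codebook size) divided by the number of pixels in the clip. The token grids, codebook sizes, and resolution below are illustrative assumptions, not the configuration reported for HiTVideo.

```python
import math


def bits_per_pixel(tokens_per_layer, codebook_sizes, frames, height, width):
    """Bits spent on discrete tokens divided by the number of pixels in the clip."""
    total_bits = sum(
        n_tokens * math.log2(size)
        for n_tokens, size in zip(tokens_per_layer, codebook_sizes)
    )
    return total_bits / (frames * height * width)


# Hypothetical single-layer baseline: a 16 x 32 x 32 token grid, 8192-entry codebook.
baseline = bits_per_pixel([16 * 32 * 32], [8192], frames=64, height=256, width=256)

# Hypothetical two-layer hierarchy: a small semantic layer plus a larger detail layer.
hierarchical = bits_per_pixel(
    [8 * 8 * 8, 16 * 16 * 16], [8192, 8192], frames=64, height=256, width=256
)

# Relative bpp saving of the hierarchy over the baseline (made-up numbers).
reduction = 1 - hierarchical / baseline
print(f"baseline bpp={baseline:.5f}, hierarchical bpp={hierarchical:.5f}, "
      f"reduction={reduction:.1%}")
```

The sketch only shows where savings of this kind come from: fewer tokens per clip at the same codebook size directly lower the bit budget.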

AAAI 2026

HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

large vision models

multi-modal vision

image & video synthesis

computational photography

language and vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. In detail, we introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.
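The tree-structured memory described above can be sketched as follows. This is a minimal illustration that assumes a three-level tree (root, topic, statement); the class names, method names, and example utterances are hypothetical and are not taken from the AgentMental implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryNode:
    kind: str                      # "root", "topic", or "statement"
    content: str
    turn: Optional[int] = None     # interaction turn, set for statements only
    children: List["MemoryNode"] = field(default_factory=list)


class TreeMemory:
    """Root holds basic user information; topics group statements by symptom category."""

    def __init__(self, user_info: str):
        self.root = MemoryNode("root", user_info)

    def _topic(self, name: str) -> MemoryNode:
        # Reuse the topic node if it exists, otherwise create it under the root.
        for node in self.root.children:
            if node.content == name:
                return node
        node = MemoryNode("topic", name)
        self.root.children.append(node)
        return node

    def add_statement(self, topic: str, turn: int, text: str) -> None:
        self._topic(topic).children.append(MemoryNode("statement", text, turn))

    def uncovered(self, topics: List[str]) -> List[str]:
        # Topics with no statements yet, so a questioner can skip covered ones.
        covered = {n.content for n in self.root.children if n.children}
        return [t for t in topics if t not in covered]


memory = TreeMemory("hypothetical user profile")
memory.add_statement("sleep", 1, "I wake up several times a night.")
memory.add_statement("sleep", 2, "It has been like this for a month.")
memory.add_statement("appetite", 3, "I eat less than usual.")
remaining = memory.uncovered(["sleep", "appetite", "concentration"])
print(remaining)
```

Checking `uncovered` before generating the next question is one simple way such a memory could reduce redundant questioning.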

AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

Accurately recognizing distracted driving activities in real-world scenarios is essential for improving road and pedestrian safety. However, existing approaches are prone to attending to irrelevant scene context and are susceptible to interference from redundant frames, compromising their robustness in complex driving environments. To overcome these limitations, we propose DualScope, a novel framework that captures behaviorally critical information from both spatial and temporal perspectives.
In the spatial domain, we introduce a Synergistic Behavior-Centric Distillation mechanism that leverages two key information sources: (1) position-aware knowledge derived from the SAM model, which enhances the perception of critical regions and their semantic interaction structures; and (2) fine-grained visual details obtained from cropped key regions, which improve the model's ability to capture detailed patterns within behavior-relevant areas.
In the temporal domain, we present the Saliency-Aware Fine-to-Coarse Temporal Modeling module, comprising three components: a Fine-Grained Motion Encoder for capturing local inter-frame dependencies; a Dynamic Difference Extractor for generating salient motion dynamics; and a Saliency-Aware Temporal Pyramid Mamba for integrating these representations to enable multi-scale temporal modeling. This design effectively captures both short-term motions and long-term behavioral patterns. Furthermore, incorporating salient dynamics enhances the model's focus on significant behavioral variations. Extensive experiments on seven publicly available DDAR datasets demonstrate that DualScope consistently outperforms state-of-the-art methods, validating its effectiveness in capturing behavioral cues across spatial and temporal dimensions.

DualScope: Capturing Critical Spatial and Temporal Cues for Distracted Driving Activity Recognition

Federated learning has drawn widespread interest from researchers, yet the data heterogeneity across edge clients remains a key challenge, often degrading model performance. Existing methods improve model compatibility with data heterogeneity through model splitting and knowledge distillation. However, they neglect the insufficient communication bandwidth and computing power on the client, failing to strike an effective balance between addressing data heterogeneity and accommodating limited client resources. To tackle this limitation, we propose a personalized federated learning method based on cosine sparsification parameter packing and dual-weighted aggregation (FedCSPACK), which effectively leverages the limited client resources and reduces the impact of data heterogeneity on model performance. In FedCSPACK, the client packages model parameters and selects the most contributing parameter packages for sharing based on cosine similarity, effectively reducing bandwidth requirements. The client then generates a mask matrix anchored to the shared parameter package to improve the alignment and aggregation efficiency of sparse updates on the server. Furthermore, directional and distribution distance weights are embedded in the mask to implement a weight-guided aggregation mechanism, enhancing the robustness and generalization performance of the global model. Extensive experiments across four datasets using ten state-of-the-art methods demonstrate that FedCSPACK effectively improves communication and computational efficiency while maintaining high model accuracy.

Tackling Resource-Constrained and Data-Heterogeneity in Federated Learning with Double-Weight Sparse Pack

The learnware paradigm aims to help users solve new tasks by reusing existing well-trained models instead of starting from scratch, where a learnware consists of a model and a specification describing its capabilities. Numerous learnwares are accommodated by the learnware dock system. Note that a new task submitted by a user has very likely never been tackled before, so there is no model that can be directly applied to the user's task. In this paper, we focus on tabular tasks and propose a method for reusing tabular learnwares for classification tasks with significantly different feature and label spaces, exploiting the potential of numerous existing specialized tabular models developed for various tasks. We find that tabular learnwares that seem semantically irrelevant can sometimes benefit new user tasks. The proposed method relies solely on model-predicted probabilities and does not require gradient information, making it applicable to a wide range of tabular models. Experiments demonstrate that tabular learnwares can be reused beyond their original purpose across heterogeneous tasks.

Tabular Learnwares Can Be Repurposed for Seemingly Irrelevant New Tasks

Recently, time series prediction models based on deep neural networks have demonstrated excellent capabilities in capturing the hidden relationships within time steps.
However, because these models directly output a scalar value at each time step, it is challenging to account for the uncertainty associated with their predictions.
To address this challenge, we propose a novel model that directly constructs a discrete probability distribution at each step instead of a scalar.
The regression output at each time step is derived by computing the expectation of the predictive distribution over a predefined support set.
To mitigate prediction anomalies, a dual-branch architecture is introduced with interleaved support sets, augmented by coarse temporal-scale branches for long-term trend forecasting.
Outputs from another branch are treated as auxiliary signals to impose self-supervised consistency constraints on the current branch's prediction.
Extensive experiments on multiple real-world datasets demonstrate the superior performance of the proposed model.
All source code will be released upon acceptance.
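The per-step readout described above (an expectation of a discrete predictive distribution over a predefined support set) can be sketched as follows. This is a minimal illustration, assuming softmax-normalized logits, two support sets offset by half a bin as one possible reading of "interleaved", and a squared difference as the consistency term; none of these details are confirmed by the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits, support):
    """Regression output: expectation of the discrete distribution over the support."""
    probs = softmax(logits)
    return sum(p * s for p, s in zip(probs, support))


def make_support(low, high, bins, offset=0.0):
    """Evenly spaced support points, optionally shifted by `offset`."""
    step = (high - low) / (bins - 1)
    return [low + offset + i * step for i in range(bins)]


# Two hypothetical branches whose supports are shifted by half a bin width.
support_a = make_support(-2.0, 2.0, 5)
support_b = make_support(-2.0, 2.0, 5, offset=0.5)

logits_a = [0.0, 0.0, 0.0, 0.0, 0.0]   # uniform distribution, for illustration
logits_b = [0.0, 0.0, 0.0, 0.0, 0.0]

pred_a = expected_value(logits_a, support_a)
pred_b = expected_value(logits_b, support_b)

# Assumed consistency penalty: the two branches should agree on the scalar output.
consistency = (pred_a - pred_b) ** 2
print(pred_a, pred_b, consistency)
```

Because the distribution itself is available at every step, its spread can also be read off as an uncertainty estimate, which a scalar output cannot provide.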

Time Series Forecasting via Direct Per-Step Probability Distribution Modeling

Reconstructing the fine-grained geometry of a clothed human from a single-view image is a challenging task, particularly in accurately recovering complex shapes and generating clothing details. To address these limitations, we propose a novel approach named HumanPro, which estimates high-quality human normals via a generative model and progressively deforms a parametric body into the final clothed human mesh guided by these normals. First, we propose a geometry-aware latent diffusion model with a normal enhancer to estimate high-quality human normals from four views. Then, we propose a progressive mesh optimization consisting of shape-aware deformation alignment and global-to-patch detail refinement for human mesh reconstruction. The shape-aware deformation alignment applies image morphing to learn the shape-level gap of normals, addressing large-scale deformation of complex clothes. It can recover the overall silhouette of a clothed human and serves as an initialization for the global-to-patch detail refinement. Our detail refinement combines global and patch-wise optimization strategies to iteratively produce the clothed human mesh by minimizing the pixel-level difference of normals. This strategy effectively recovers fine-grained details while avoiding local minima. Extensive experiments demonstrate that HumanPro can deal with various challenging scenarios and outperforms state-of-the-art methods.

HumanPro: Single-view 3D Clothed Human Reconstruction with Progressive Normal Guidance

Once trained, neural networks memorize information in diffusely encoded parameters, making it difficult to forget in support of the right to be forgotten. Unlearning aims to remove the influence of data, with performance measured against a retrained model that excludes the data. However, the behavior of gold-standard retraining remains underexplored. We compare original and retrained models and observe that most prediction changes occur in peripheral samples near decision boundaries. Consequently, we propose PeriUn, a selective strategy that unlearns only peripheral samples to mimic retrained model behavior with minimal disruption, unlike prior works that remove the entire request. Combined with the Random Label-based method, PeriUn significantly improves both generalization and privacy metrics. Specifically, on TinyImageNet with VGG16, PeriUn increases the Tug-of-War score by 22 points compared to the strongest baseline. In addition, the MIA gap score surpasses the state-of-the-art method, improving by 8.7 points after applying PeriUn. Further analyses confirm that PeriUn better preserves the feature space and aligns closely with the retrained model.

PeriUn: Enhancing Unlearning by Selectively Forgetting Peripheral Samples

The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret them as continuous density fields to be optimized and introduce three differentiable, physically constrained, simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method can improve stability while maintaining high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and can be self-supporting.

DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

Existing stereotype auditing methods for large language models (LLMs) typically rely on isolated rating schemes or task-specific probes, lacking a theoretical grounding and failing to reveal the internal organization beyond surface-level output patterns. In this paper, we introduce SCoUT (Stereotype Content oriented Utility structure via Thurstonian modeling), a closed-loop framework that structurally models, explicitly probes, and causally intervenes on stereotype dimensions (warmth and competence) in LLMs. SCoUT first reconstructs a global stereotype utility structure aligned with Stereotype Content Model theory via Thurstonian comparative judgments. Across multiple open-source LLMs, this modeling achieves high pairwise-preference prediction accuracy ($\ge0.90$ on larger-scale models) and exhibits strong cross-model consistency. Probing internal attention mechanisms localizes this structure to specific heads (Spearman’s $\rho$ up to 0.83 for warmth and 0.90 for competence) and surfaces a salient asymmetry between warmth and competence. Further, targeted inference-time activation modifications on these dimension-sensitive heads consistently steer model outputs along the intended axes. By bridging behavioral measurement with internal representation and controllable steering, SCoUT offers an end-to-end framework that uncovers and interprets the latent structure of stereotypes, advancing stereotype auditing from surface detection to structural analysis.
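For readers unfamiliar with Thurstonian comparative judgment, the sketch below shows the standard form of the model: each item has a latent utility with Gaussian noise, and the probability that item i is preferred over item j is the normal CDF of the scaled utility difference. The utilities, item names, and judgments are made up for illustration, and the exact parameterization used by SCoUT is not stated in the abstract.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def preference_probability(mu_i, mu_j, sigma_i=1.0, sigma_j=1.0):
    """P(item i preferred over item j) under independent Gaussian utilities."""
    return normal_cdf((mu_i - mu_j) / math.sqrt(sigma_i ** 2 + sigma_j ** 2))


def pairwise_accuracy(utilities, judgments):
    """Fraction of observed (winner, loser) pairs that the utilities predict correctly."""
    hits = sum(
        1
        for winner, loser in judgments
        if preference_probability(utilities[winner], utilities[loser]) > 0.5
    )
    return hits / len(judgments)


# Hypothetical fitted utilities on one dimension (for example, warmth).
utilities = {"item_a": 1.2, "item_b": 0.3, "item_c": -0.5}

# Hypothetical observed judgments as (winner, loser) pairs.
judgments = [
    ("item_a", "item_b"),
    ("item_a", "item_c"),
    ("item_c", "item_b"),
    ("item_b", "item_c"),
]

accuracy = pairwise_accuracy(utilities, judgments)
print(accuracy)
```

Pairwise-preference prediction accuracy of the kind quoted above can be read as this fraction, computed on held-out comparisons.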

SCoUT: A Framework for Structured Stereotype Analysis in Language Models

High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches that rely on relevance-based aggregation are computationally inefficient. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on the HELMET and RULER benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language model training.
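The retrieve-and-concatenate step can be sketched with a small, self-contained BM25 scorer. This is an illustration under stated assumptions: whitespace-separated words stand in for tokenizer tokens, `k1` and `b` use common default values, and the function names and toy corpus are hypothetical rather than taken from LiteLong.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query (whitespace tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_sample(topic, docs, token_budget):
    """Concatenate the best-matching documents until the token budget is reached."""
    scores = bm25_scores(topic, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        length = len(docs[i].split())
        # Skip documents that do not match at all or that would overflow the budget.
        if scores[i] <= 0 or used + length > token_budget:
            continue
        picked.append(docs[i])
        used += length
    return "\n\n".join(picked)


corpus = [
    "gardening soil compost tips",
    "compost bins for small gardens",
    "stock market news today",
]
sample = build_sample("compost gardening", corpus, token_budget=10)
print(sample)
```

In the setting described above the budget would be on the order of 128K tokens per sample, and the topic strings would come from the debate stage rather than being written by hand.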
