Video generation with Large Language Models (LLMs) has shown promising potential, leveraging the extensive LLM infrastructure to provide a unified framework for multimodal understanding and content generation. However, these methods face critical challenges, namely token redundancy and the inefficiency of long sequences, which constrain their performance and efficiency relative to diffusion-based approaches. In this study, we investigate the impact of token redundancy in LLM-based video generation through information-theoretic analysis and propose Vision Representation Compression (VRC), a novel framework designed to achieve better performance and efficiency with fewer video token representations. VRC introduces a learnable representation compressor and decompressor that compress video token representations, enabling autoregressive next-sequence prediction in a compact latent space. Our approach reduces redundancy, shortens token sequences, and improves the model's ability to capture underlying video structures. Experiments demonstrate that VRC reduces token sequence length by a factor of 4, achieving a 9-14x inference speedup while maintaining performance comparable to state-of-the-art video generation models. VRC not only accelerates inference but also significantly reduces memory requirements during both model training and inference.
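To illustrate the core idea, here is a minimal sketch of the compress/decompress round trip around the token sequence. The abstract does not describe VRC's actual architecture, so the averaging compressor, the embedding dimension, and the helper names below are hypothetical stand-ins; only the 4x compression ratio comes from the stated results, and in the real framework both modules would be learnable and trained end-to-end with the LLM.

```python
# Hypothetical illustration of sequence-length compression; not the actual VRC modules.
RATIO = 4   # matches the paper's reported 4x sequence-length reduction
DIM = 8     # assumed embedding dimension, for illustration only

def compress(tokens):
    """Merge every RATIO consecutive token embeddings into one latent token.
    (Here a simple average; VRC's compressor is learnable.)"""
    latents = []
    for i in range(0, len(tokens), RATIO):
        group = tokens[i:i + RATIO]
        latents.append([sum(vals) / len(group) for vals in zip(*group)])
    return latents

def decompress(latents):
    """Expand each latent token back into RATIO token embeddings.
    (Here a simple copy; VRC's decompressor is learnable.)"""
    return [vec[:] for vec in latents for _ in range(RATIO)]

tokens = [[float(i)] * DIM for i in range(16)]  # 16 dummy video token embeddings
latents = compress(tokens)                      # autoregressive prediction would run here
restored = decompress(latents)
print(len(tokens), len(latents), len(restored))  # 16 4 16
```

The autoregressive model then predicts over the 4x shorter latent sequence, which is where the reported inference speedup and memory savings come from, since attention cost grows with sequence length.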
