Singapore

Training-free video understanding methods leverage the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating videos as sequences of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, particularly those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension.
To address these significant challenges, we propose KTV, a novel two-stage framework for efficient and effective training-free video understanding.
In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM.
Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, e.g., only 504 tokens for a 60-minute video with 10,800 frames, achieving 44.8% accuracy on the MLVU-Test benchmark. Notably, KTV also exceeds several training-based approaches on certain benchmarks. The code is released anonymously in the supplementary materials.
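The first stage described above (question-agnostic keyframe selection by clustering frame-level features) can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm or feature extractor KTV uses, so plain k-means with evenly spaced initial centroids is an assumption here, and `select_keyframes` and its parameters are hypothetical names chosen for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_keyframes(features, k, iterations=20):
    """Pick up to k representative frame indices by k-means over frame features.

    features: list of per-frame feature vectors (hypothetical input; in
    practice these would come from a vision encoder).
    Returns the sorted indices of the frames closest to each cluster centroid.
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    dim = len(features[0])

    # Deterministic initialisation (an assumption): evenly spaced frames.
    centroids = [list(features[(i * n) // k]) for i in range(k)]
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: attach each frame to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, feat in enumerate(features):
            nearest = min(range(k), key=lambda c: squared_distance(feat, centroids[c]))
            clusters[nearest].append(idx)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
                continue
            mean = [sum(features[i][d] for i in members) / len(members) for d in range(dim)]
            new_centroids.append(mean)

        if new_centroids == centroids:
            break
        centroids = new_centroids

    # One keyframe per non-empty cluster: the member closest to its centroid.
    keyframes = set()
    for c, members in enumerate(clusters):
        if members:
            keyframes.add(min(members, key=lambda i: squared_distance(features[i], centroids[c])))
    return sorted(keyframes)


# Toy example: two visually distinct "scenes" of three frames each.
toy_frames = [[0, 0], [1, 0], [2, 0], [10, 10], [11, 10], [12, 10]]
toy_keyframes = select_keyframes(toy_frames, 2)
```

Because the selection depends only on visual features, the same keyframes can be reused for any question about the video, which is what "question-agnostic" implies in the description above.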

AAAI 2026

KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs

vision pruning

keyframes

training-free video llm

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are extended from conventional graph neural networks and rely on static structures or dense attention, which limits flexibility and the expressiveness of the learned node embeddings. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We evaluate DiP on link prediction and node classification tasks and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.

Multimodal Graph Representation Learning with Dynamic Information Pathways

Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.

Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Chinese opera is celebrated for preserving classical art. However, early filming equipment limitations have degraded videos of last-century performances by renowned artists (e.g., low frame rates and resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high-frequency details, and existing STVSR methods lack global modeling capabilities, which compromises visual quality when handling opera’s characteristic large motions. To address these challenges, we introduce a large-scale Chinese Opera Video Clip (COVC) dataset and propose the Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM) for motion modeling through a multiscale alternating scanning mechanism, the Multiscale Synergistic Mamba Module (MSMM) for alignment across different sequence lengths, and the MambaVR block, which resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR significantly outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR. The dataset and code will be publicly released.

MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution

Vision-Language-Action (VLA) models often struggle with generalization to real-world scenarios due to the mismatch between observation and action spaces. While training data comes from diverse camera perspectives, the models predict end-effector poses in the robot base coordinate system, leading to inconsistencies. To address this issue, we propose an Observation-Centric VLA (OC-VLA) framework, which directly grounds action predictions in the camera's observation space. By using the camera's extrinsic matrix to transform end-effector poses from the robot frame to the camera frame, our approach unifies prediction targets across different viewpoints. This simple, plug-and-play method ensures consistent alignment between perception and action, improving model robustness to camera viewpoint variations. Our method offers a straightforward solution that can be easily integrated into existing VLA models without significant architectural changes. Extensive experiments on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA achieves better convergence, improves task success rates, and enhances generalization across camera viewpoints. The code will be publicly available.
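The frame change at the heart of this approach can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the extrinsic matrix is assumed to map robot-base coordinates to camera coordinates as a 4x4 homogeneous transform, and `base_to_camera` and the numeric values are hypothetical.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def base_to_camera(extrinsic_cam_from_base, pose_base_ee):
    """Re-express an end-effector pose in the camera frame.

    Assumption: extrinsic_cam_from_base maps robot-base coordinates to camera
    coordinates (p_cam = R * p_base + t), stored as a 4x4 homogeneous matrix.
    """
    return matmul4(extrinsic_cam_from_base, pose_base_ee)


# Hypothetical extrinsic: a 90-degree rotation about z plus a shift of 1 along z.
extrinsic = [
    [0.0, -1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical end-effector pose in the base frame: identity orientation,
# positioned at (0.5, 0.0, 0.2).
pose_in_base = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]
pose_in_camera = base_to_camera(extrinsic, pose_in_base)
# The last column of pose_in_camera holds the end-effector position as seen
# from the camera; the upper-left 3x3 block holds its orientation.
```

With a different camera, only `extrinsic` changes, so the same physical pose yields a different camera-frame target. That is the sense in which prediction targets are tied to the viewpoint of the observation.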

Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy

In recent years, with the rapid development of large language models (LLMs), LLM-based agents have achieved remarkable progress across a wide range of tasks. However, reasoning inconsistencies in LLMs still significantly limit the performance of agents in complex decision-making scenarios. Cognitive science research suggests that individuals can benefit from observing others' explicit thinking processes to improve their strategy-making. Inspired by this mechanism, we propose Reference-guided Reasoning with meta-cognition (RefRea), a novel approach that enhances decision-making by introducing a reference language model to guide and calibrate the reasoning model's actions. RefRea enhances reasoning accuracy and stability by integrating a reference model and a meta-cognition module. The reference model relies solely on validated meta-cognition for consistent guidance, while the reasoning model interacts with the environment using both validated and exploratory meta-cognition. Guidance is provided by comparing the action similarity between the reference and reasoning models. This process is supported by the meta-cognition module, which generates summary knowledge by reflecting on action history and environmental feedback, leading to more adaptive and reliable behavior. We evaluate our algorithm in the text-based reasoning environment ScienceWorld. Experimental results demonstrate that RefRea outperforms state-of-the-art methods. Comprehensive ablation studies further highlight the effectiveness of both the reference model and the meta-cognition module.

RefRea: Reference-Guided Reasoning with Meta-Cognition for Accurate Language Model Agents

Achieving zero-shot adversarial robustness without sacrificing generalization remains challenging for foundation models such as CLIP, especially under large adversarial perturbations. Through empirical analyses, we identify three critical yet overlooked issues: (1) Logit margins exhibit a stable offset between small and large adversarial perturbations, suggesting that explicitly adjusting margins could improve robustness against unseen large perturbations. (2) A significant negative correlation exists between logit margin and inter-class semantic similarity, indicating that semantic structures are insufficiently leveraged by existing methods. (3) Existing methods for adjusting text embeddings disrupt the intrinsic semantic consistency established by pre-trained models, undermining generalization capability. Motivated by these findings, we propose a novel Text-Image Mutual Awareness (TIMA) framework, including a Text-Aware Image (TAI) tuning module with an Adaptive Semantic-Aware Margin (ASAM) to explicitly calibrate logit margins, and an Image-Aware Text (IAT) tuning module with Semantic Consistent Minimum Hyperspherical Energy (SC-MHE) to preserve semantic consistency. Comprehensive experiments validate that TIMA significantly outperforms existing approaches by effectively addressing the identified limitations.

TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability

The continual forgetting task aims to continuously remove multiple target knowledge subsets from pre-trained models while maintaining the integrity of the remaining knowledge. Existing methods suffer from both incomplete forgetting of target knowledge and unintended forgetting of remaining knowledge that is hard to distinguish from it. To address these challenges, we propose forgetting knowledge localization and isolation for continual forgetting in pre-trained vision models, which precisely forgets target knowledge while reducing over-forgetting of remaining knowledge. To achieve precise forgetting, we first propose forgetting knowledge layer localization to identify the layers in the model that are most related to the forgetting knowledge. Then, we design forgetting knowledge parameter isolation to isolate the parameters sensitive to the forgetting knowledge in these selected layers, mitigating over-forgetting of remaining knowledge. Finally, we fine-tune these isolated parameters and freeze the remaining parameters to achieve efficient forgetting while maintaining high performance on retained datasets. Extensive experimental results demonstrate that our method achieves superior performance over state-of-the-art methods across multiple continual forgetting tasks. We will release the source code and pre-trained models.

Forgetting Knowledge Localization and Isolation for Continual Forgetting of Pre-trained Vision Models

Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods aim to achieve greater efficiency by consolidating several experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared $U$-matrices while enabling effective merging of the expert-specific $V$ components. Specifically, Sub-MoE consists of two innovative stages: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first performs Experts Union Decomposition to derive the shared $U$-matrix across experts in the same group, then applies frequency-based merging for individual $V$-matrices, and completes expert reconstruction using the merged $V$-matrix. In this way, we align and fuse experts in a shared subspace. Additionally, the framework can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5/3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%/86% of original performance with 25%/50% expert reduction on Mixtral-8×7B in zero-shot benchmarks. Code is available in the supplementary materials.

Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging

Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. Existing methods often rely on paired data or fail to model dynamic illumination and blur characteristics, leading to poor generalization. To tackle this, we propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from a vision-language model (VLM). Specifically, to supply informative conditioning cues for VAR models, we deploy an adaptive curve estimation scheme to modulate the diverse illumination based on VLM-derived visibility scores.
In addition, we integrate dynamic and spatial-frequency-aware Rotary Positional Encodings (SF-RoPE) into VAR to enhance its ability to model structures degraded by blur. Furthermore, we propose a recursive phase-domain modulation strategy that mitigates blur-induced artifacts in the phase domain via bounded iterative refinement guided by VLM-assessed blur scores. Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets and downstream detection tasks in dark conditions. Code will be released upon acceptance.

Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation

Multi-modal entity alignment aims to identify equivalent entities between two multi-modal knowledge graphs by integrating multi-modal data, such as images and text, to enrich the semantic representations of entities. However, existing methods often overlook the structural contextual information within each modality, making them vulnerable to interference from shallow features. To address these challenges, we propose MyGram, a modality-aware graph transformer with global distribution for multi-modal entity alignment. Specifically, we develop a modality diffusion learning module to capture deep structural contextual information within modalities and enable fine-grained multi-modal fusion. In addition, we introduce a Gram Loss that acts as a regularization constraint by minimizing the volume of a 4-dimensional parallelotope formed by multi-modal features, thereby achieving global distribution consistency across modalities. We conduct experiments on five public datasets. Results show that MyGram outperforms baseline models, achieving a 4.05% improvement in Hits@1 on FBDB15K, a 10.25% improvement on FBYG15K, and a 3.75% improvement on DBP15K.
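The volume of a parallelotope spanned by a set of vectors equals the square root of the determinant of their Gram matrix, which is presumably the quantity behind the Gram Loss named above. The abstract does not give the exact loss, so the sketch below only computes that volume for four feature vectors; treating them as one vector per modality, and the function names, are assumptions made for illustration.

```python
import math


def gram_matrix(vectors):
    """Pairwise inner products: G[i][j] = <v_i, v_j>."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: the vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def parallelotope_volume(vectors):
    """Volume spanned by the vectors: sqrt(det(G)), clamped at zero."""
    return math.sqrt(max(determinant(gram_matrix(vectors)), 0.0))


# Four mutually orthogonal unit vectors span a unit-volume parallelotope.
orthogonal = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
# Four identical vectors are perfectly aligned, so the volume collapses to zero.
aligned = [[1, 0, 0, 0, 0]] * 4

volume_orthogonal = parallelotope_volume(orthogonal)
volume_aligned = parallelotope_volume(aligned)
```

The two toy cases show why minimizing this volume encourages agreement: the more closely the feature vectors point in the same direction, the smaller the spanned volume becomes.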
