The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment.
In this paper, we devise a ``\textbf{\textit{filter-correlate-compress}}'' framework that accelerates MLLMs by systematically reducing multimodal context length during prefilling. The framework first implements \textbf{\textit{FiCoCo-V}}, a training-free method operating within the vision encoder.
It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately \textit{filter} out redundant visual tokens.
To mitigate information loss, the framework introduces a correlation-based information recycling mechanism: preserved tokens selectively recycle information from \textit{correlate}d discarded tokens via a self-preserving \textit{compress}ion, preventing the dilution of their own core content. The framework's \textbf{\textit{FiCoCo-L}} variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the \textit{FiCoCo} series effectively accelerates a range of MLLMs, achieving up to \textbf{14.7×} FLOPs reduction with \textbf{93.6\%} performance retention. Our methods consistently outperform state-of-the-art training-free approaches, demonstrating effectiveness and generalizability across model architectures, sizes, and tasks without requiring retraining. \textit{Code is available in the supplementary materials.}
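The filter-correlate-compress pipeline can be illustrated with a minimal sketch. This is an assumption-laden toy implementation, not the paper's exact formulation: the redundancy metric here is mean cosine similarity to other tokens, correlation is nearest-kept-token matching, and the self-preservation weight `alpha` is an illustrative stand-in for the paper's integrated metric and recycling weights.

```python
import numpy as np

def ficoco_style_reduce(tokens, keep_ratio=0.5, alpha=0.9):
    """Toy filter-correlate-compress token reduction (illustrative only).

    tokens: (N, D) array of visual token features.
    keep_ratio: fraction of tokens to preserve.
    alpha: assumed self-preservation weight for compression.
    """
    n, _ = tokens.shape
    # Pairwise cosine similarity between tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)

    # FILTER: treat high mean similarity to all other tokens as redundancy,
    # and keep the least redundant tokens (a stand-in for the paper's metric).
    redundancy = sim.mean(axis=1)
    n_keep = max(1, int(n * keep_ratio))
    order = np.argsort(redundancy)
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]

    # CORRELATE: assign each discarded token to its most similar kept token.
    assign = keep_idx[np.argmax(sim[np.ix_(drop_idx, keep_idx)], axis=1)]

    # COMPRESS: blend recycled information into each kept token, with the
    # kept token's own content dominating (self-preserving compression).
    out = tokens[keep_idx].copy()
    for pos, ki in enumerate(keep_idx):
        mates = drop_idx[assign == ki]
        if len(mates) > 0:
            out[pos] = alpha * out[pos] + (1 - alpha) * tokens[mates].mean(axis=0)
    return out, keep_idx
```

In practice such a step would run inside the vision encoder (FiCoCo-V) or the LLM decoder (FiCoCo-L, where text-conditioned scores would replace the purely visual redundancy metric); the sketch only conveys the three-stage structure.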
