Singapore

Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). Token pruning offers a path to acceleration, but we find that current methods, while adequate for general understanding, fail catastrophically on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D²Pruner, a framework that rectifies these issues by combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important token still available while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D²Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63.53% improvement over existing methods. The code will be released on GitHub.
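The two-stage selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prune_tokens`, its parameters, the edge rule (two tokens are neighbors when they lie within `radius` grid cells of each other and their cosine similarity is at least `sim_thresh`), and the assumption that the debiased scores are already computed are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def prune_tokens(scores, positions, feats, n_pivots, budget,
                 radius=1, sim_thresh=0.9):
    """Return the sorted indices of the tokens to keep.

    scores    : importance per token, ASSUMED to be debiased already
    positions : (row, col) grid position per token
    feats     : feature vector per token
    """
    # Stage 1: the highest-scoring tokens are kept unconditionally as pivots.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pivots, rest = order[:n_pivots], order[n_pivots:]

    def adjacent(i, j):
        # Hypothetical hybrid edge: spatially close AND semantically similar.
        (r1, c1), (r2, c2) = positions[i], positions[j]
        near = max(abs(r1 - r2), abs(c1 - c2)) <= radius
        return near and cosine(feats[i], feats[j]) >= sim_thresh

    # Stage 2: greedy independent-set selection over the remaining tokens.
    # Visit them by descending score; keep each one that is still available
    # and mark its graph neighbors as removed.
    kept, removed = list(pivots), set()
    for i in rest:
        if len(kept) >= budget:
            break
        if i in removed:
            continue
        kept.append(i)
        for j in rest:
            if j != i and j not in removed and adjacent(i, j):
                removed.add(j)
    return sorted(kept)


# Toy 2x3 token grid: tokens 0, 1, 3 share one feature; tokens 2, 4, 5 another.
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.6]
positions = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
         [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(prune_tokens(scores, positions, feats, n_pivots=1, budget=3))  # [0, 1, 5]
```

In this toy case, keeping the top three scores alone would select tokens 0, 1, and 3, which all carry the same feature. The greedy pass drops token 3 as a neighbor of token 1 and keeps token 5 instead.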

AAAI 2026

D²Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

efficient ml

green ai

multimodal learning


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Graph augmentation is a cornerstone of effective graph contrastive learning, yet existing methods often rely on random or heuristically designed perturbations, which may distort latent semantics and impair representation quality. In this work, we argue that semantic consistency can be effectively approximated by low-frequency components in the spectral domain, offering a principled proxy for guiding augmentation. Based on this insight, we propose Frequency-Aware Graph Contrastive Learning (FA-GCL), a novel framework that explicitly preserves low-frequency signals while selectively perturbing high-frequency components. By aligning augmentation with frequency-aware decomposition, FA-GCL generates diverse yet semantically coherent views, mitigating semantic drift and enhancing representational discrimination. Extensive experiments across multiple benchmarks demonstrate that FA-GCL consistently outperforms state-of-the-art baselines with statistically significant gains, validating its effectiveness and robustness.

From Semantics to Spectrum: A New Lens on Graph Augmentation Strategy

While Graph Foundation Models (GFMs) have recently achieved notable progress across diverse tasks, their robustness under domain noise, structural perturbations, and even adversarial attacks remains largely underexplored. A core limitation lies in the inadequate modeling of hierarchical structural semantics, which are intrinsic priors and critical for generalization. In this work, we propose **SA²GFM**, a robust **GFM** framework that enhances domain-adaptable representations through **S**tructure-**A**ware **S**emantic **A**ugmentation. First, to embed the hierarchical structural priors, we transform entropy-based encoding trees into structure-aware textual prompts for feature augmentation. The enriched inputs are processed by a novel self-supervised Information Bottleneck mechanism that distills robust and transferable representations through structure-guided compression. To mitigate negative transfer in cross-domain adaptation, we develop an expert adaptive routing mechanism that integrates a mixture-of-experts architecture with a null expert design. To enable efficient downstream adaptation, we propose a fine-tuning module that optimizes the hierarchical structures through joint intra- and inter-community structure learning. Extensive experiments on node and graph classification validate the superiority of **SA²GFM** over 9 state-of-the-art baselines in effectiveness and in robustness against random noise and adversarial perturbations.

SA²GFM: Enhancing Robust Graph Foundation Models with Structure-Aware Semantic Augmentation

To cultivate students' aesthetic development, teachers must objectively interpret and evaluate the artistic qualities and emotional resonance within their paintings—a process known as aesthetic perception. This evaluation process is labor-intensive and susceptible to biases due to variations among individual teachers. Advances in artificial intelligence (AI) motivate the use of AI-driven models to automate and enhance this aesthetic perception task. However, building effective AI-driven aesthetic perception models requires extensive datasets, which are typically labor-intensive and costly to gather. To address this, we propose a novel framework that selectively identifies the most challenging dimensions of aesthetic perception for expert annotation, using AI-generated pseudo-annotations to reduce cost and improve model performance. Our framework integrates a multi-agent active learning strategy to systematically annotate scores across multiple dimensions of aesthetic perception. Initially, we train an aesthetic perception model using a small, manually annotated dataset, establishing primary annotation capabilities. Then, this trained model generates pseudo-annotations for unlabeled data across various aesthetic dimensions (e.g., humor, happiness). To ensure annotation quality and relevance, a multi-agent system evaluates these pseudo-annotations, identifying dimensions requiring expert human input based on metrics such as model estimation confidence. Human experts provide targeted annotations selectively, refining the dataset and guiding an iterative improvement cycle. Through repeated refinement, the model progressively enhances both its predictive accuracy and its automated annotation proficiency. Our optimization approach dynamically balances accuracy, annotation relevance, and human effort. Extensive experiments conducted on two real-world datasets demonstrate the effectiveness of our framework.

Dimension-Aware Active Annotation for Aesthetic Perception via Multi-Agent Human–AI Collaboration

The rapid development of image manipulation technologies poses significant challenges to multimedia forensics, especially in accurate localization of manipulated regions. Existing methods often fail to fully explore the intrinsic discrepancies between manipulated and authentic regions, resulting in sub-optimal performance. To address this limitation, we propose the Focus Region Discrepancy Network (FRD-Net), a novel and efficient framework that significantly enhances manipulation localization by amplifying discrepancies at both macro- and micro-levels. Specifically, our proposed Iterative Clustering Module (ICM) groups features into two discriminative clusters and refines representations via backward propagation from cluster centers, improving the distinction between tampered and authentic regions at the macro level. Thereafter, our Differential Progressive Module (DPM) is constructed to capture fine-grained structural inconsistencies within local neighborhoods and integrate them into a Central Difference Convolution, increasing sensitivity to subtle manipulation details at the micro level. Finally, these complementary modules are seamlessly integrated into a compact architecture that achieves a favorable balance between accuracy and efficiency. Extensive experiments on multiple benchmarks demonstrate that FRD-Net consistently surpasses state-of-the-art methods in terms of manipulation localization performance while maintaining a lower computational cost.

Amplifying Discrepancies: Exploiting Macro and Micro Inconsistencies for Image Manipulation Localization
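For readers unfamiliar with the Central Difference Convolution named in this abstract, the sketch below shows the commonly used formulation: a vanilla convolution minus `theta` times the center pixel multiplied by the sum of the kernel weights. It does not reproduce FRD-Net's Differential Progressive Module. The function name, the pure-Python list representation, and the default `theta` are assumptions made for illustration.

```python
def central_difference_conv2d(image, kernel, theta=0.7):
    """Single-channel central difference convolution without padding.

    out[r][c] = sum_ij k[i][j] * x[r+i][c+j]  -  theta * x_center * sum_ij k[i][j]

    theta = 0 gives a plain convolution (cross-correlation form);
    theta = 1 responds only to differences from the center pixel.
    """
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2                      # offset of the kernel center
    k_sum = sum(sum(row) for row in kernel)
    out = []
    for r in range(len(image) - kh + 1):
        row_out = []
        for c in range(len(image[0]) - kw + 1):
            vanilla = sum(kernel[i][j] * image[r + i][c + j]
                          for i in range(kh) for j in range(kw))
            center = image[r + ch][c + cw]
            row_out.append(vanilla - theta * center * k_sum)
        out.append(row_out)
    return out


ones = [[1.0] * 3 for _ in range(3)]

# A flat region has no local differences, so the response vanishes at theta = 1.
flat = [[5.0] * 4 for _ in range(4)]
print(central_difference_conv2d(flat, ones, theta=1.0))  # [[0.0, 0.0], [0.0, 0.0]]

# A vertical step edge produces a non-zero response on both sides of the edge.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
print(central_difference_conv2d(edge, ones, theta=1.0))  # [[3.0, -3.0]]
```

Uniform regions are suppressed and local intensity changes are emphasized, which is the property that makes this operator sensitive to subtle local inconsistencies.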

Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or inference performance and provide little support for measuring and analyzing the power consumption of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines a declarative configuration interface covering model choice, prompt set, and inference engine, a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straightforward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover models from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.

TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
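The idea of attributing energy to the prefill and decode phases and reporting joules per token can be sketched as follows. This is not TokenPowerBench's API: the sample format (a list of `(time_s, watts)` pairs with strictly increasing timestamps), the function names, and the example numbers are all assumptions made for illustration. Energy is taken as the trapezoidal integral of sampled power over each phase window.

```python
def power_at(samples, t):
    """Linearly interpolate power (W) at time t from sorted (time_s, watts) samples."""
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the sampled range")


def energy_joules(samples, t_start, t_end):
    """Trapezoidal integral of power over [t_start, t_end], in joules."""
    pts = [(t_start, power_at(samples, t_start))]
    pts += [(t, p) for t, p in samples if t_start < t < t_end]
    pts.append((t_end, power_at(samples, t_end)))
    return sum((t1 - t0) * (p0 + p1) / 2
               for (t0, p0), (t1, p1) in zip(pts, pts[1:]))


# Made-up power trace for one request; prefill ends at 0.5 s, decode at 3.0 s.
samples = [(0.0, 100.0), (1.0, 300.0), (2.0, 300.0), (3.0, 200.0)]
prefill_j = energy_joules(samples, 0.0, 0.5)
decode_j = energy_joules(samples, 0.5, 3.0)

prompt_tokens, generated_tokens = 50, 135   # hypothetical token counts
print(prefill_j / prompt_tokens)     # 1.5 J per prompt token
print(decode_j / generated_tokens)   # 5.0 J per generated token
```

Splitting the integral at the phase boundary keeps the two per-token figures separate, so a change such as a larger batch size can be traced to the phase it actually affects.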

Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG's inherent noise and signal degradation, vs. vision's high-fidelity features) and a Semantic Gap (arising from EEG's shallow conceptual representation, vs. vision's rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the "teacher" modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the capacity of the "student" modality (EEG). We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.3%. Our work introduces a new perspective for asymmetric alignment: enabling the teacher to adapt is key to bridging the vision-brain gap. Source code is provided in the supplementary material and will be made public upon publication.

Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment

High-scale image super-resolution (SR) has become increasingly important with the rapid growth of mobile devices and high-resolution displays. However, current SR methods primarily focus on lower scales and generalize poorly to high-scale scenarios due to severe information loss and complex real-world degradations. In this paper, we propose a novel Selective Diffusion Distillation (SDD) framework for real-world high-scale SR, which distills reliable knowledge from a low-scale diffusion teacher to a high-scale student. Specifically, considering severe information loss in high-scale inputs, directly distilling from low-scale models may result in feature misalignment. To address this, we introduce a Degradation-aware Metric Learning (DML) approach to align feature distributions across different degradation levels. In addition, since the diffusion-based teacher may hallucinate artifacts in ambiguous regions, blindly imitating these unreliable outputs can degrade the student’s fidelity. To tackle this, we propose a Region-aware Selective Distillation (RSD) strategy to filter out uncertain predictions and adaptively supervise only on reliable areas. To evaluate the effectiveness of our method, we introduce Real-UltraSR, a new real-world benchmark that contains diverse high-scale LR-HR pairs, including $\times$8, $\times$10, $\times$12, and $\times$14. Extensive experiments demonstrate that our SDD framework achieves state-of-the-art performance across multiple benchmarks.

Selective Diffusion Distillation for Real-World High-Scale Image Super-Resolution

Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects based on visual inputs without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object ambiguity, and the exploration-exploitation dilemma. To advance research and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of static urban objects. It features 2,420 tasks of varying difficulty across six object categories, designed to rigorously evaluate UAV search strategies. To solve the AVOS task, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that enables a UAV agent to think and reason like humans on visual cues when searching for objects. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic "attraction" values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Moreover, we propose a denoising mechanism to mitigate interference from similar objects and design an Inspiration Promote Thought prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). Our work paves the way for future advances in embodied visual target search.

Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

Backdoor attacks on deep neural networks (DNNs) have garnered significant attention, particularly in edge computing applications. Given the complexity and opacity of DNNs, defending against backdoor attacks remains a formidable challenge. To address this, we propose CL-Guard, a dual-network-based defense framework designed to effectively eliminate potential backdoors in models. First, it leverages an inter-layer backpropagation algorithm to quantify each neuron's contribution to model prediction. Next, it constructs a critical neuron set through a recursive hierarchical partitioning method and an adaptive search strategy, identifying neurons critical to model prediction while minimizing the inclusion of backdoor-related neurons. Then, we perform sparse training on the non-critical neuron set, effectively strengthening the weights of critical neurons while disrupting the association between trigger features and backdoor-related neurons. Finally, we design a dual-network architecture that incorporates a fine-grained gradient backpropagation mechanism and dynamic collaborative learning, enabling the model to retain its original accuracy while preventing backdoor reactivation. The experimental results indicate that CL-Guard achieves an average Security Effectiveness Index (SEI) of approximately 95.42%, representing a 21.23% improvement over the state-of-the-art FT-SAM method.

CL-Guard: Defending DNNs Against Backdoors via Fine-Grained Neuron Analysis and Collaborative Dual-Network Learning

RGB-T tracking is increasingly deployed in safety-critical applications such as autonomous driving, surveillance, and rescue robotics, where tracking reliability is essential under adverse conditions. Although the fusion of RGB and thermal infrared (TIR) modalities offers improved robustness in low-light and occluded scenes, recent findings show that RGB-T trackers remain highly susceptible to subtle input perturbations, human-imperceptible modifications that exploit cross-modal inconsistencies to mislead tracking outputs. In real-world scenarios, such perturbations can arise from sensor spoofing, infrared camouflage, or physical-world attacks, posing serious risks to operational safety.
To address this, we propose SFPT, a Semantic Feature Purification framework that enhances RGB-T tracking at the representation level. Rather than filtering corrupted inputs at the pixel level, SFPT introduces task-specific semantic anchors into the feature space to reinforce perturbation-invariant cues. These anchors, derived from descriptive language, interact with visual features to purify representations. To further suppress modality-specific interference, we design an Adaptive Perturbation-Guided Cross-Modal Fusion (APG-CMF) module, which leverages language and visual signals to estimate reliability and dynamically reweight cross-modal features, ensuring robust fusion under perturbation conditions.
Extensive experiments under diverse perturbation conditions validate the effectiveness of our approach. Notably, SFPT maintains performance comparable to clean settings even when subjected to perturbations of strength $\mathbf{\frac{1}{255}}$ and $\mathbf{\frac{4}{255}}$, demonstrating strong resilience to real-world interference.
