To cultivate students' aesthetic development, teachers must objectively interpret and evaluate the artistic qualities and emotional resonance within students' paintings, a process known as aesthetic perception. This evaluation is labor-intensive and susceptible to bias arising from variation among individual teachers. Advances in artificial intelligence (AI) motivate the use of AI-driven models to automate and enhance this aesthetic perception task. However, building effective AI-driven aesthetic perception models requires extensive annotated datasets, which are typically labor-intensive and costly to gather. To address this, we propose a novel framework that selectively identifies the most challenging dimensions of aesthetic perception for expert annotation, using AI-generated pseudo-annotations to reduce cost and improve model performance. Our framework integrates a multi-agent active learning strategy to systematically annotate scores across multiple dimensions of aesthetic perception. We first train an aesthetic perception model on a small, manually annotated dataset to establish basic annotation capability. This trained model then generates pseudo-annotations for unlabeled data across various aesthetic dimensions (e.g., humor, happiness). To ensure annotation quality and relevance, a multi-agent system evaluates these pseudo-annotations, identifying the dimensions that require expert human input based on metrics such as the model's estimation confidence. Human experts then provide targeted annotations for these dimensions, refining the dataset and driving an iterative improvement cycle. Through repeated refinement, the model progressively improves both its predictive accuracy and its automated annotation proficiency. Our optimization approach dynamically balances accuracy, annotation relevance, and human effort. Extensive experiments on two real-world datasets demonstrate the effectiveness of our framework.
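The loop below is a minimal, self-contained sketch of this iterative cycle, not the paper's actual implementation. It assumes a random-forest scorer whose per-tree disagreement serves as the estimation-confidence metric; the dimension names, the `query_expert` stub, and the 10% query budget are illustrative placeholders, and a single uncertainty criterion stands in for the full multi-agent evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical aesthetic dimensions and budget; the paper's actual set differs.
AESTHETIC_DIMS = ["humor", "happiness", "color_harmony"]
N_ROUNDS = 3          # iterative refinement cycles
QUERY_FRACTION = 0.1  # assumed: send the least confident 10% to experts

rng = np.random.default_rng(0)

def query_expert(features, dim):
    """Stand-in for human expert scoring of paintings on one dimension."""
    return rng.uniform(0.0, 1.0, size=len(features))

# Toy data: painting feature vectors plus a small expert-annotated seed set.
X_seed = rng.normal(size=(40, 16))
y_seed = {d: rng.uniform(0.0, 1.0, size=40) for d in AESTHETIC_DIMS}
X_unlabeled = rng.normal(size=(200, 16))

# One annotated pool per aesthetic dimension, initialized from the seed set.
pools = {d: (X_seed.copy(), y_seed[d].copy()) for d in AESTHETIC_DIMS}

for round_idx in range(N_ROUNDS):
    for dim in AESTHETIC_DIMS:
        X_pool, y_pool = pools[dim]

        # 1. Train the per-dimension scorer on the current annotated pool.
        model = RandomForestRegressor(n_estimators=50, random_state=round_idx)
        model.fit(X_pool, y_pool)

        # 2. Pseudo-annotate unlabeled paintings; use the spread of the
        #    ensemble's per-tree predictions as a confidence proxy.
        per_tree = np.stack([t.predict(X_unlabeled) for t in model.estimators_])
        pseudo = per_tree.mean(axis=0)
        uncertainty = per_tree.std(axis=0)

        # 3. "Agent" decision (a single criterion here, standing in for the
        #    multi-agent evaluation): route low-confidence samples to experts.
        hard = uncertainty > np.quantile(uncertainty, 1.0 - QUERY_FRACTION)
        expert_scores = query_expert(X_unlabeled[hard], dim)

        # 4. Persist the expert annotations; confident pseudo-annotations
        #    could also be mixed into training to further reduce labeling cost.
        pools[dim] = (np.vstack([X_pool, X_unlabeled[hard]]),
                      np.concatenate([y_pool, expert_scores]))
```

Ensemble disagreement is used here only because it is a cheap, widely used uncertainty proxy; any per-dimension confidence estimate produced by the perception model could take its place in step 3.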