The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
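The abstract does not spell out how the Tool Interaction Graph is built or how Intrinsic Load is scored from it. The sketch below is a minimal, illustrative reading of those ideas, assuming that tool calls form nodes, data dependencies form directed edges, and structural complexity is proxied by chain depth plus branching; the class name, method names, and scoring formula are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToolInteractionGraph:
    """Hypothetical Tool Interaction Graph: nodes are tool calls,
    directed edges mark data dependencies between calls."""
    edges: dict[str, set[str]] = field(default_factory=dict)

    def add_dependency(self, upstream: str, downstream: str) -> None:
        # Register both endpoints so isolated downstream calls are counted too.
        self.edges.setdefault(upstream, set()).add(downstream)
        self.edges.setdefault(downstream, set())

    def depth(self) -> int:
        """Longest dependency chain, i.e. sequential steps the agent must plan."""
        memo: dict[str, int] = {}

        def longest(node: str) -> int:
            if node not in memo:
                memo[node] = 1 + max((longest(n) for n in self.edges[node]), default=0)
            return memo[node]

        return max((longest(n) for n in self.edges), default=0)

    def intrinsic_load(self) -> float:
        """Illustrative proxy only: chain depth plus a penalty for extra
        branching edges beyond a simple linear chain (assumed scoring;
        the paper's actual formalization may differ)."""
        n_calls = len(self.edges)
        n_edges = sum(len(v) for v in self.edges.values())
        return self.depth() + 0.5 * (n_edges - max(n_calls - 1, 0))


# Example solution path: search -> fetch -> summarize,
# with a parallel lookup call also feeding summarize.
tig = ToolInteractionGraph()
tig.add_dependency("search", "fetch")
tig.add_dependency("fetch", "summarize")
tig.add_dependency("lookup", "summarize")
print(tig.depth(), tig.intrinsic_load())  # 3 3.0
```

Under this reading, "parametrically adjustable cognitive load" would correspond to generating benchmark tasks whose graphs are grown to a target depth and branching factor, while Extraneous Load would be varied separately through the wording of the task prompt rather than the graph itself.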