Singapore

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about whether benchmarks accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT provides accurate and reliable estimates of item characteristics and model abilities. Based on PSN-IRT, we conduct an extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be used to construct smaller benchmarks that maintain stronger alignment with human preference.
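As a rough illustration of the kind of item parameters an IRT model works with, the sketch below implements the standard four-parameter logistic (4PL) item response function from the IRT literature. The abstract does not specify PSN-IRT's exact parameterization, so the parameter names (`a`, `b`, `c`, `d`) and the example values are assumptions made for illustration, not the authors' model.

```python
import math


def irt_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability of a correct response under the standard 4PL IRT model.

    theta: ability of the test taker (here, a model)
    a: discrimination, b: difficulty,
    c: lower asymptote (guessing), d: upper asymptote
    These names follow textbook IRT, not necessarily PSN-IRT.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def separation(theta_strong, theta_weak, a, b, c=0.0, d=1.0):
    """Gap in success probability between two ability levels on one item."""
    return irt_4pl(theta_strong, a, b, c, d) - irt_4pl(theta_weak, a, b, c, d)


# Hypothetical abilities for two strong models (3.0 and 2.0).
# An item far easier than both models barely separates them;
# an item whose difficulty sits between them separates them much better.
easy_item = separation(3.0, 2.0, a=1.0, b=-3.0)
matched_item = separation(3.0, 2.0, a=1.0, b=2.5)
print(f"easy item gap: {easy_item:.4f}, matched item gap: {matched_item:.4f}")
```

Under these assumptions, items whose difficulty lies far below the abilities of the top models contribute almost nothing to telling those models apart, which is one way to read the separability concern raised above.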

AAAI 2026

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

and evaluation of nlp models

nlp: interpretability

ml: evaluation and analysis

analysis

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Web agents, like OpenAI's Operator and Google's Project Mariner, are powerful agentic systems pushing the boundaries of Large Language Models (LLMs). They can autonomously interact with the internet at the user's behest, for example by navigating websites, filling in search forms, and comparing price lists. Though web agent research is thriving, the sustainability issues it induces remain largely unexplored. To highlight the urgency of this issue, we provide an initial exploration of the energy and CO₂ cost associated with web agents from both a theoretical perspective (via estimation) and an empirical perspective (via benchmarking). Our results show that different philosophies in web agent creation can severely impact the energy expended, and that consuming more energy does not necessarily yield better results. We highlight the lack of transparency about the model parameters and processes used by some web agents as a limiting factor when estimating energy consumption. Our work contributes towards a change in thinking about how we evaluate web agents, advocating for dedicated metrics that measure energy consumption in benchmarks.

Promoting Sustainable Web Agents: Benchmarking and Estimating Energy Consumption Through Empirical and Theoretical Analysis

Music editing is an important step in music production, with broad applications including game development and film production. Most existing zero-shot text-guided editing methods rely on pretrained diffusion models and involve forward-backward diffusion processes. However, these methods often struggle to preserve the music content. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that enhance the consistency between the original and edited music by leveraging score distillation. The first method, _SteerMusic_, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, _SteerMusic+_, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. _SteerMusic+_ allows music to be edited into any user-defined musical style, including styles that cannot be achieved by text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality. Demonstrations and implementation code are available in our supplementary materials.

SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

Recent advances in medical multi-modal models focus on specialized image analysis like dermatology, pathology, or radiology. However, they do not fully capture the complexity of real-world clinical diagnostics, which involve heterogeneous inputs and require ongoing contextual understanding during patient-physician interactions.
To bridge this gap, we introduce PulseMind, a new family of multi-modal diagnostic models that integrates a systematically curated dataset, a comprehensive evaluation benchmark, and a tailored training framework. Specifically, we first construct a diagnostic dataset, MediScope, which comprises 98,000 real-world multi-turn consultations and 601,500 medical images, spanning over 10 major clinical departments and more than 200 sub-specialties. Then, to better reflect the requirements of real-world clinical diagnosis, we develop the PulseMind Benchmark, a multi-turn diagnostic consultation benchmark with a four-dimensional evaluation protocol comprising proactiveness, accuracy, usefulness, and language quality. Finally, we design a training framework tailored for multi-modal clinical diagnostics, centered around a core component named Comparison-based Reinforcement Policy Optimization (CRPO). Compared to absolute score rewards, CRPO uses relative preference signals from multi-dimensional comparisons to provide stable and human-aligned training guidance. Extensive experiments demonstrate that PulseMind achieves competitive performance on both the diagnostic consultation benchmark and public medical benchmarks.

PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis

Though the deep neural models used for perception in autonomous driving have proven vulnerable to adversarial examples, known attacks often rely on 2D patches and mostly target monocular perception. The effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation therefore remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with a global camouflage texture rather than a local 2D patch, ensuring both visual consistency and attack effectiveness across the different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It is significantly stealthier and more lethal than existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool stereo models into producing erroneous depth information.

Cheating Stereo Matching in Full-Scale: Physical Adversarial Attack Against Binocular Depth Estimation in Autonomous Driving

In this paper, we investigate the application of heuristics based on Graph Neural Networks (GNNs) to lifted numeric planning problems, an area that has been relatively unexplored. Building upon the GNN approach for learning general policies proposed by Staahlberg et al., we extend the architecture to make it sensitive to the numeric components inherent in the planning problems we address. We achieve this by observing that, although the state space of a numeric planning problem is infinite, the finite subgoal structure of the problem can be incorporated into the architecture, enabling the construction of a finite structure. Instead of learning general policies, we train our models to serve as heuristics within a best-first search algorithm. We explore various configurations of this architecture and demonstrate that the resulting heuristics are highly informative and, in certain domains, offer a better trade-off between guidance and computational cost compared to state-of-the-art heuristics.

Learning Heuristic Functions with Graph Neural Networks for Numeric Planning

Theory of Mind (ToM) refers to the ability to infer others' mental states, which is an essential capability for embodied AI agents to effectively collaborate and interact with humans. While improving Large Language Models' ability to reason about characters' mental states in text-based stories/dialogues has been extensively studied, enhancing Multimodal Large Language Models' ToM capabilities, particularly in egocentric video from an embodied perspective, remains unexplored. In this paper, we propose a contrastive Reinforcement Learning (RL) paradigm that explicitly encourages models to leverage temporal and causal evolutionary patterns in user action sequences to infer users' mental states (goals, beliefs, and potential next actions). Evaluation results on in-domain and out-of-domain data show that our method achieves performance improvements of (+30.00%, +2.00%) and (+5.83%, +5.00%) over the backbone model and the vanilla Group Relative Policy Optimization (GRPO) model, respectively. Additionally, we compare the performance of two post-training paradigms (Supervised Fine-Tuning and RL) and systematically analyze the reasoning trajectories across the base model, the vanilla GRPO model, and our proposed method.

Reality vs Counterfactual: Multi-World Contrastive Reinforcement Learning for Enhancing MLLM’s Theory of Mind in Egocentric Videos

Knockout tournaments are a widely used competition format in sports, elections, and decision-making processes. In such tournaments, players compete in successive rounds, with losers eliminated and winners advancing until a single champion remains. Given a tournament digraph $D$, which encodes the outcomes of all possible matches, and a designated player $v^* \in V(D)$, the Tournament Fixing problem (TFP) asks whether the tournament can be scheduled in a way that guarantees $v^*$ emerges as the winner. TFP is known to be NP-hard in general (AAAI'14), but is _fixed-parameter tractable_ (FPT) when parameterized by structural measures such as the feedback arc set (fas) or feedback vertex set (fvs) number of the tournament digraph (AAAI'17; IJCAI'18; AAAI'23). In this paper, we introduce and study two new structural parameters: the number of players who can defeat $v^*$ (i.e., the in-degree of $v^*$, denoted $d^+$) and the number of players that $v^*$ can defeat (i.e., the out-degree of $v^*$, denoted $d^-$). These parameters are motivated by the observation that when either the in-degree or out-degree is zero, the problem becomes trivial. This leads to a natural question: can TFP be efficiently solved when $d^+$ or $d^-$ is small? We answer this question affirmatively by showing that TFP is FPT when parameterized by either the in-degree or out-degree of $v^*$. Our algorithm for the in-degree parameterization is particularly involved and technically intricate. Notably, the in-degree $d^+$ can remain small even when other structural parameters such as fas or fvs are large. Hence, our results offer a new perspective and significantly broaden the parameterized algorithmic understanding of the Tournament Fixing problem.

How Hard Is It to Rig a Tournament When Few Players Can Beat or Be Beaten by the Favorite?
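To make the Tournament Fixing problem concrete, here is a minimal brute-force sketch that tries every seeding of a balanced bracket and checks whether the favorite can win. It is exponential in the number of players and is not the FPT algorithm from the paper. The function names, the `beats` matrix encoding, and the four-player example are assumptions chosen for illustration.

```python
from itertools import permutations


def bracket_winner(seeding, beats):
    """Play a balanced knockout bracket; beats[u][v] is True if u defeats v."""
    alive = list(seeding)
    while len(alive) > 1:
        # Adjacent players in the current round meet; the winner advances.
        alive = [u if beats[u][v] else v for u, v in zip(alive[::2], alive[1::2])]
    return alive[0]


def can_fix(beats, favorite):
    """Return True if some seeding makes `favorite` the champion (brute force)."""
    n = len(beats)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of players")
    return any(bracket_winner(p, beats) == favorite for p in permutations(range(n)))


# Hypothetical 4-player tournament digraph:
# 0 beats 1; 1 beats 2 and 3; 2 beats 0 and 3; 3 beats 0.
beats = [
    [False, True, False, False],
    [False, False, True, True],
    [True, False, False, True],
    [True, False, False, False],
]
for player in range(4):
    print(player, can_fix(beats, player))
```

In this example, players 0 and 3 each defeat only one opponent, so neither can win the two rounds needed, while players 1 and 2 can be made champions by a suitable seeding. This matches the intuition in the abstract that the favorite's in-degree and out-degree strongly shape how hard fixing is.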

Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or stolen agricultural products. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. While these models are now actively deployed in operational settings supporting regulators, certification bodies, and companies, they remain constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel deployed data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By quantifying the marginal utility of individual samples using Shapley values, our method guides strategic, cost-effective, and robust sampling campaigns within active monitoring programs. By prioritizing highly informative samples, our approach improves model robustness and predictive accuracy across diverse datasets and geographies. Our framework has been implemented and validated in a live provenance verification system currently used by enforcement agencies, demonstrating tangible, real-world impact. Through extensive experiments and this deployment, we show that the system significantly enhances provenance verification, mitigates fraudulent trade practices, and strengthens regulatory enforcement of global supply chains.

Optimizing Product Provenance Verification Using Data Valuation Methods
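The idea of scoring individual samples by their Shapley value can be sketched with the generic permutation-sampling (Monte Carlo) estimator. This is a common approximation, not necessarily the estimator used in the paper. The `utility` callable, the region labels, and the coverage-style toy utility are hypothetical stand-ins; in a real setting the utility would be something like the validation performance of a model trained on the subset.

```python
import random


def monte_carlo_shapley(n_points, utility, n_permutations=200, seed=0):
    """Estimate per-sample Shapley values by averaging marginal gains.

    utility: callable taking a tuple of sample indices and returning a float.
    """
    rng = random.Random(seed)
    values = [0.0] * n_points
    order = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(order)
        subset = []
        previous = utility(tuple(subset))
        for i in order:
            subset.append(i)
            current = utility(tuple(subset))
            values[i] += current - previous  # marginal gain of adding sample i
            previous = current
    return [v / n_permutations for v in values]


# Toy utility (an assumption for illustration): number of distinct
# regions covered by the chosen samples. Samples 0 and 1 come from the
# same region, so they are partly redundant; sample 2 is unique.
regions = ["A", "A", "B"]


def coverage(subset):
    return float(len({regions[i] for i in subset}))


shapley = monte_carlo_shapley(len(regions), coverage)
print(shapley)
```

With this toy utility the unique sample receives the full credit for its region, while the two redundant samples split the credit for theirs. That is the behaviour that makes Shapley-style scores useful for deciding where additional sampling is worth the cost.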

In emerging clinical applications such as ultrasound-based burn assessment, the lack of domain-specific data presents a significant challenge for developing robust AI systems. Vision-language models (VLMs) have shown strong performance in general computer vision tasks, yet their application to medical imaging remains limited, particularly due to insufficient reasoning capabilities and the scarcity of high-quality training data. We introduce AURA (Automated Unified Reasoning for Burn Assessment), a multi-modal approach that integrates pre-trained VLMs with symbolic first-order logic (FOL) reasoning to improve diagnostic accuracy and interpretability in this data-limited setting. For this study, we collected real-patient data over a one-year period at a U.S. burn center, performing all experiments in a real clinical setting to ensure practical relevance. The dataset includes both conventional B-Mode ultrasound and Tissue Doppler Imaging (TDI), with TDI introduced here for the first time in burn assessment, underscoring the emerging nature of this work. Beyond burn severity classification, we assess the system’s ability to produce expert-level surgical insight directly from imaging data. On the retrospective dataset, it achieves up to 93% accuracy in surgical classification and 87% in fine-grained burn depth prediction, comparable to expert-informed predictions and substantially exceeding the 70% accuracy of traditional visual inspection by human experts. These results, obtained from a novel multi-modal dataset collected in a real clinical burn center setting, highlight the potential of this approach to improve decision-making in burn care. To further support future deployment, we demonstrate a prototype integration with an Electronic Medical Record (EMR) system that aligns with clinical workflows and supports scalable, real-world implementation.

Automated Unified Reasoning with Vision-Language Models for Multi-modal Burn Assessment

Recent advances in machine learning have driven a step-change in robot perception with modalities such as vision, where large amounts of training data are readily available or cheap to collect. However, in tactile perception, the relatively high cost of data collection still largely impedes the adoption of such data-driven learning solutions. In this article, we introduce TactGen, a novel, cross-modal framework to tackle this challenge. In particular, using a two-step data generation pipeline, we leverage easily accessible vision data to synthesise artificial tactile data for downstream classifier training. Specifically, we use readily collected video data of objects of interest to efficiently learn neural radiance field (NeRF) representations. The NeRF models are then used to render red–green–blue-depth (RGBD) images from any desired vantage points. In the second stage, the RGBD images are translated into corresponding tactile images typically generated by camera-based tactile sensors using a conditional generative adversarial network (cGAN). The cGAN model is itself trained with a large set of visuo-tactile images collected in simulation, and can be transferred into the real world without fine-tuning or additional data collection. We extensively validate this approach in the context of tactile object classification, showing that it effectively reduces data collection time by a factor of 20 while achieving similar performance to training a classifier on manually collected real data.
