
AAAI 2026

January 25, 2026

Singapore


Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an "interviewer" LLM, aware of the ground-truth solution, provides minimal, targeted hints to an "interviewee" model to help correct errors and fulfill target constraints. This dynamic protocol enables fine-grained diagnostic insights into model behavior, uncovering strengths and systematic weaknesses that static benchmarks fail to measure. We build on DevAI, a benchmark of 55 curated programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints through expert annotation. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
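The protocol described in the abstract can be pictured as a short feedback loop over a requirement dependency graph: the interviewee attempts a solution, each requirement is checked in dependency order, and the interviewer (which sees the ground-truth solution) returns a minimal hint whenever a check fails. The sketch below is purely illustrative and not the authors' implementation; the task interface (`task.requirements`, `task.check`, `task.ground_truth`) and the model interfaces (`interviewee_llm.generate`/`revise`, `interviewer_llm.hint`) are hypothetical names introduced here for clarity.

```python
# Illustrative sketch of the interviewer/interviewee evaluation loop.
# All interfaces used here are hypothetical, not the paper's actual code.

from dataclasses import dataclass, field


@dataclass
class Requirement:
    rid: str
    description: str
    depends_on: list[str] = field(default_factory=list)


def topological_order(requirements: list[Requirement]) -> list[Requirement]:
    """Order requirements so each one appears after its dependencies."""
    by_id = {r.rid: r for r in requirements}
    seen, order = set(), []

    def visit(req: Requirement) -> None:
        if req.rid in seen:
            return
        seen.add(req.rid)
        for dep in req.depends_on:
            visit(by_id[dep])
        order.append(req)

    for req in requirements:
        visit(req)
    return order


def interactive_eval(task, interviewee_llm, interviewer_llm, max_turns: int = 3):
    """Feedback-driven evaluation over a requirement dependency graph."""
    solution = interviewee_llm.generate(task.prompt)  # initial attempt
    results = {}
    for req in topological_order(task.requirements):
        for turn in range(max_turns):
            if task.check(req, solution):             # hypothetical per-requirement check
                results[req.rid] = {"passed": True, "turns": turn}
                break
            hint = interviewer_llm.hint(              # minimal, targeted hint
                requirement=req,
                attempt=solution,
                ground_truth=task.ground_truth,
            )
            solution = interviewee_llm.revise(solution, hint)
        else:
            results[req.rid] = {"passed": False, "turns": max_turns}
    return results
```

Visiting requirements in dependency order reflects the idea that a hint about a prerequisite should be resolved before its dependents are evaluated; the per-requirement turn counts give the kind of fine-grained diagnostic signal a single static pass cannot.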

Downloads

Paper
