Singapore

Self-evaluation is increasingly central to language model training, from constitutional AI to self-refinement. We investigate whether coupling self-evaluation to reward signals creates incentives for wireheading, where agents manipulate reward measurements rather than improving task performance. We formalize conditions under which reward-channel control strictly dominates task-focused behavior in POMDPs and test these predictions empirically. Across two models and three tasks, we find that models whose self-grades determine rewards exhibit substantial grade inflation without corresponding accuracy gains, particularly on ambiguous tasks like summarization. Models that self-evaluate but do not control rewards show no such inflation. Our results demonstrate that self-evaluation is safe when decoupled from learning signals but dangerous when coupled, with clear implications for agentic system design.

AAAI 2026

Does Self-Evaluation Enable Wireheading in Language Models?

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Over the last couple of years, AI Agents have gained significant traction due to substantial progress in the capabilities of underlying General Purpose AI (GPAI) models, enhanced scaffolding techniques, and the promise to drive societal transformation. Companies, researchers, and policy makers have started to consider the effects that AI agents may have across different dimensions of our lives. However, the literature exploring the broader effects of human-agent interactions is still underdeveloped. In this paper, we review the problem of traffic modulation by autonomous vehicles (AVs) in mixed traffic flows and extrapolate what has been learned about the different modes of interaction between humans and AVs to the human-AI agent pair. In doing so, we propose a preliminary taxonomy of relational archetypes based on the literature on Human-Computer Interaction (HCI) and AV-human interaction, and tentatively explore how the resulting framework may lead to new questions regarding human-agent interactions. Our effort is aimed at strengthening existing bridges between these two research communities, which share similar traits: autonomy, fast adoption, high impact, and great potential for economic transformation. Building on previous analogies between AI Agents and AVs (e.g., regarding autonomy levels), we expect this paper to spark scholarly debate on the different types of impact that agents may have on our societies, while inviting other researchers to expand the scope of their comparative analysis regarding AI Agents.

Relational Archetypes: A Comparative Analysis of AV-Human and Agent-Human Interactions

This paper offers practical guidance on AI welfare in industry. While previous scholarship has centered on the theoretical nuances of the definition of moral status or the definition and validity of the many properties that may constitute it, little attention has been given to the practical implementation of such work. In this paper, we introduce a framework that classifies AI systems along two axes, evidence of personhood and observed controllability, yielding four operational classes (A-D: controllable & lacking moral status, controllable & possessing moral status, uncontrollable & possessing moral status, uncontrollable & lacking moral status) and three tiers of moral status: Tier 0 (Presumed Object), Tier 1 (Ambiguous Other), and Tier 2 (Confirmed Other). We first develop an industry-applicable, practical theory of indicators that an AI system must satisfy to attain any moral status (Tier 1) or to qualify for moral personhood (Tier 2), organized into four criteria: consciousness, theory of mind, self-awareness, and robust agency. Next, we argue that pre-existing AI safety evaluations can function as dual-use assessments that simultaneously test AI safety metrics and probe for the moral status indicators from our practical theory, and we detail how specific components of alignment techniques can serve this dual function. Lastly, we propose co-alignment for AI entities belonging to Class B. We do not take a stance on whether any present system is or will be conscious, but argue that the combination of the non-negligible chance of AI deserving moral consideration and the moral significance of a false negative necessitates immediate preparation, regardless of timeline uncertainty.

From Object to Other: A Practical Theory of AI Moral Status and Personhood in Re-evaluating AI Safety Methods

Large language models display a peculiar form of inconsistency: they "know" the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce the Akrasia Benchmark, a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model's local response contradicts its own prior commitments. The benchmark enables quantitative comparison of "self-control" across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akrasia may compound into macro-level instability in multi-agent systems that may be interpreted as "scheming" or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.

The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems

Autonomous AI Agents powered by LLMs have shown remarkable abilities in diverse domains. However, the training process typically requires centralized collection of large amounts of real-world user data, posing substantial privacy and regulatory concerns. To this end, we explore a new decentralized training paradigm, namely FedAgent (Federated Agent Reinforcement Learning), which enables collaborative learning of AI agents across distributed clients without sharing local data. Moreover, we construct the first decentralized agent learning environment, FedAgentGym, which includes four types of LLM agents, two application scenarios (WebShop and ALFWorld), three variations of decentralized settings, and three newly defined heterogeneity challenges (Preference Heterogeneity, Coverage Heterogeneity, and Hardness Heterogeneity), to systematically investigate its effectiveness and impact factors. Extensive theoretical and empirical studies show that FedAgent can achieve performance comparable to the centralized training paradigm and exhibit strong robustness against heterogeneities, which shows the feasibility of training AI agents without sacrificing data privacy. The code is available.

Federated Agent Reinforcement Learning

Large language model (LLM)–based agents are increasingly used to perform complex, multi-step workflows in regulated settings such as compliance and due diligence. Yet many agentic architectures focus on prompt engineering for single agents, which makes it difficult to observe or compare how models handle uncertainty and coordination across interconnected decision stages and with humans. This paper introduces a multi-agent system design formalized as a bounded-horizon, directed, acyclic Markov Decision Process (MDP). Each agent in this system corresponds to a specific step or role (e.g., content, business, legal in a compliance setting), with set transitions between agents representing task escalation or completion. Epistemic uncertainty (per agent) is quantified using Monte Carlo estimation, and system-level uncertainty (across agents) is characterized by the MDP setup terminating either in a labeled state or in one with human review. We illustrate the approach with a case study in AI safety evaluation for self-harm detection via a multi-agent compliance system based on this setup. Results show improvements over a single-agent baseline in accuracy (up to 19%), reduction in required human review (up to 85×), and, in some configurations, less processing time.

Constrained Process Maps for Multi-Agent Generative AI Workflows

Fine-tuning large-scale Vision–Language Models (VLMs) is computationally demanding, motivating the need for efficient data utilization. Existing subset selection methods, such as COINCIDE, primarily focus on distribution matching but overlook instance-level utility, redundancy, and task-specific reasoning relevance. We propose QUBO-based Informative Subset Selection (QUBISS), a unified framework that formulates data selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem. QUBISS jointly maximizes task-relevant data utility and minimizes sample redundancy to promote diversity and compactness. Central to our method is the task vector, which quantifies the semantic contribution of textual information to reasoning performance and integrates it into the QUBO utility term. When applied to fine-tuning LLaVA v1.5, QUBISS selects only 20% of a 665K image–text dataset while achieving results superior or comparable to COINCIDE on both cognition-oriented (MME-C) and recognition-oriented (MME-R) benchmarks. The observed gains underscore the value of task-aware semantic guidance for cost-efficient multimodal fine-tuning. Furthermore, advances in large-scale quantum solvers could further enhance QUBISS by directly solving large QUBOs without decomposing them into cluster-level subproblems, thereby mitigating suboptimality arising from problem partitioning.

QUBO-Based Subset Selection for Efficient Fine-Tuning of Vision‒Language Models
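
To make the QUBO framing in the abstract above concrete, the following is a minimal sketch of how subset selection can be written as a QUBO and solved by brute force on a toy instance. It is not the QUBISS formulation: the utility scores, the similarity matrix, the penalty weights, and the quadratic budget penalty are all illustrative assumptions, and the function names are hypothetical.

```python
from itertools import product


def build_qubo(utility, similarity, budget, redundancy_weight=1.0, budget_weight=2.0):
    """Build an upper-triangular QUBO matrix Q for toy subset selection.

    Assumed objective (illustrative, not taken from the paper):
        minimize  -sum_i u_i x_i
                  + redundancy_weight * sum_{i<j} s_ij x_i x_j
                  + budget_weight * (sum_i x_i - budget)^2
    Because x_i is binary, x_i^2 = x_i, so the budget penalty contributes
    budget_weight * (1 - 2 * budget) to each diagonal entry and
    2 * budget_weight to each off-diagonal entry (the constant is dropped).
    """
    n = len(utility)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -utility[i] + budget_weight * (1 - 2 * budget)
        for j in range(i + 1, n):
            Q[i][j] = redundancy_weight * similarity[i][j] + 2 * budget_weight
    return Q


def energy(Q, x):
    """Evaluate x^T Q x for an upper-triangular Q and a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_brute_force(Q):
    """Enumerate all 2^n assignments; only viable for very small n."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))


# Toy data (hypothetical): samples 0 and 1 are near-duplicates.
utility = [1.0, 0.9, 0.8, 0.2]
similarity = [
    [0.0, 0.95, 0.1, 0.1],
    [0.95, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
selection = solve_brute_force(build_qubo(utility, similarity, budget=2))
print(selection)
```

On this toy instance the redundancy term steers the solver away from picking both near-duplicates even though they have the two highest utilities. Exhaustive enumeration stands in here for whatever classical or quantum solver would be used at realistic scale.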

Graph Neural Networks (GNNs) have become indispensable in critical domains such as drug discovery, social network analysis, and recommendation systems, yet their black-box nature hinders deployment in scenarios requiring transparency and accountability. While Shapley value-based methods offer mathematically principled explanations by quantifying each component’s contribution to predictions, computing exact values requires evaluating 2^n coalitions (or aggregating over n! permutations), which is intractable for real-world graphs. Existing approximation strategies sacrifice either fidelity or efficiency, limiting their practical utility. We introduce QGShap, a quantum computing approach that leverages amplitude amplification to achieve quadratic speedups in coalition evaluation while maintaining exact Shapley computation. Unlike classical sampling or surrogate methods, our approach provides fully faithful explanations without approximation trade-offs for tractable graph sizes. We conduct empirical evaluations on synthetic graph datasets, demonstrating that QGShap achieves consistently high fidelity and explanation accuracy, matching or exceeding the performance of classical methods across all evaluation metrics. These results collectively demonstrate that QGShap not only preserves exact Shapley faithfulness but also delivers interpretable, stable, and structurally consistent explanations that align with the underlying graph reasoning of GNNs.

QGShap: Quantum Acceleration for Faithful GNN Explanations
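
The exponential cost mentioned in the abstract above is easy to see in a purely classical sketch of exact Shapley computation. The code below is not QGShap and involves no quantum routine; it simply enumerates coalitions for a toy three-node graph whose value function (the number of edges fully contained in the coalition) is a hypothetical stand-in for a GNN prediction.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition of the other players.

    The number of value() calls grows as n * 2^(n-1), which is why exact
    computation becomes intractable for real-world graphs.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical value function: a path graph a-b-c, where a coalition is worth
# the number of edges whose endpoints are both present.
EDGES = [("a", "b"), ("b", "c")]


def toy_value(coalition):
    return float(sum(1 for u, v in EDGES if u in coalition and v in coalition))


shapley = exact_shapley(["a", "b", "c"], toy_value)
print(shapley)
```

The middle node `b` receives twice the attribution of either endpoint, and the values sum to the worth of the full graph (the efficiency property), which is the kind of faithfulness guarantee that sampling-based approximations give up.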

Synthetic data generation plays an important role in enabling data sharing, particularly in sensitive domains like healthcare and finance. Recent advances in diffusion models have made it possible to generate realistic, high-quality tabular data, but they may also memorize training records and leak sensitive information. Membership inference attacks (MIAs) exploit this vulnerability by determining whether a record was used in training. While MIAs have been studied in images and text, their use against tabular diffusion models remains underexplored despite the unique risks of structured attributes and limited record diversity. In this paper, we introduce MIA-EPT, Membership Inference Attack via Error Prediction for Tabular Data, a novel black-box attack specifically designed to target tabular diffusion models. MIA-EPT constructs error-based feature vectors by masking and reconstructing attributes of target records, disclosing membership signals based on how well these attributes are predicted. MIA-EPT operates without access to the internal components of the generative model, relying only on its synthetic data output, and was shown to generalize across multiple state-of-the-art diffusion models. We validate MIA-EPT on three diffusion-based synthesizers, achieving AUC-ROC scores of up to 0.599 and TPR@10% FPR values of 22.0% in our internal tests. Under the MIDST 2025 competition conditions, MIA-EPT achieved second place in the Black-box MultiTable track (TPR@10% FPR = 20.0%). These results demonstrate that our method can uncover substantial membership leakage in synthetic tabular data, challenging the assumption that synthetic data is inherently privacy-preserving.

MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data
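
For readers unfamiliar with the TPR@10% FPR metric quoted in the abstract above, the sketch below shows one common way to compute a true-positive rate at a fixed false-positive budget from attack scores. The scores and labels are invented for illustration, the function name is hypothetical, and the tie-handling convention (strictly greater than the threshold) is an assumption; the competition's exact scoring procedure may differ.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.10):
    """True-positive rate at the strictest threshold keeping FPR <= max_fpr.

    scores: higher means "more likely a training member".
    labels: 1 for members, 0 for non-members.
    A record is flagged as a member when its score is strictly greater than
    the threshold, so ties with the threshold are never counted.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one member and one non-member")
    allowed_false_positives = int(max_fpr * len(negatives))
    if allowed_false_positives >= len(negatives):
        return 1.0
    # At most `allowed_false_positives` non-members score above this value.
    threshold = negatives[allowed_false_positives]
    return sum(s > threshold for s in positives) / len(positives)


# Invented example: 5 members and 10 non-members.
member_scores = [0.95, 0.8, 0.6, 0.5, 0.3]
non_member_scores = [0.9, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1]
scores = member_scores + non_member_scores
labels = [1] * len(member_scores) + [0] * len(non_member_scores)

result = tpr_at_fpr(scores, labels, max_fpr=0.10)
print(result)
```

With ten non-members, a 10% budget allows one false positive, so the threshold sits at the second-highest non-member score. Reporting TPR at a low fixed FPR emphasizes how many members an attacker can identify confidently, which an aggregate measure such as AUC-ROC can obscure.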

Text-to-image diffusion models achieve high-fidelity image generation from natural language prompts. ControlNets extend these models by enabling conditioning on structural inputs (e.g., edge maps, depth, pose), providing fine-grained control over outputs. Yet their reliance on large, publicly scraped datasets and community fine-tuning makes them vulnerable to data poisoning. We introduce a model-poisoning attack that embeds a covert backdoor into a ControlNet, causing it to produce attacker-specified content when exposed to visual triggers, without textual prompts. Experiments show that poisoning only 1% of the fine-tuning corpus yields a 90–98% attack success rate, while 5% further strengthens the backdoor, all while preserving normal generation quality. To mitigate this risk, we propose clean fine-tuning (CFT): freezing the diffusion backbone and fine-tuning only the ControlNet on a sanitized dataset with a reduced learning rate. CFT lowers attack success rates on held-out data. These results expose a critical security weakness in open-source, ControlNet-guided diffusion pipelines and demonstrate that CFT offers a practical defense for responsible synthetic-data pipelines.

Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines

The increased availability of genetic data has transformed genomics research but has raised many privacy concerns about its handling, given its sensitive nature. This work explores the use of language models (LMs) for the generation of synthetic genetic mutation profiles, leveraging differential privacy (DP) for the protection of sensitive genetic data. We empirically evaluate the privacy guarantees of our DP modes by introducing a novel **Biologically-Informed Hybrid Membership Inference Attack** (biHMIA), which combines traditional black-box MIA with contextual genomics metrics for enhanced attack power. Our experiments show that both small and large transformer GPT-like models are viable synthetic variant generators for *small-scale genomics*, and that our hybrid attack leads, on average, to higher adversarial success compared to traditional metric-based MIAs.
