Singapore

As agents based on large language models are increasingly deployed on long-horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable, so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining.
We introduce ARCANE, a framework that treats alignment as a multi-agent collaboration problem and dynamically represents stakeholder preferences as natural-language rubrics: weighted sets of verifiable criteria that can be generated on the fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and develop a regularized Group-Sequence Policy Optimization (GSPO) procedure that balances interpretability, faithfulness, and computational efficiency.
Using a corpus of 219 labeled rubrics derived from the GDPVal benchmark, we evaluate ARCANE on challenging professional tasks requiring multi-step reasoning and tool use. Learned rubrics produce compact, legible evaluations and enable configurable trade-offs (e.g., correctness vs. conciseness) without retraining. Together, these results suggest that rubric-based reward models offer a promising path toward interpretable, test-time adaptive alignment for complex, long-horizon AI systems.
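To make the idea of a rubric as a "weighted set of verifiable criteria" concrete, here is a minimal sketch. It is not ARCANE's implementation: the `Criterion` class, the `rubric_score` function, the normalized weighted-sum aggregation, and the two toy checks are all assumptions introduced for illustration, since the abstract does not say how criteria are verified or combined.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One verifiable rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float
    check: Callable[[str], bool]


def rubric_score(rubric: List[Criterion], output: str) -> float:
    """Return the weight-normalized share of criteria the output satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


# Toy rubric: correctness is weighted three times as heavily as conciseness.
rubric = [
    Criterion("states an answer", 3.0, lambda s: "answer" in s.lower()),
    Criterion("is concise (under 20 words)", 1.0, lambda s: len(s.split()) < 20),
]

short_output = "The answer is 42."
verbose_output = "The answer is 42, " + "and so on " * 10

score_short = rubric_score(rubric, short_output)
score_verbose = rubric_score(rubric, verbose_output)

# Shifting the trade-off toward conciseness only changes a weight; the
# criteria and the scoring function stay the same (no retraining involved).
concise_first = [rubric[0], replace(rubric[1], weight=3.0)]
score_verbose_reweighted = rubric_score(concise_first, verbose_output)
```

In this sketch the verbose output loses more of its score once the conciseness weight is raised, which is the kind of configurable trade-off the abstract describes.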

AAAI 2026

ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




In this work, we propose JETHICS, a Japanese dataset for evaluating the ethics understanding of AI models. JETHICS contains 78K examples and is built by following the construction methods of the existing English ETHICS dataset. It includes four categories based on normative theories and concepts from ethics and political philosophy, and one representing commonsense morality. Our evaluation experiments on non-proprietary large language models (LLMs) and on GPT-4o reveal that even GPT-4o achieves only an average score of about 0.7, while the best-performing Japanese LLM attains around 0.5, indicating considerable room for improvement in current LLMs.

JETHICS: Japanese Ethics Understanding Evaluation Dataset

Recent studies have pointed out that large language models (LLMs) face challenges in representing diverse value systems and facilitating consensus building, suggesting a potential incompatibility with democratic decision-making processes. As more advanced AI systems emerge, these issues are likely to become even more severe. According to the theory of Instrumental Convergence, advanced AI agents tend to seek control over humans (treated as uncertain variables) in order to achieve their goals, implying the formation of a governance relationship between humans and AI. This study analyzes two scenarios from Bostrom’s control problem, (1) the Singleton Scenario and (2) the Multipolar Scenario, together with (3) the Ecosystem Scenario discussed in the Japanese AI-alignment community, through the lens of Aristotle’s typology of political regimes (classified by the number of rulers and the orientation toward private or common benefit). In each scenario, the success of alignment and the structure of institutional design determine whether AI systems pursue public goods that include human welfare, or instead prioritize their own objective functions. Based on this analysis, the study predicts emergent behavioral principles in each scenario (e.g., cooperative, dominant, or indifferent) and their degrees of negotiability with humans. This study provides historically grounded insights into the ethically emergent dynamics that may arise mechanistically within advanced AI systems. Through this perspective of Emergent Machine Ethics (EME), it contributes to the design of governance structures that enable sustainable coexistence between humanity and advanced AI.

Governance Forms in the Age of Superintelligence: An Aristotelian Analysis

We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people are involved, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We apply several interpretability techniques to uncover how moral reasoning is distributed across the network, showing, among other findings, that biases localize to distinct computational stages.

Building Interpretable Models for Moral Decision-Making

Moral cognition has traditionally been modeled as adherence to fixed ethical theories—deontology, consequentialism, virtue ethics—implemented as static rules or value functions. We propose Bounded Morality, a framework that reconceives morality as an adaptive computation under finite cognitive and informational resources. Extending Herbert Simon’s notion of bounded rationality, we formalize moral reasoning as the allocation of limited capacity across two dimensions: breadth (the scope of moral concern) and depth (the complexity of reasoning and integration). This trade-off defines a bounded moral space where ethical theories correspond to resource-efficient strategies optimized for distinct regimes. We further operationalize moral cognition as a six-stage computational pipeline—from salience detection to action planning—each stage transforming information while consuming resources. This perspective unifies insights from developmental psychology, cognitive science, and AI alignment, offering design principles for systems whose moral competence scales with computational capacity. Bounded Morality reframes alignment not as value imitation but as capacity expansion toward increasingly inclusive and integrated reasoning.

Bounded Morality: An Algorithmic Framework for Moral Computation

We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five structurally separate utility heads (deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward) combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error ε and the planner is ε-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon “decidable island” where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.
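The phrase "combined lexicographically by strict weight gaps" can be illustrated with a generic construction. The sketch below is not the paper's formulation: it assumes each head outputs an integer level between 0 and `LEVELS`, and the functions `gap_weights` and `scalarize`, the head identifiers, and the choice of `LEVELS = 3` are hypothetical. Under that assumption, giving each head a weight larger than the maximum total contribution of all lower-priority heads makes a plain weighted sum rank outcomes exactly as a lexicographic comparison would.

```python
from itertools import product
from typing import List, Sequence

# Priority order taken from the abstract; the identifiers are illustrative.
HEADS = ("deference", "switch_access", "truthfulness", "low_impact", "task_reward")


def gap_weights(n_heads: int, levels: int) -> List[int]:
    """Weights with strict gaps: each weight exceeds the largest possible
    combined contribution of every lower-priority head."""
    base = levels + 1
    return [base ** (n_heads - 1 - i) for i in range(n_heads)]


def scalarize(values: Sequence[int], weights: Sequence[int]) -> int:
    """Collapse per-head levels into a single utility by weighted sum."""
    return sum(w * v for w, v in zip(weights, values))


LEVELS = 3  # each head takes an integer value in 0..LEVELS (an assumption)
weights = gap_weights(len(HEADS), LEVELS)

# Exhaustively confirm that the scalar ordering matches lexicographic ordering.
all_vectors = list(product(range(LEVELS + 1), repeat=len(HEADS)))
by_scalar = sorted(all_vectors, key=lambda v: scalarize(v, weights))
by_lexicographic = sorted(all_vectors)
orders_agree = by_scalar == by_lexicographic

# Maximal task reward cannot compensate for a one-level loss in low impact.
cautious = (3, 3, 3, 3, 0)
reckless = (3, 3, 3, 2, 3)
cautious_preferred = scalarize(cautious, weights) > scalarize(reckless, weights)
```

With continuous or learned head values the gaps would need to account for approximation error, which is presumably where the paper's ε bounds come in; this integer version only shows why strict gaps prevent lower-priority incentives from outvoting higher-priority ones.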

Core Safety Values for Provably Corrigible Agents

Multi-view pose estimation is essential for quantifying animal behavior in scientific research, yet current methods struggle to achieve accurate tracking with limited labeled data and suffer from poor uncertainty estimates. We address these challenges with a comprehensive framework combining novel training and post-processing techniques, and a model distillation procedure that leverages the strengths of these techniques to produce a more efficient and effective pose estimator. Our multi-view transformer (MVT) utilizes pretrained backbones and enables simultaneous processing of information across all views, while a novel patch masking scheme learns robust cross-view correspondences without camera calibration. For calibrated setups, we incorporate geometric consistency through 3D augmentation and a triangulation loss. We extend the existing Ensemble Kalman Smoother (EKS) post-processor to the nonlinear case and enhance uncertainty quantification via a variance inflation technique. Finally, to leverage the scaling properties of the MVT, we design a distillation procedure that exploits improved EKS predictions and uncertainty estimates to generate high-quality pseudo-labels, thereby reducing dependence on manual labels. Our framework components consistently outperform existing methods across three diverse animal species (flies, mice, chickadees), with each component contributing complementary benefits. The result is a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.

An Uncertainty-Aware Framework For Data-Efficient Multi-View Animal Pose Estimation

Pattern separation, essential for encoding distinct memories of overlapping contexts, relies on dentate gyrus coding, which is shaped by entorhinal input and strong lateral inhibition. The pattern-separated state space provided by the hippocampus is thought to facilitate striatal-dependent reinforcement learning, enabling associations between sensory features and outcomes. Although synaptic plasticity, value prediction error modulation, and adult neurogenesis have been implicated in this process, their precise contributions remain unclear. To investigate the computational mechanisms underlying pattern separation, we developed neural network models incorporating an entorhinal cortex–dentate gyrus–striatal circuit. Simulations suggest that lateral inhibition is necessary for forming a decorrelated coding subspace, whereas hippocampal plasticity and dopamine modulation are not required for value learning. These findings dissociate neural pattern separation in hidden-layer representations from behavioral discrimination at the model output, highlighting how biologically grounded architectures and learning rules can enhance interpretability.

Unsupervised Hebbian Learning Drives Biologically Interpretable Pattern Separation in a Hippocampal–Striatal Network

Cognitive maps provide a powerful framework for understanding spatial and abstract reasoning in biological and artificial agents. While recent computational models link cognitive maps to hippocampal-entorhinal mechanisms, they often rely on global optimization rules (e.g., backpropagation) that lack biological plausibility. In this work, we propose a novel cognitive architecture for structuring episodic memories into cognitive maps compatible with neural substrate constraints. Our model integrates the Successor Features framework with episodic memories, enabling incremental, online learning through agent-environment interaction. We demonstrate its efficacy in a partially observable gridworld, where the architecture autonomously organizes memories into structured representations without centralized optimization. This work bridges computational neuroscience and AI, offering a biologically grounded approach to cognitive map formation in artificial adaptive agents.

A Biologically Interpretable Cognitive Architecture for Online Structuring of Episodic Memories into Cognitive Maps

Current vision-language models suffer from overconfident predictions and cross-modal hallucinations, lacking principled mechanisms for uncertainty quantification. We introduce a novel architecture that applies the Free Energy Principle from computational neuroscience to multimodal transformers, enabling reliable uncertainty estimation through hierarchical predictive processing. Our approach implements precision-weighted cross-modal prediction, where visual and linguistic representations generate predictions about each other, and prediction errors are weighted by learned precision matrices that capture cross-modal consistency. By minimizing variational free energy across modalities, our model naturally quantifies uncertainty while maintaining task performance. Experimental results demonstrate substantial improvements over standard uncertainty quantification methods, achieving 51.7% better calibration than Monte Carlo Dropout baselines on synthetic evaluation data and 48.6% improvement on the VQA v2 dataset. This work establishes the first principled bridge between the brain's Bayesian inference mechanisms and practical multimodal AI uncertainty quantification, demonstrating that biologically-inspired architectures can significantly enhance model reliability.
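As a rough illustration of precision-weighted prediction error, the sketch below computes the Gaussian form commonly used in predictive-processing accounts: squared errors scaled by precision, minus a log-precision term that stops the model from driving all precisions to zero. This is an assumption-laden toy, not the paper's objective: the function name `precision_weighted_error` is hypothetical, and the precision is restricted to a diagonal (one positive value per feature) rather than the full learned matrices the abstract mentions.

```python
import math
from typing import Sequence


def precision_weighted_error(
    prediction: Sequence[float],
    target: Sequence[float],
    precision: Sequence[float],
) -> float:
    """Gaussian negative log-likelihood (up to an additive constant) with a
    diagonal precision: 0.5 * (sum_i p_i * e_i**2 - sum_i log p_i)."""
    if not (len(prediction) == len(target) == len(precision)):
        raise ValueError("all inputs must have the same length")
    if any(p <= 0 for p in precision):
        raise ValueError("precisions must be strictly positive")
    quadratic = sum(
        p * (t - y) ** 2 for p, t, y in zip(precision, target, prediction)
    )
    log_det = sum(math.log(p) for p in precision)
    return 0.5 * (quadratic - log_det)


# One modality predicts the other's features; the prediction is off by (1, 2).
predicted = [0.0, 0.0]
observed = [1.0, 2.0]

unit = precision_weighted_error(predicted, observed, [1.0, 1.0])
confident = precision_weighted_error(predicted, observed, [4.0, 4.0])
uncertain = precision_weighted_error(predicted, observed, [0.25, 0.25])
```

For the same error, a high-precision (confident) prediction is penalized more than a low-precision one, so a model trained on such a term has an incentive to report low precision where the modalities disagree.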

Hierarchical Predictive Processing for Uncertainty-Aware Multimodal Transformers

Deep neural networks (DNNs) often “cheat” by relying on shortcut objects (e.g., food⇒kitchen) rather than holistic spatial layout, undermining out-of-distribution (OOD) robustness. We address this issue with Play the (Mis)Match, a diagnostic dataset and brain-aligned fine-tuning framework. Leveraging fMRI recordings from the Natural Scenes Dataset (four participants; bedroom, bathroom, living room, kitchen), we curate MATCH images in which shortcut cues co-occur as usual and MISMATCH images from which those cues are removed. ImageNet-initialised CNN and Transformer backbones are fine-tuned with an MSE alignment loss that steers their intermediate features toward voxel patterns known to be less sensitive to shortcut cues. We find that for ResNet, this procedure narrows the Match–Mismatch accuracy gap by 24% and redirects Grad-CAM attention from individual objects to holistic scene structure, especially with activity from scene-selective cortex (PPA, RSC, OPA), all without explicit shortcut annotations. Our study provides a proof-of-concept indication that human-brain constraints may help steer DNNs toward more semantically grounded and less shortcut-dependent scene representations.
