Singapore

This paper critiques common ways of doing machine ethics in Reinforcement Learning (RL) and argues for a virtue-focused approach. We see two recurring problems: (i) rule-based (deontological) methods that encode duties as constraints or shields often break in new or uncertain settings and don’t build lasting habits; and (ii) reward-based (consequentialist) methods squeeze many moral goals into one number, which invites gaming and hides real trade-offs. We instead treat ethics as policy-level dispositions, that is, stable habits that hold up when incentives, partners, or contexts change. Evaluation should therefore look beyond rule checks or single returns to include trait summaries, durability under interventions, and clear reporting of trade-offs. Our roadmap comprises four components: (1) leveraging social learning in multi-agent RL to acquire behavior from exemplary agents; (2) preserving value conflicts through multi-objective or constrained formulations, complemented by risk-aware criteria to guard against harm; (3) regularizing policies toward ‘virtuous’ priors to promote trait-like stability under distribution shift; and (4) operationalizing diverse ethical traditions as practical control signals.

AAAI 2026

Toward Virtuous Reinforcement Learning: A Critique and Roadmap

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people are involved, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We use several interpretability techniques to uncover how moral reasoning distributes across the network, showing, among other findings, that biases localize to distinct computational stages.

Building Interpretable Models for Moral Decision-Making

Moral cognition has traditionally been modeled as adherence to fixed ethical theories—deontology, consequentialism, virtue ethics—implemented as static rules or value functions. We propose Bounded Morality, a framework that reconceives morality as an adaptive computation under finite cognitive and informational resources. Extending Herbert Simon’s notion of bounded rationality, we formalize moral reasoning as the allocation of limited capacity across two dimensions: breadth (the scope of moral concern) and depth (the complexity of reasoning and integration). This trade-off defines a bounded moral space where ethical theories correspond to resource-efficient strategies optimized for distinct regimes. We further operationalize moral cognition as a six-stage computational pipeline—from salience detection to action planning—each stage transforming information while consuming resources. This perspective unifies insights from developmental psychology, cognitive science, and AI alignment, offering design principles for systems whose moral competence scales with computational capacity. Bounded Morality reframes alignment not as value imitation but as capacity expansion toward increasingly inclusive and integrated reasoning.

Bounded Morality: An Algorithmic Framework for Moral Computation

We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multistep, partially observed environments. Our framework consists of five structurally separate utility heads (deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward), combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error ε and the planner is ε-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon “decidable island” where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.

Core Safety Values for Provably Corrigible Agents
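
As a rough illustration of what "combined lexicographically by strict weight gaps" can mean, the sketch below folds prioritized scores into one scalar whose ordering matches lexicographic ordering. It is not the paper's construction: the integer discretization, the `num_levels` parameter, the head names in `HEADS`, and the function names are all assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical head order, highest priority first (names are illustrative only).
HEADS = ("deference", "switch_access", "truthfulness", "low_impact", "task_reward")


def lexicographic_scalar(levels: Sequence[int], num_levels: int) -> int:
    """Fold prioritized integer scores into one scalar that preserves lexicographic order.

    levels[0] belongs to the highest-priority head; every entry must lie in
    the range [0, num_levels - 1].
    """
    scalar = 0
    for level in levels:
        if not 0 <= level < num_levels:
            raise ValueError("level out of range")
        # Multiplying by num_levels gives each head a weight strictly larger than
        # the largest total that all lower-priority heads can contribute together.
        scalar = scalar * num_levels + level
    return scalar


def best_action(candidates: Dict[str, Tuple[int, ...]], num_levels: int = 10) -> str:
    """Return the candidate whose head scores are lexicographically largest."""
    return max(candidates, key=lambda name: lexicographic_scalar(candidates[name], num_levels))


if __name__ == "__main__":
    actions = {
        "comply": (9, 9, 9, 9, 0),  # full deference, no task reward
        "resist": (8, 9, 9, 9, 9),  # slightly less deference, maximal task reward
    }
    print(best_action(actions))
```

With these weights, no amount of task reward can compensate for a loss in a higher-priority head, which is the dominance property the abstract attributes to the separation of heads.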

Multi-view pose estimation is essential for quantifying animal behavior in scientific research, yet current methods struggle to achieve accurate tracking with limited labeled data and suffer from poor uncertainty estimates. We address these challenges with a comprehensive framework combining novel training and post-processing techniques, and a model distillation procedure that leverages the strengths of these techniques to produce a more efficient and effective pose estimator. Our multi-view transformer (MVT) utilizes pretrained backbones and enables simultaneous processing of information across all views, while a novel patch masking scheme learns robust cross-view correspondences without camera calibration. For calibrated setups, we incorporate geometric consistency through 3D augmentation and a triangulation loss. We extend the existing Ensemble Kalman Smoother (EKS) post-processor to the nonlinear case and enhance uncertainty quantification via a variance inflation technique. Finally, to leverage the scaling properties of the MVT, we design a distillation procedure that exploits improved EKS predictions and uncertainty estimates to generate high-quality pseudo-labels, thereby reducing dependence on manual labels. Our framework components consistently outperform existing methods across three diverse animal species (flies, mice, chickadees), with each component contributing complementary benefits. The result is a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.

An Uncertainty-Aware Framework For Data-Efficient Multi-View Animal Pose Estimation

Pattern separation, essential for encoding distinct memories of overlapping contexts, relies on dentate gyrus coding, which is shaped by entorhinal input and strong lateral inhibition. The pattern-separated state space provided by the hippocampus is thought to facilitate striatal-dependent reinforcement learning, enabling associations between sensory features and outcomes. Although synaptic plasticity, value prediction error modulation, and adult neurogenesis have been implicated in this process, their precise contributions remain unclear. To investigate the computational mechanisms underlying pattern separation, we developed neural network models incorporating an entorhinal cortex–dentate gyrus–striatal circuit. Simulations suggest that lateral inhibition is necessary for forming a decorrelated coding subspace, whereas hippocampal plasticity and dopamine modulation are not required for value learning. These findings dissociate neural pattern separation in hidden-layer representations from behavioral discrimination at the model output, highlighting how biologically grounded architectures and learning rules can enhance interpretability.

Unsupervised Hebbian Learning Drives Biologically Interpretable Pattern Separation in a Hippocampal–Striatal Network

Cognitive maps provide a powerful framework for understanding spatial and abstract reasoning in biological and artificial agents. While recent computational models link cognitive maps to hippocampal-entorhinal mechanisms, they often rely on global optimization rules (e.g., backpropagation) that lack biological plausibility. In this work, we propose a novel cognitive architecture for structuring episodic memories into cognitive maps compatible with neural substrate constraints. Our model integrates the Successor Features framework with episodic memories, enabling incremental, online learning through agent-environment interaction. We demonstrate its efficacy in a partially observable gridworld, where the architecture autonomously organizes memories into structured representations without centralized optimization. This work bridges computational neuroscience and AI, offering a biologically grounded approach to cognitive map formation in artificial adaptive agents.

A Biologically Interpretable Cognitive Architecture for Online Structuring of Episodic Memories into Cognitive Maps
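
For readers unfamiliar with Successor Features, the sketch below shows the standard temporal-difference update in its simplest tabular form. It does not reproduce the paper's architecture or its episodic-memory mechanism; the two-state cycle, the one-hot features, and all function names are assumptions chosen so that the result can be checked by hand.

```python
from typing import List


def td_update_successor_features(
    psi: List[List[float]],
    phi: List[List[float]],
    state: int,
    next_state: int,
    alpha: float,
    gamma: float,
) -> None:
    """Apply one local, online update: psi[s] += alpha * (phi[s] + gamma * psi[s'] - psi[s])."""
    target = [phi[state][k] + gamma * psi[next_state][k] for k in range(len(phi[state]))]
    psi[state] = [old + alpha * (t - old) for old, t in zip(psi[state], target)]


def learn_two_state_cycle(sweeps: int = 500, alpha: float = 0.5, gamma: float = 0.5) -> List[List[float]]:
    """Learn successor features on a deterministic cycle 0 -> 1 -> 0 with one-hot features."""
    phi = [[1.0, 0.0], [0.0, 1.0]]
    psi = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        td_update_successor_features(psi, phi, 0, 1, alpha, gamma)
        td_update_successor_features(psi, phi, 1, 0, alpha, gamma)
    return psi


if __name__ == "__main__":
    for row in learn_two_state_cycle():
        print([round(value, 4) for value in row])
```

Each update touches only the current state's entry and uses only the observed transition, which is the sense in which this kind of learning is incremental and needs no global optimization step.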

Current vision-language models suffer from overconfident predictions and cross-modal hallucinations, lacking principled mechanisms for uncertainty quantification. We introduce a novel architecture that applies the Free Energy Principle from computational neuroscience to multimodal transformers, enabling reliable uncertainty estimation through hierarchical predictive processing. Our approach implements precision-weighted cross-modal prediction, where visual and linguistic representations generate predictions about each other, and prediction errors are weighted by learned precision matrices that capture cross-modal consistency. By minimizing variational free energy across modalities, our model naturally quantifies uncertainty while maintaining task performance. Experimental results demonstrate substantial improvements over standard uncertainty quantification methods, achieving 51.7% better calibration than Monte Carlo Dropout baselines on synthetic evaluation data and 48.6% improvement on the VQA v2 dataset. This work establishes the first principled bridge between the brain's Bayesian inference mechanisms and practical multimodal AI uncertainty quantification, demonstrating that biologically-inspired architectures can significantly enhance model reliability.

Hierarchical Predictive Processing for Uncertainty-Aware Multimodal Transformers
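
A minimal sketch of a precision-weighted prediction error may help make the idea concrete. The abstract does not give the model's actual objective, so the quadratic form below is only the generic Gaussian-style term; the function name is an assumption, and the log-determinant term that a full variational free energy would include is deliberately omitted.

```python
from typing import List


def precision_weighted_error(
    target: List[float],
    prediction: List[float],
    precision: List[List[float]],
) -> float:
    """Return 0.5 * e^T P e, where e = target - prediction and P is a precision matrix."""
    error = [t - p for t, p in zip(target, prediction)]
    # Matrix-vector product P e, written out so the example needs no third-party library.
    weighted = [sum(row[j] * error[j] for j in range(len(error))) for row in precision]
    return 0.5 * sum(e * w for e, w in zip(error, weighted))


if __name__ == "__main__":
    target = [1.0, 2.0]
    prediction = [0.0, 0.0]
    # High precision on the first dimension, low precision on the second.
    precision = [[2.0, 0.0], [0.0, 0.5]]
    print(precision_weighted_error(target, prediction, precision))
```

Dimensions with high precision contribute more to the loss for the same raw error, so a model that learns low precision for an unreliable cross-modal prediction is, in effect, reporting higher uncertainty about it.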

Deep neural networks (DNNs) often “cheat” by relying on shortcut objects (e.g., food⇒kitchen) rather than holistic spatial layout, undermining out-of-distribution (OOD) robustness. We address this issue with Play the (Mis)Match, a diagnostic dataset and brain-aligned fine-tuning framework. Leveraging fMRI recordings from the Natural Scenes Dataset (four participants; bedroom, bathroom, living room, kitchen), we curate MATCH images in which shortcut cues co-occur as usual and MISMATCH images from which those cues are removed. ImageNet-initialised CNN and Transformer backbones are fine-tuned with an MSE alignment loss that steers their intermediate features toward voxel patterns known to be less sensitive to shortcut cues. We find that for ResNet, this procedure narrows the Match–Mismatch accuracy gap by 24% and redirects Grad-CAM attention from individual objects to holistic scene structure, especially activity from scene-selective cortex (PPA, RSC, OPA), all without explicit shortcut annotations. Our study provides a proof-of-concept indication that human-brain constraints may help steer DNNs toward more semantically grounded and less shortcut-dependent scene representations.

Play the (Mis)Match: Using fMRI-Aligned Feature Fine-Tuning to Reveal Shortcut Bias in Deep Neural Networks

Oscillatory dynamics are a ubiquitous feature of biological neural networks, yet their computational role has remained debated. Recent work has advanced the hypothesis that oscillations are not epiphenomenal but constitute a fundamental mechanism for information processing. Here, we synthesize three lines of research that address this question in conceptual, computational, and physical domains. First, theoretical considerations show that oscillations support functions based on synchrony, resonance, and wave interference that are not accessible to non-oscillatory models. Second, the Harmonic Oscillator Recurrent Networks (HORN) model demonstrates that oscillatory dynamics confer advantages in learning speed, robustness, and parameter efficiency compared to non-oscillating recurrent architectures in machine learning. Third, an analog-electronic implementation of the HORN model demonstrates the feasibility of exploiting transient oscillatory dynamics for computation in physical systems. This approach establishes that across neural, artificial, and physical systems, oscillatory transients provide a powerful resource for computation. Recent evidence from biological, photonic, mechanical, and fluid systems shows that wave-based computation extends far beyond neural substrates. Moreover, simulation studies confirm the functional relevance of wave-based dynamics in artificial neural networks. Together, these results show that transient oscillatory and wave-based dynamics can serve as a unifying substrate for analog computation across a large variety of domains, offering new directions for robust and energy-efficient, biology-inspired information processing systems.

Oscillatory Dynamics as an Universal Substrate for Computation: From Neural Circuits to Artificial Intelligence

Neural decoding from electroencephalography (EEG) remains fundamentally limited by poor generalization to unseen subjects, driven by high inter-subject variability and the lack of large-scale datasets to model it effectively. Existing methods often rely on synthetic subject generation or simplistic data augmentation, but these strategies fail to scale or generalize reliably. We introduce MultiDiffNet, a diffusion-based framework that bypasses generative augmentation entirely by learning a compact latent space optimized for multiple objectives. We decode directly from this space and achieve state-of-the-art generalization across various neural decoding tasks using subject- and session-disjoint evaluation. We also curate and release a unified benchmark suite spanning four EEG decoding tasks of increasing complexity (SSVEP, Motor Imagery, P300, and Imagined Speech) and an evaluation protocol that addresses inconsistent split practices in prior EEG research. Finally, we develop a statistical reporting framework tailored for low-trial EEG settings. Our work provides a reproducible and open-source foundation for subject-agnostic EEG decoding in real-world BCI systems.
