Singapore

Multimodal large language models (MLLMs) are proficient in perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, but most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike previous benchmarks that assess only final predictions, GamiBench holistically evaluates the entire reasoning process of the models, measuring cross-view consistency, physical feasibility through impossible-fold detection, and interpretation of intermediate folding steps. It further introduces two new diagnostic metrics, viewpoint consistency (VC) and impossible fold selection rate (IFSR), to measure how well models handle folds of varying complexity. By linking geometric evaluation with sequential reasoning, GamiBench enables a comprehensive evaluation of state-of-the-art MLLMs, revealing significant limitations in spatial reasoning capabilities, such as multi-view inconsistency and difficulty detecting physically impossible folds. Our experiments show that even leading models such as GPT-5 and Gemini-2.5-Pro struggle with single-step spatial understanding, while other MLLMs tend to show highly variable or inconsistent answering trends.
These contributions establish a standardized framework for evaluating and advancing geometric understanding and spatial reasoning in MLLMs. The GamiBench dataset and code will be made available upon publication.

AAAI 2026

GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Alzheimer's disease intervention planning requires both predictive modeling of biomarker trajectories and counterfactual reasoning about treatment timing. We propose a neuro-symbolic architecture that integrates Fourier Neural Operators (FNOs) for physics-informed biomarker prediction with Answer Set Programming (ASP) and SMT solving for verifiable intervention planning. Our approach adapts FNO methods to learn surrogate operators for AT(N) biomarker cascade dynamics, enabling fast multi-year trajectory forecasting while preserving cascade constraints. The symbolic layer formalizes clinical knowledge using first-order logic and temporal logic rules, allowing ASP/s(CASP) to generate candidate intervention strategies that are verified against safety properties using Z3 SMT solving. This combination provides both the predictive power of deep learning and the formal guarantees of symbolic reasoning, addressing critical translational challenges in precision medicine for Alzheimer's disease.

Neuro-Symbolic AI for Alzheimer's Disease: Physics-Informed Biomarker Prediction and Verifiable Intervention Planning

Rising provider turnover means that patients frequently need to be rematched with available providers. However, the rematching process is cumbersome for both patients and health systems, resulting in labor-intensive and ad hoc reassignments. We propose a novel patient-provider matching approach to address this issue by offering patients limited provider menus. The goal is to maximize match quality across the system while preserving patient choice. We frame this as a novel variant of assortment optimization, where patient-specific provider menus are offered upfront, and patients respond in a random sequence to make their selections. This hybrid offline-online setting is understudied in previous literature and captures system dynamics across various domains. We first demonstrate that a greedy baseline policy, which offers all providers to all patients, can maximize the match rate but lead to low-quality matches. Based on this, we construct a set of policies and demonstrate that the best policy depends on problem specifics, such as a patient's willingness to match and the ratio of patients to providers. On real-world data, our proposed policy improves average match quality by 13% over a greedy solution by tailoring assortments based on patient characteristics. Our analysis reveals a tradeoff between menu size and system-wide match quality, highlighting the value of balancing patient choice with centralized planning.
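The offline-online setting described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's model: the function `simulate`, the `p_accept` parameter, and the uniform choice among still-available offered providers are all hypothetical, as is the 2x2 quality matrix.

```python
import random

def simulate(menus, quality, p_accept=1.0, seed=0):
    """Toy model: menus are fixed upfront, patients respond in random order.

    menus[i]      -- provider ids offered to patient i (the offline decision)
    quality[i][j] -- assumed match quality of patient i with provider j
    p_accept      -- assumed probability that a patient chooses to match at all
    """
    rng = random.Random(seed)
    order = list(range(len(menus)))
    rng.shuffle(order)  # patients respond in a random sequence
    taken = set()
    matches = {}
    for patient in order:
        available = [j for j in menus[patient] if j not in taken]
        if available and rng.random() < p_accept:
            # Assumed choice model: uniform over the offers still available.
            provider = rng.choice(available)
            taken.add(provider)
            matches[patient] = provider
    match_rate = len(matches) / len(menus)
    total_quality = sum(quality[i][j] for i, j in matches.items())
    return match_rate, total_quality

# Hypothetical instance: each patient has one good and one poor provider.
quality = [[1.0, 0.1], [0.1, 1.0]]
offer_all = [[0, 1], [0, 1]]   # greedy baseline: every provider on every menu
tailored = [[0], [1]]          # restricted menus built from the quality matrix

rate_all, q_all = simulate(offer_all, quality)
rate_tailored, q_tailored = simulate(tailored, quality)
```

In this toy instance both menu designs match every patient, but the offer-all menus can end with a total quality of 0.2 depending on the random draws, whereas the tailored menus always reach 2.0.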

Assortment Optimization for Patient-Provider Matching

NP-Hard scheduling problems are increasingly prevalent in our daily lives. Their inherent difficulty and growing scale pose a real challenge to both the Artificial Intelligence and Operational Research communities, which continuously seek efficient approaches to solve these theoretical problems and their real-life variants. Existing approaches are either exact or heuristic. Exact algorithms, which generally tackle small, academic-type scheduling problems, are guaranteed to identify an optimal solution if one exists in the state space. Algorithms belonging to the exact class are either time-and-space limited or time-limited. Time-and-space limited algorithms, such as Branch-and-Bound algorithms, have exponential time and space complexities. That is, such algorithms can exhaust the available memory for a given size of the input problem instance, which makes them unsuitable for solving very large problem instances. Time-limited algorithms, such as Depth-First Search, have polynomial space complexity and exponential time complexity. While such algorithms are unlikely to exhaust memory, their runtime may be very long, which also restricts their suitability for large-scale problem instances.

F heuristic search to solve scheduling with deadlines problems

Mixed Binary Quadratic Programs (MBQPs) are an important and complex set of problems in combinatorial optimization. As solving large-scale combinatorial optimization problems is challenging, primal heuristics have been developed to quickly identify high-quality solutions within a short amount of time. Recently, a growing body of research has also used machine learning to accelerate solution methods for challenging combinatorial optimization problems. Despite the increasing popularity of these ML-guided methods, a large body of work has focused on Mixed-Integer Linear Programs (MILPs). MBQPs are challenging to solve due to the combinatorial complexity coupled with nonlinearities. This work proposes ML-guided primal heuristics for Mixed Binary Quadratic Programs (MBQPs) by adapting and extending existing work on ML-guided MILP solution prediction to MBQPs. We introduce a new neural network architecture for MBQP solution prediction and a new training data collection procedure. Moreover, we extend existing loss functions in solution prediction and propose to combine contrastive weighted cross-entropy losses. We evaluate the methods on standard and real-world MBQP benchmarks and show that the developed ML-guided methods significantly outperform existing primal heuristics and state-of-the-art solvers. Furthermore, models trained with our proposed extension with combined losses outperform other ML-based methods adapted from MILPs and improve generalization in cross-regional inference on a real-world wind farm layout optimization problem.

ML-Guided Primal Heuristics for Mixed Binary Quadratic Programs

To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.

Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk

Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, when preference data is aggregated from diverse populations, it remains unclear whether the resulting aligned models serve all demographic groups equitably, or support long-term behavior change and mental health and wellness needs in a balanced way. We investigate this question through a controlled experiment using Direct Preference Optimization (DPO), training on preferences collected from our novel synthetic dataset, the 10th Village, comprising 5,000 synthetic villagers with demographics and personality traits modeled on U.S. Census data and validated psychological instruments. Each villager provided preferences across everyday stressors in financial/employment and social/relationship domains. We introduce an alignment fairness evaluation framework that treats RLHF as a behavior-aware recommendation problem, measuring how well the aligned model matches individual villager preferences compared to the base model and analyzing disparities across demographic subgroups. Our results reveal two critical sources of inequality: First, social and relationship problems receive substantially less benefit from alignment than financial concerns (p < .001), despite already generating higher baseline dissatisfaction. Second, more educated villagers gain disproportionate benefit from alignment (p < .001), particularly for social problems, creating a compounding advantage. These findings suggest that standard RLHF practices may systematically disadvantage certain problem domains and demographic groups, highlighting the need for fairness-aware approaches to preference aggregation and model alignment. 
Our contributions include both the 10th Village dataset and a reusable evaluation protocol for controlled, behavior-aware fairness research, as well as empirical evidence of disparate impact in preference-based alignment to guide the design of more equitable, wellness-oriented RLHF systems.

Not All Stress Is Treated Equal: Fairness Gaps in AI Support for Everyday Problems

As AI systems increasingly mediate human interactions, most alignment frameworks erroneously assume static human preferences. We introduce Socially-Aware Continual Learning (SCL), a framework that maintains ethical alignment with dynamically evolving norms through norm embeddings and Social Elastic Weight Consolidation (SEWC)—a novel algorithm that adapts regularization strength based on measured norm drift. Extensive experiments on longitudinal datasets demonstrate SCL’s superior balance between Alignment Stability (F1 = 0.84) and Normative Plasticity (F1 = 0.87), significantly outperforming state-of-the-art baselines (p < 0.001). Our contributions include: (1) validated drift detection metrics (BDI, DCS) achieving 0.89 F1-score, (2) human evaluations showing 82% trust recovery after norm shifts, (3) formalized fairness-utility trade-offs with 4% versus 20% disparity for baselines, and (4) societal-scale simulation showing 36% polarization reduction. SCL provides both theoretical stability guarantees and practical tools for developing socially responsive AI.
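For orientation, the standard Elastic Weight Consolidation objective adds a quadratic penalty, weighted by the Fisher information, to the loss on new data. The abstract does not give the SEWC formula, so the drift-dependent strength below is purely a hypothetical illustration: the symbols d (measured norm drift), lambda_0, and alpha are assumptions, and the actual rule in the paper may differ.

```latex
% Standard EWC penalty with a drift-dependent strength \lambda(d).
% \theta^{*}: parameters after the previous alignment stage
% F_i: diagonal Fisher information for parameter i
\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta)
  + \frac{\lambda(d)}{2} \sum_i F_i \left(\theta_i - \theta_i^{*}\right)^2,
\qquad
% Hypothetical schedule: larger drift d weakens the pull toward \theta^{*}.
\lambda(d) = \frac{\lambda_0}{1 + \alpha d}
```

With this assumed schedule, d = 0 recovers ordinary EWC with strength lambda_0, and larger drift gives the model more freedom to move away from the previously aligned parameters.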

Socially-Aware Continual Learning: Modeling Dynamic Alignment with Evolving Human Norms

Cognitive behavioural therapy (CBT) and exposure therapy (ET) are among the most effective interventions for anxiety and related mental health conditions, yet their traditional delivery remains limited due to issues of scalability, personalisation, evaluation accuracy and constructive user engagement. This position paper proposes a framework for affect-intelligent virtual reality (AIVR), an AI-driven and ethically grounded approach that integrates immersive technology, physiological sensing and automated therapy adaptation to address the core limitations of conventional CBT. By leveraging real-time affective data and behaviour-aware modelling, AIVR systems can dynamically tailor exposure difficulty, provide personalised feedback and support, and offer interpretable feedback to both users and clinicians. The framework outlines how adaptive AI can (1) extend therapy beyond traditionally inefficient conversation-based formats, (2) streamline individualisation through continuous learning, (3) ensure safe and transparent evaluation, and (4) promote long-term behavioural resilience. Emphasising ethical design, transparency and human oversight, AIVR reimagines AI not as a replacement for therapists, but as a collaborative partner in mental health care. This paper calls for interdisciplinary collaboration across AI, human-computer interaction, psychology and ethics to realise trustworthy, behaviour-aware VR interventions that align therapeutic innovation with human values.

Affect-Intelligent Virtual Reality (AIVR): Overcoming the Core Challenges of Cognitive Behavioural Therapy

In this tutorial, we chart a practical path from raw capability to trustworthy reasoning with foundation models. We begin by motivating why trustworthy reasoning is essential: when models bluff multiplications or invent drug interactions, their value collapses and risks increase. We adopt four pillars of trustworthiness, i.e., capability, safety, robustness, and explainability, as the organizing framework for the entire session.

In Part I, we trace the evolution from early language models to today’s foundation models that produce extended chains of thought and act in the world. Through concrete case studies, we dissect jailbreaks, hallucinations, and brittle logic, and we connect these failure modes to regulatory pressure such as the EU AI Act. The takeaway is clear: we must design for trustworthy reasoning from the outset, especially in high-stakes domains such as clinical or financial decision-making.

In Part II, we move from leaderboards to a science of measurement. We show how to build reliable, valid evaluations using psychometric tools, including item response theory, amortized evaluation, and predictability analysis. We implement three open-source pipelines hands-on: TruthfulQA for hallucination detection, HellaSwag for robustness testing, and MATH with formal-verification hooks in Lean4. Along the way, we demonstrate red-teaming stress tests and reasoning-trace metrics that surface subtle errors leaderboards miss, and we practice calibration, dataset curation, and transparent reporting for honest progress tracking.

In Part III, we deliver a compact methodology for trustworthy machine reasoning. We cover training-free prompting methods (chain-of-thought, retrieval-augmented generation, constrained decoding), post-training algorithms (supervised fine-tuning, RLHF, verifiable rewards, self-reward), and test-time techniques (self-consistency, reflection, tree search, tool-augmented verification). We introduce guardrails—safe sampling and semantic filters—that reduce risk without crippling capability. For each technique, we map effects to the four pillars, highlight trade-offs and failure signatures, and summarize when to combine methods for maximum leverage.
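Of the test-time techniques listed in Part III, self-consistency is the simplest to sketch: sample several answers and keep the most frequent one. The snippet below is a minimal illustration, not the tutorial's material; `sample_answer` and `fake_sampler` are hypothetical stand-ins for a call to a model with stochastic decoding.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple

def self_consistency(
    sample_answer: Callable[[int], Hashable], n_samples: int = 5
) -> Tuple[Hashable, float]:
    """Draw n_samples final answers and return the majority answer and its vote share."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(i) for i in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

def fake_sampler(i: int) -> str:
    """Hypothetical stand-in for a model call; returns canned final answers."""
    canned = ["42", "41", "42", "42", "40"]
    return canned[i % len(canned)]

answer, share = self_consistency(fake_sampler, n_samples=5)
print(answer, share)
```

The vote share can serve as a rough confidence signal: a low share suggests the sampled reasoning paths disagree and the answer may deserve further verification.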

In Part IV, we turn to deployment. We walk through real-world agents and workflows, e.g., Lean4-based code verification assistants and bioinformatics pipelines proposing candidate compounds. We share step-by-step recipes, failure checklists, and diagnostics so participants can preserve trust while shipping. We also outline governance artifacts—risk registers, evaluation cards, and incident playbooks—that align technical practice with policy expectations.

We emphasize open, reproducible assets and decision rubrics that translate research into dependable products. Our goal is simple: help you move from compelling demos to trustworthy systems that earn and deserve user trust.

All materials will be available at: https://trustworthy-machine-reasoning.github.io/

Trustworthy Machine Reasoning with Foundation Models

Neuroevolution, or optimization of neural networks through evolutionary computation, is a method for constructing intelligent agents through population-based search. It is particularly useful in partially observable domains with sparse and multiobjective reinforcement; compared to other policy search techniques, its power comes from extensive exploration that allows it to find effective, often surprising solutions. Prime application domains include robotic control, game-playing agents, and decision-making. More recently, it has also been extended to optimizing deep-learning architectures, understanding how biological intelligence evolved, and optimizing neural networks for hardware implementation. Several synergies have also emerged with reinforcement learning and large language models. This tutorial introduces participants to the basics of neuroevolution, progresses to several advanced topics that make neuroevolution effective and general, reviews example application areas, and proposes further research questions. An optional hands-on exercise makes these concepts concrete and allows the participants to take advantage of neuroevolution immediately. For more details, see https://www.cs.utexas.edu/~risto/talks/aaai26-tutorial/.
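The basic idea of population-based search over network weights can be sketched in a few lines. This is a generic illustration and not the tutorial's hands-on exercise: the fixed 2-2-1 network, the XOR task, the elitist truncation selection, and all hyperparameters are assumptions chosen for brevity.

```python
import math
import random

# Toy task (assumed for illustration): fit XOR with a fixed 2-2-1 network.
XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
N_WEIGHTS = 9  # 4 hidden weights + 2 hidden biases + 2 output weights + 1 output bias

def forward(w, x):
    """Evaluate the network: two tanh hidden units, one sigmoid output."""
    h0 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
    h1 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
    z = w[6] * h0 + w[7] * h1 + w[8]
    return 0.5 * (1.0 + math.tanh(z / 2.0))  # numerically stable sigmoid

def fitness(w):
    """Negative squared error over the four XOR cases (higher is better)."""
    return -sum((forward(w, x) - y) ** 2 for x, y in XOR)

def evolve(pop_size=50, generations=200, sigma=0.5, elite=5, seed=0):
    """Elitist evolutionary loop: keep the best genomes, refill with mutated copies."""
    rng = random.Random(seed)
    pop = [[rng.gauss(0.0, 1.0) for _ in range(N_WEIGHTS)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[:elite]  # elites survive unchanged
        children = [
            [gene + rng.gauss(0.0, sigma) for gene in rng.choice(parents)]
            for _ in range(pop_size - elite)
        ]
        pop = parents + children
    pop.sort(key=fitness, reverse=True)
    return pop[0], history

best, history = evolve(generations=30)
print(f"best fitness after {len(history)} generations: {fitness(best):.4f}")
```

Because the elites are copied unchanged, the best fitness never decreases from one generation to the next; the methods covered in the tutorial go well beyond this fixed-topology, mutation-only loop.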
