Scalable oversight protocols aim to empower evaluators to verify the output of AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. In a reanalysis of prior work that appeared to demonstrate the efficacy of simple protocols, we find that human evaluators possessing knowledge absent from models likely contributed to the positive results, an advantage that diminishes as models continue to scale in capability. We also report the results of two experiments examining the performance of simple oversight protocols where evaluators know that the model is "correct most of the time, but not all of the time", finding no overall advantage for the tested protocols. In our main experiment, participants in both groups became more confident in the system's answers after conducting online research, even when those answers were incorrect. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform a strategy of simple deference to the model being evaluated, and whether their performance scales with increasing problem difficulty and model capability.
