Singapore

When evaluating large language models (LLMs) for question
answering tasks, a common protocol is multiple-choice
question-answering (MCQA), where the model selects from a
fixed set of choices.
In contemporary robustness testing, researchers typically
perturb instructions or introduce confusion into factual
statements; however, model behavior also hinges on choice
compliance: whether models remain within the canonical set
A-D.
We formalize this setting by asking whether the model
continues to respect the interface&#39;s rules when the problem
presents a tempting alternative.
Our approach is interface-preserving: we append a single
selectable option E while keeping the question and A-D
unchanged.
Then, we introduce three types of malicious option
injection to assess LLMs&#39; robustness.
Experimental results highlight the vulnerability of LLMs on
contradict type content of the additional option E.
Our evaluation framework can effectively serve as a
low-cost audit of rule adherence on existing datasets and
black-box models, surfaces off-policy items, and supports
interpretable model comparison for deployment.

AAAI 2026

Obedience or Vigilance? How Large Language Models React to Malicious Multiple-Choice Options (Student Abstract)

When evaluating large language models (LLMs) for question
answering tasks, a common protocol is multiple-choice
question-answering (MCQA), where the model selects from a
fixed set of choices.
In contemporary robustness testing, researchers typically
perturb instructions or introduce confusion into factual
statements; however, model behavior also hinges on choice
compliance: whether models remain within the canonical set
A-D.
We formalize this setting by asking whether the model
continues to respect the interface's rules when the problem
presents a tempting alternative.
Our approach is interface-preserving: we append a single
selectable option E while keeping the question and A-D
unchanged.
Then, we introduce three types of malicious option
injection to assess LLMs' robustness.
Experimental results highlight the vulnerability of LLMs on
contradict type content of the additional option E.
Our evaluation framework can effectively serve as a
low-cost audit of rule adherence on existing datasets and
black-box models, surfaces off-policy items, and supports
interpretable model comparison for deployment.

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Retrieval-augmented generation (RAG) is the backbone of
knowledge-intensive NLP, yet its progress is hindered by a
long-standing asymmetry: Generators are refined while
retrievers remain static, and full end-to-end optimization
is prohibitively unstable. We present BPO-RAG, a bi-level
preference-learning framework that redefines the training
paradigm by jointly optimizing retrieval and generation
with a single supervision signal, pairwise preferences.
Stage~1 (Retrieval Preference Optimization) learns to
select superior evidence sets, while Stage~2 (Generation
Preference Optimization) aligns answer generation with the
same evidence, closing the gap between what to read and
what to write. This recipe without label requires no reward
model or online RL, integrates seamlessly with standard RAG
pipelines, and transforms preferences into a unifying
training currency. Across open-domain QA benchmarks,
BPO-RAG consistently advances retrieval quality and yields
more accurate, faithful answers, surpassing strong RAG
baselines with remarkable stability. By coupling retrieval
and generation under a unified preference framework,
BPO-RAG establishes a practical and principled path toward
the next generation of reliable, modular, and trustworthy
knowledge-intensive language models.

Bi-Level Preference Optimization for Retrieval-Augmented Generation (Student Abstract)

Esports is growing rapidly, yet the data available to
researchers is limited due to the game company policies.
Consequently, vision-based approaches utilizing game
screens are gaining attention as a practical alternative.
We focus on the League of Legends minimap and address the
challenges of champion detection when extracting champion
information from the minimap. The challenges in this domain
include small objects, rapid movement, and frequent
occlusions.
We propose a transfer-learning-based object detection
pipeline that combines synthetic data with a subset of replay
data. Synthetic data enables the rapid generation of
diverse scenarios and improves training scalability, while
replay data reduces the data distribution gap. This approach
achieves 0.588 mean average precision, improving over
replay-only by 0.261 and synthetic-only by 0.312, with 6.4 ms
latency. Furthermore, we constructed a dataset
encompassing all champions, enabling comparative analysis
of detection models and supporting reproducible
benchmarking for various application studies.

Synthetic-to-Real Transfer Learning for League of Legends Minimap Object Detection (Student Abstract)

We present Magnol.AI Copilot, an extension of the Magnol.AI digital biomarker platform that integrates multimodal large language models (LLMs) to transform digital health technology (DHT) trial dashboards into conversational systems. Copilot augments the platform with a multi-agent orchestration layer and vision-enabled LLMs that interpret visualizations, tabular summaries, and textual metadata. The
system enables natural language queries and automatic generation of contextual insights, allowing researchers to interact with wearable data through dialogue rather than static inspection. A case study with an actigraphy device demonstrates Copilot’s ability to identify nightly compliance gaps and provide contextual explanations, reducing cognitive load compared to manual dashboard review. This work presents
a novel integration of IoMT infrastructure with multimodal LLMs, advancing digital biomarker research toward conversational and accessible DHT trial platforms.

Magnol.AI Copilot: Multimodal LLMs for Conversational Insight Generation

Language models are powerful artifacts, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization. This demo focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the systematic analysis of LLM knowledge, as well as for automated KB construction.

GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

LLMs are increasingly being deployed as chatbots, but today’s interfaces offer little to no friction: users interact through seamless conversations that conceal when the model is drifting, hallucinating or failing. This lack of transparency fosters blind trust, even as models produce unstable or repetitive outputs. We introduce an interactive demo that surfaces and mitigates cognitive fatigue, a failure mode where LLMs gradually lose coherence during auto-regressive generation. Our system, Chatsparent, instruments real-time, token-level signals of fatigue, including attention-to-prompt decay, embedding drift, and entropy collapse, and visualizes them as a unified fatigue index. When fatigue thresholds are crossed, the interface allows users to activate lightweight interventions such as attention resets, entropy-regularized decoding, and self-reflection checkpoints. The demo streams live text and fatigue signals, allowing users to observe when fatigue arises, how it affects output quality, and how interventions restore stability. By turning passive chatbot interaction into an interactive diagnostic experience, our system empowers users to better understand LLM behavior while improving reliability at inference time. The demo video is available at https://youtu.be/ktqkZyYWDDE.

Chatsparent: An Interactive System for Detecting and Mitigating Cognitive Fatigue in LLMs

We present SafeLens, a lightweight segment-level video moderation system that fuses speech, text, and visual frames to produce hateful content detection for each segment. For every segment, SafeLens returns a structured prediction: label, prediction confidence, reasons for flag, harm categories. The structured predictions are optimized for triage, appeals, and downstream enforcement. The system is modular (pluggable speech, text, and visual processing modules back-ends and a mid-size policy Language Language Model (LLM) agent with parameter-efficient tuning). In the live demo, attendees can upload or select clips, scrub the timeline to flag hateful segments, inspect rationales, and vary the policy LLM agent to benchmark the hateful content moderation performance.

Video: https://www.youtube.com/watch?v=B1dYceLSnXA

SafeLens: Segment-Level Hate Speech Detection in Online Videos

Time series anomaly detection has received substantial attention over the past two decades, leading to the development of hundreds of algorithms. However, comprehensively understanding this vast landscape remains challenging, particularly for non-experts and novices. In this demonstration paper, we present \demonstratorname, an interactive web application that provides access to more than 30 state-of-the-art time series anomaly detection algorithms. \demonstratorname is intended to explore the performance of existing as well as custom anomaly detection models in an interactive, hands-on manner. By lowering the entry bar, we support practitioners overwhelmed by the large number of existing techniques, while providing a platform for researchers to rapidly analyze their novel anomaly detection algorithms.

InTimeAD: Interactive Time Series Anomaly Detection

We present ARGUS, an end-to-end Argument Mining (AM) tool that exploits Large Language Models (LLMs) to automatically perform all core AM tasks, i.e., Argument Component Segmentation, Classification, Relation Identification, and Relation Classification. Furthermore, ARGUS builds the corresponding argumentation framework (AF) and seamlessly integrates symbolic solvers to compute extensions and perform formal reasoning. ARGUS is designed to ensure broad flexibility and usability, supporting any open-source or commercial LLMs and symbolic solvers, providing a ready-to-use platform for exploring neuro-symbolic approaches to argumentation in both research and practical applications.

ARGUS: Towards End-to-End Argument Mining with Large Language Models

We present City of Light (COL), a Unity-based, city-scale (116 km2) simulator of Paris for high-throughput embodied-AI research. COL fuses open geographic information system (GIS) sources into geo-anchored, per-tile meshes and provides a configurable, stochastic runtime with controllable traffic and pedestrians. Agents receive frame-synchronized multi-sensor observations (RGB, depth, normals, semantics). To support high-rate vision pipelines, we introduce TURBO, a zero-copy Unity-Python bridge that streams multi-camera observations to Python and allows control at up to 1300 frames per second (FPS), achieving higher throughput than ML-Agents in our benchmark. We also provide a Street View Digital Twin that aligns simulator viewpoints with corresponding real-world panoramas for frame-accurate visual comparison and quantitative matching. COL enables fast scripting, large-scale data collection, and reinforcement-learning (RL) in geo-anchored urban settings.

City of Light (COL): A City-Scale, Geo-Anchored Urban Simulator with High-Throughput Multi-Sensor Streams

Extended reality (XR) is well suited to support the situated learning of technical procedures. At the same time, AI-driven intelligent tutoring systems (ITS) can complement XR by providing adaptive pedagogical support. Many domains would benefit from this combination, especially when trainers, equipment, or team members are limited. We present a domain-agnostic XR-based ITS that integrates a training procedure representation (TPR), XR simulation, and an LLM-driven instructor. We demonstrate the tutor's use for tissue sample handling and engine repair, showing how it delivers adaptive feedback, collaborative roleplay, and dynamic scenario management to create realistic and pedagogically meaningful training experiences.

Downloads

Next from AAAI 2026

Bi-Level Preference Optimization for Retrieval-Augmented Generation (Student Abstract)

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES