Singapore

The integration of Large Language Models (LLMs) into clinical applications presents transformative potential but is undermined by the critical risk of hallucination, the generation of plausible but factually incorrect information. Such failures pose a direct threat to patient safety and the integrity of clinical decision-making. To address this challenge, we introduce MHB, a novel and comprehensive benchmark framework designed to evaluate LLM reliability in two complex, high-stakes clinical contexts: multi-turn medical dialogues and clinical case report analysis. The core of our contribution is a systematic methodology for generating adversarial test cases by injecting ``hallucination traps&quot; into realistic medical data, guided by a fine-grained taxonomy of clinical errors.
MHB, comprising 4,695 samples and 20,288 evaluation rubrics, underwent a rigorous, two-stage validation by a panel of \textit{60 licensed physicians from top-tier hospitals}, ensuring high clinical realism and consistency. This comprehensive assessment of leading LLMs revealed significant, clinically relevant shortcomings across the board. Even the best-performing model, \texttt{Claude-4-Sonnet}, exhibited a hallucination rate of 29.1\%, with some open-source models exceeding 57.0\%. All models struggled with specific traps, like fabricated medical data or non-existent guidelines, highlighting prevalent systemic weaknesses.

AAAI 2026

MHB: Medical Hallucination Benchmark for Large Language Models in Complex Clinical Tasks

medical benchmark

large language model

hallucination

The integration of Large Language Models (LLMs) into clinical applications presents transformative potential but is undermined by the critical risk of hallucination, the generation of plausible but factually incorrect information. Such failures pose a direct threat to patient safety and the integrity of clinical decision-making. To address this challenge, we introduce MHB, a novel and comprehensive benchmark framework designed to evaluate LLM reliability in two complex, high-stakes clinical contexts: multi-turn medical dialogues and clinical case report analysis. The core of our contribution is a systematic methodology for generating adversarial test cases by injecting ``hallucination traps" into realistic medical data, guided by a fine-grained taxonomy of clinical errors.
MHB, comprising 4,695 samples and 20,288 evaluation rubrics, underwent a rigorous, two-stage validation by a panel of \textit{60 licensed physicians from top-tier hospitals}, ensuring high clinical realism and consistency. This comprehensive assessment of leading LLMs revealed significant, clinically relevant shortcomings across the board. Even the best-performing model, \texttt{Claude-4-Sonnet}, exhibited a hallucination rate of 29.1\%, with some open-source models exceeding 57.0\%. All models struggled with specific traps, like fabricated medical data or non-existent guidelines, highlighting prevalent systemic weaknesses.

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Accurate detection of offensive content on social media demands high-quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low-resource challenge, we propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling. Starting with a lightweight classifier trained on limited labeled data, our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models (MA-VLMs). Unlabeled data on which the classifier and MA-VLMs agree are designated as the Agreed-Unknown set, while conflicting samples form the Disagreed-Unknown set. To enhance label reliability, MA-VLMs simulate dual perspectives, moderator and user, capturing both regulatory and subjective viewpoints. The classifier is optimized using a novel Positive-Negative-Unlabeled (PNU) loss, which jointly exploits labeled, Agreed-Unknown, and Disagreed-Unknown data while mitigating pseudo-label noise. Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision and approaches the performance of large-scale models.

Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection

Clear visual information is indispensable for tasks such as autonomous navigation, ecological monitoring, and inspection in marine environments, yet underwater images are notoriously marred by colour casts, haze, and loss of detail. We present DRM-Net, a practical enhancement framework that turns this challenge into an explicit recovery problem. Instead of guessing the whole clean image, DRM-Net first predicts a Degradation Residual Map (DRM) that pinpoints, pixel by pixel, how much colour, contrast, and texture have been lost. Adding this residual back to the raw frame produces the restored result in a single, transparent step. A lightweight Subaquatic Multi-Scale Context Fusion module further enhances robustness by allowing the network to view the scene through multiple “water layers”, adaptively selecting the most relevant scale for each image. Guided jointly by a DRM residual L1-loss and a perceptual loss, DRM-Net delivers sharper edges and truer colours while adding only negligible computational overhead. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with state-of-the-art approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments across diverse underwater scenarios.

DRM-Net: Explicit Residual Modelling with Subaquatic Multi-Scale Context Fusion for Underwater Image Enhancement

Biodiversity is declining globally at an unprecedented rate. Managers urgently need to allocate limited resources to control pest species where interventions have the highest ecological impact. However, many species are hard to detect, and data collection is often expensive, irregular, and incomplete, thus posing significant challenges for machine learning models that traditionally require large and regular datasets. We present a novel deep learning architecture that estimates the spatiotemporal abundance of hard-to-detect species from sparse, zero-inflated, and irregular observational data. Our method combines Graph Convolutional Networks (GCNs) to model spatial dependencies across monitoring sites with Recurrent Neural Networks (RNNs) to capture long-range temporal dynamics. This architecture explicitly addresses the challenges of ecological data sparsity, heterogeneity, and irregular sampling. We apply our model to the Crown-of-Thorns Starfish (COTS) on Australia's Great Barrier Reef, a species with devastating impact on coral reefs and a major target of pest control programs. Our method significantly outperforms baseline approaches and the current resource-intensive approach, manta-tow surveillance, in both accuracy and detectability. Simulations indicate a 20\% increase in starfish removal efficiency over a year, enabling more effective coral protection. This work demonstrates how tailored deep learning methods can overcome ecological data limitations and substantially improve conservation outcomes. The code is available at \url{https://github.com/XXX}.

Leveraging Sparse Observations to Predict Species Abundance Across Space and Time

Augmented Reality (AR) and Multimodal Large Language
Models (LLMs) are rapidly evolving, providing unprecedented
capabilities for human-computer interaction. However,
their integration introduces a new attack surface for social
engineering. In this paper, we systematically investigate
the feasibility of orchestrating AR-driven Social Engineering
attacks using Multimodal LLM for the first time, via our proposed
SEAR framework, which operates through three key
phases: (1) AR-based social context synthesis, which fuses
Multimodal inputs (visual, auditory and environmental cues);
(2) role-based Multimodal RAG (Retrieval-Augmented Generation),
which dynamically retrieves and integrates contextual
data while preserving character differentiation; and (3)
ReInteract social engineering agents, which execute adaptive
multiphase attack strategies through inference interaction
loops. To verify SEAR, we conducted an IRB-approved
study with 60 participants in three experimental configurations
(unassisted, AR+LLM, and full SEAR pipeline) compiling
a new dataset of 180 annotated conversations in simulated
social scenarios. Our results show that SEAR is highly
effective at eliciting high-risk behaviors (e.g., 93.3% of participants
susceptible to email phishing). The framework was
particularly effective in building trust, with 85% of targets
willing to accept an attacker’s call after an interaction. Also,
we identified notable limitations such as “occasionally artificial”
due to perceived authenticity gaps. This work provides
proof-of-concept for AR-LLM driven social engineering
attacks and insights for developing defensive countermeasures
against next-generation augmented reality threats. The
SEAR code and dataset is available at: https://github.com/
2192537130/searsystem/tree/master.

On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks

Digital Twins (DTs) offer powerful tools for managing complex infrastructure systems, but their effectiveness is often limited by challenges in integrating unstructured knowledge. Recent advances in Large Language Models (LLMs) bring new potential to address this gap, with strong abilities in extracting and organizing diverse textual information. We therefore propose LSDTs (LLM-Augmented Semantic Digital Twins), a framework that helps LLMs extract planning knowledge from unstructured documents like environmental regulations and technical guidelines, and organize it into a formal ontology. This ontology forms a semantic layer that powers a digital twin—a virtual model of the physical system—allowing it to simulate realistic, regulation-aware planning scenarios. We evaluate LSDTs through a case study of offshore wind farm planning in Maryland, including its application during Hurricane Sandy. Results demonstrate that LSDTs support interpretable, regulation-aware layout optimization, enable high-fidelity simulation, and enhance adaptability in infrastructure planning. This work shows the potential of combining generative AI with digital twins to support complex, knowledge-driven planning tasks.

LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce **CARE-Bench**, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models. We intend to make the CARE-Bench suite publicly available upon acceptance of this paper.

CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity and co-movement – semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real world scenarios.

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Accelerating research in renewable energy policy is critical for addressing climate change and enabling informed decision-making. Question answering (QA) over public policy documents presents unique challenges due to their legal structure, conditional dependencies, and domain-specific vocabulary. In this paper, we introduce EvalQAG, a framework for generating high-quality QA pairs from renewable energy policy texts. EvalQAG combines structured prompts, retrieval-augmented inputs, and multi-stage evaluation using large language models (LLMs) to support accurate and diverse QA generation. Using this framework, we construct REPolicyQA, a domain-specific QA dataset comprising around 160,000 QA pairs from more than 1,000 U.S. renewable energy policy documents. The dataset covers five policy-relevant question types—Yes/No, Yes/No with Conditions, Factual, Legal Obligation, and Descriptive—capturing a wide range of reasoning patterns grounded in regulatory texts. We evaluate multiple QA models and uncover significant performance gaps, especially in legal reasoning and conditional inference—highlighting major shortcomings in current systems. Our results establish EvalQAG as a generalizable QA generation pipeline for policy texts and position REPolicyQA as a new benchmark for advancing QA research in policy and regulatory domains. We believe this work can foster impactful research in the renewable energy sector, particularly by enabling more robust and explainable QA systems for legal and condition-heavy regulatory documents.

EvalQAG: A Framework for Automatic Complex QA Generation and a Benchmark QA Dataset for Policy Documents

Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8\% accuracy on narrative-style questions and perform below 15.5\% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini’s accuracy by +13.6\%, but reduces GPT-series performance by -11.4\%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning

Municipal inspections are an important part of maintaining the quality of goods and services. In this paper, we approach the problem of intelligently scheduling service inspections to maximize their impact, using the case of food establishment inspections in Chicago as a case study. The Chicago Department of Public Health (CDPH) inspects thousands of establishments each year, with a substantial fail rate (over 3,000 failed inspection reports in 2023). 
To balance the objectives of ensuring adherence to guidelines, minimizing disruption to establishments, and minimizing inspection costs, CDPH assigns each establishment an inspection window every year and guarantees that they will be inspected exactly once during that window. Meanwhile, CDPH also promises surprise public health inspections for unexpected food safety emergencies or complaints.
These constraints create a challenge for a restless multi-armed bandit (RMAB) approach, for which there are no existing methods.
We develop an extension to Whittle index-based systems for RMABs that can guarantee action window constraints and frequencies, and furthermore can be leveraged to optimize action window assignments themselves. Briefly, we combine MDP reformulation and integer programming-based lookahead to maximize the impact of inspections subject to constraints. A neural network-based supervised learning model is developed to model state transitions of real Chicago establishments using public CDPH inspection records, which demonstrates 10\% AUC improvements compared with directly predicting establishments' failures. Our experiments not only show up to 24\% (in simulation) or 33\% (on real data) objective improvements resulting from our approach and robustness to surprise inspections, but also give insight into the impact of scheduling constraints.

Content not yet available

Next from AAAI 2026

Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES