Singapore

Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 177 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3× improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.

AAAI 2026

Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, a benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
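The two evaluation tasks in the RobustSora abstract above can be illustrated with a minimal sketch. The abstract does not specify which performance metric is used, so this example assumes a simple "share of videos flagged as AI-generated" rate; the function names `rate_flagged` and `watermark_dependency` are hypothetical and not taken from the benchmark.

```python
from typing import Dict, List


def rate_flagged(preds: List[bool]) -> float:
    """Share of videos a detector flags as AI-generated (True = flagged)."""
    return sum(preds) / len(preds) if preds else 0.0


def watermark_dependency(
    gen_watermarked: List[bool],    # detector outputs on G-W videos
    gen_dewatermarked: List[bool],  # detector outputs on G-DeW videos
    auth_clean: List[bool],         # detector outputs on A-C videos
    auth_spoofed: List[bool],       # detector outputs on A-S videos
) -> Dict[str, float]:
    """Assumed metrics, in percentage points (pp).

    Task-I style: how much the detection rate on generated videos falls
    once the watermark is removed.
    Task-II style: how much the false alarm rate on authentic videos rises
    once a fake watermark is added.
    """
    detection_drop = rate_flagged(gen_watermarked) - rate_flagged(gen_dewatermarked)
    false_alarm_rise = rate_flagged(auth_spoofed) - rate_flagged(auth_clean)
    return {
        "task1_detection_drop_pp": 100.0 * detection_drop,
        "task2_false_alarm_rise_pp": 100.0 * false_alarm_rise,
    }
```

Under these assumptions, a detector that relies on watermarks would show a positive value for both numbers, while a watermark-independent detector would stay near zero on both.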

Large Language Models (LLMs) have shown advanced capabilities in tasks like counterfactual generation and style transfer using prompt strategies. However, previous strategies lacked detailed instructions, limiting effectiveness. To address this, we introduce Compare&Generate, an algorithm inspired by human comparison, where minimal instructions lead to substantial learning. Specifically, our method incorporates an objective function that quantitatively assesses alignment with the task goal and the content relevance in the output. Then, it constructs comparison pairs based on previous generation assessments and prompts the model to reconsider how to optimize its output. Through comparison, the model focuses on the critical aspects of the task objective and refines its outputs accordingly. We benchmark our method with single-instruction as well as iterative refinement approaches across three natural language generation tasks. Experimental results show that our approach outperforms other related methods; for instance, it surpasses its single-instruction base by 17% and a state-of-the-art refinement approach by 7% on IMDB datasets in generated label accuracy, highlighting the effectiveness of using comparisons in prompts to enhance LLMs.

Improving Synthetic Data Generation with LLMs through Strategic Comparisons

Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k (e.g., Hit@1), are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.

KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models
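To make the idea of a distance-from-average meta-metric concrete, here is a minimal sketch of the classic EDAS procedure from multi-criteria decision making, applied to rows of model results. It is an assumption that KG-EDAS follows this textbook form; the abstract does not give the exact formulation, and the function name `edas_scores`, the weights, and the example values are hypothetical.

```python
from typing import List


def edas_scores(
    matrix: List[List[float]],  # rows = models, columns = criteria
    beneficial: List[bool],     # True if higher is better (MRR, Hit@k); False for MR
    weights: List[float],       # one weight per criterion, assumed to sum to 1
) -> List[float]:
    """Classic EDAS appraisal score in [0, 1] for each row of the matrix."""
    n_alt, n_crit = len(matrix), len(beneficial)
    # Average solution: the mean of every criterion across all models.
    avg = [sum(row[j] for row in matrix) / n_alt for j in range(n_crit)]

    sp = [0.0] * n_alt  # weighted positive distance from average
    sn = [0.0] * n_alt  # weighted negative distance from average
    for i, row in enumerate(matrix):
        for j in range(n_crit):
            # Orient the difference so that positive always means "better".
            diff = row[j] - avg[j] if beneficial[j] else avg[j] - row[j]
            sp[i] += weights[j] * max(0.0, diff) / avg[j]
            sn[i] += weights[j] * max(0.0, -diff) / avg[j]

    max_sp, max_sn = max(sp), max(sn)
    scores = []
    for i in range(n_alt):
        nsp = sp[i] / max_sp if max_sp > 0 else 0.0
        nsn = 1.0 - (sn[i] / max_sn if max_sn > 0 else 0.0)
        scores.append((nsp + nsn) / 2.0)
    return scores


# Hypothetical results for three models: [MRR, Hit@1, MR].
results = [
    [0.35, 0.25, 200.0],
    [0.33, 0.27, 180.0],
    [0.30, 0.20, 300.0],
]
scores = edas_scores(results, beneficial=[True, True, False], weights=[1 / 3] * 3)
```

In this toy example the first model has the best MRR and the second has the best Hit@1, which is the kind of conflicting ranking the abstract describes; the single score resolves it by weighing how far each model sits above or below the average on every criterion.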

Dialogue State Tracking (DST) requires precise extraction of structured information from multi-domain conversations, a task where Large Language Models (LLMs) struggle despite their impressive general capabilities. We present GEM (Graph-Enhanced Mixture-of-Experts), a novel framework that combines language models and graph-structured dialogue understanding with ReAct agent-based reasoning for superior DST performance. Our approach dynamically routes between specialized experts: a Graph Neural Network that captures dialogue structure and turn-level dependencies, and a finetuned T5-Small encoder-decoder for sequence modeling, coordinated by an intelligent gating network. For complex value generation tasks, we integrate ReAct agents that perform structured reasoning over dialogue context. On MultiWOZ 2.2, GEM achieves 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing state-of-the-art (SOTA) methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%). Our graph-enhanced mixture-of-experts architecture with ReAct integration demonstrates that combining structured dialogue representation with dynamic expert routing and agent-based reasoning provides a powerful paradigm for dialogue state tracking, achieving superior accuracy while maintaining computational efficiency through selective expert activation.

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
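The GEM abstract above reports Joint Goal Accuracy (JGA), which is commonly defined as the fraction of dialogue turns where the entire predicted state matches the gold state exactly. The sketch below assumes that common definition and a flat slot-to-value dictionary per turn; the `State` alias and the example slot names are illustrative, not taken from the paper.

```python
from typing import Dict, List

# Assumed representation: slot name -> value for one dialogue turn.
State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A turn counts only if every slot and value matches; one wrong slot fails it.
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot value
]
jga = joint_goal_accuracy(predicted_states, gold_states)
```

Because a single incorrect slot invalidates the whole turn, JGA is a strict metric, which helps explain why scores in the 55-65% range are competitive on multi-domain data.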

With the growth of data in day-to-day life, businesses and other stakeholders need to analyze that data to make better predictions. Traditionally, relational data has been a source of various insights, but increased computational power and the need to understand deeper relationships between entities have created a demand for new techniques. For this purpose, graph data analysis has become a powerful tool for understanding data, offering more realistic and flexible modelling of complex relationships. Recently, Graph Neural Networks (GNNs) have shown great promise in various applications, such as social network analysis, recommendation systems, drug discovery, and more. However, the data is vulnerable to adversarial attacks, whether during training (poisoning attacks) or during testing (evasion attacks), which can manipulate the output of the GNN model. Therefore, it is crucial to make GNNs robust to such attacks. Existing robustness methods are computationally demanding and perform poorly as the intensity of the attack increases. This paper presents a computationally efficient framework, namely pLAPGNN, based on the weighted p-Laplacian for making GNNs robust. Empirical evaluation on real datasets establishes the efficacy and efficiency of the proposed method.

Enhancing Robustness of Graph Neural Networks through p-Laplacian

Many real-world systems, from neural circuits to economic networks, exhibit feedback loops that are best represented as directed cyclic graphs (DCGs). Yet most scalable causal discovery methods either impose hard acyclicity or rely on global backpropagation, making them unsuitable for feedback-rich settings. We propose PreCyc, a predictive coding framework for causal structure learning that combines node-wise energy minimisation with a soft acyclicity surrogate and sparsity regularisation. The algorithm alternates local state inference and weight updates, avoiding reverse-mode differentiation while remaining scalable to larger graphs. Our analysis shows convergence to a stationary point under standard smoothness assumptions, and we clarify the distinction between local error signals for data fit and the global nature of acyclicity enforcement. Experiments on synthetic Erdős–Rényi, Watts–Strogatz, and scale-free graphs, as well as the 279-node C. elegans connectome, demonstrate competitive performance in both structure recovery and cycle identification compared with state-of-the-art cyclic causal discovery methods. While the current implementation focuses on linear structural equation models with observational equilibrium data, PreCyc establishes predictive coding as a principled and scalable foundation for causal discovery in feedback-rich systems.

Predictive Coding Causal Discovery for Directed Cyclic Graphs

Prior work on node classification shows that Graph Neural Networks (GNNs) can learn transferable representations of graph properties when those properties are consistent across graphs. For a fixed graph, one would then expect GNNs trained for link prediction to learn a representation consistent with that learnt for node classification. We show this intuition does not hold in the general case. Instead, we find that popular link prediction models can learn a trivial mini-batch-dependent heuristic, enabled by batch normalisation layers, to solve the edge classification task. When correcting for this, we observe increased alignment of the network representation with node-class-relevant features, suggesting the network has learnt a graph representation that better aligns with the underlying graph's properties. Our findings suggest that standard link prediction training may be leading us to overestimate link predictors' ability to learn a generalised representation of a graph that is consistent across tasks.

Mini-Batch Class Composition Bias in Link Prediction

Knowledge graph completion aims to infer unknown information in a knowledge graph that is incomplete, due to noisy or missing data. Geographic knowledge graphs, which are typically derived from crowd-sourced data, are often incomplete, making geographic knowledge graph completion an important problem. Most current methods for knowledge graph completion are generic, and do not account for the spatial nature of geographic knowledge graphs. The few methods that are tailored to geographic knowledge graphs are computationally expensive or are designed for a closed-world setting, which is not practical in the geography domain. We study this problem by evaluating existing state-of-the-art standard and geo-specific knowledge graph completion methods on a large dataset of geographic knowledge graphs. Our findings reveal that these methods perform poorly, leaving an open problem for the AI and graphs community. To aid in further research, we suggest some possible areas of work that we believe could lead to fruitful developments for this problem.

An Experimental Analysis of Geographic Knowledge Graph Completion Methods

Chain-of-thought (CoT) prompting enables Large Language Models to solve complex problems, but deploying these models safely requires reliable confidence estimates, a capability where existing methods suffer from poor calibration and severe overconfidence on incorrect predictions. We propose Enhanced Dirichlet+Topology Risk (EDTR), a novel decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to measure LLM confidence across multiple reasoning paths. EDTR treats each CoT as a vector in high-dimensional space and extracts eight topological risk features capturing the geometric structure of reasoning distributions: tighter, more coherent clusters indicate higher confidence while dispersed, inconsistent paths signal uncertainty. We evaluate EDTR against three state-of-the-art calibration methods across four diverse reasoning benchmarks spanning olympiad-level mathematics (AIME), grade school math (GSM8K), commonsense reasoning, and stock price prediction. EDTR achieves 41% better calibration than competing methods with an average ECE of 0.287 and the best overall composite score of 0.672, while notably achieving perfect accuracy on AIME and exceptional calibration on GSM8K with an ECE of 0.107, both domains where baselines exhibit severe overconfidence. Our work provides a geometric framework for understanding and quantifying uncertainty in multi-step LLM reasoning, enabling more reliable deployment where calibrated confidence estimates are essential.

Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis
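The EDTR abstract above reports calibration as Expected Calibration Error (ECE). As a reference point, the following is a minimal sketch of the standard equal-width binned ECE; the choice of 10 bins and the function name are assumptions, since the abstract does not state how ECE was computed.

```python
from typing import List, Tuple


def expected_calibration_error(
    confidences: List[float],  # model confidence per prediction, in [0, 1]
    correct: List[bool],       # whether each prediction was right
    n_bins: int = 10,          # assumed number of equal-width bins
) -> float:
    """Weighted average gap between accuracy and mean confidence per bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Four answers at 90% confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Lower is better: an ECE of 0 means stated confidence matches observed accuracy in every bin, while overconfident models accumulate a gap in the high-confidence bins.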

Explainable AI (XAI) methods aiming to probe model internals for scientific discovery ("RED XAI") must move beyond correlational saliency maps. We address this by presenting a systematic comparison of segmentation methods within a causal attribution framework. We contrast an object-aware approach using the Segment Anything Model (SAM) against a texture-aware baseline using SLIC superpixels. Both are integrated into a pipeline utilizing Grad-CAM for saliency, CLIP for concept labeling, and a causal validation step quantifying concept importance via counterfactual interventions (blur masking) measured by raw confidence drop. Evaluating on 200 ImageNet images, we uncover a critical sensitivity-reliability trade-off: SAM-based object-centric concepts show significantly higher average causal impact (81.0% mean confidence drop vs. 37.7% for SLIC), demonstrating greater sensitivity, but suffer from segmentation failures in 9.5% of cases (181/200 successes). SLIC achieves 100% reliability (200/200 successes) and lower impact variance, albeit with reduced sensitivity. This trade-off provides actionable guidance for domain scientists: SLIC's robustness is preferable for high-stakes, texture-reliant tasks (e.g., medical diagnostics), while SAM's sensitivity may benefit exploratory analysis of object-centric phenomena. Our work offers quantitative evidence of this trade-off, enabling more informed XAI method selection for reliable scientific insight.
