Singapore

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then derives a set of system-level error issues, and finally quantifies the prevalence of each identified issue. The package also provides an interactive dashboard that supports comprehensive error analysis through aggregate visualizations, interactive filters that isolate specific issues or score ranges, and drill-down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis on RAG and Math benchmarks, and showcase its utility through a user case study.
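The last step of the pipeline described above, quantifying how prevalent each issue is, can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the `issue_prevalence` function, the input format, and the sample issue labels are hypothetical and are not taken from the CLEAR package.

```python
from collections import Counter


def issue_prevalence(instance_issues):
    """Return the fraction of instances that exhibit each issue.

    `instance_issues` holds one entry per evaluated instance; each entry is
    the collection of issue labels assigned to that instance (a hypothetical
    input format, assumed for illustration).
    """
    total = len(instance_issues)
    if total == 0:
        return {}
    counts = Counter()
    for issues in instance_issues:
        # Converting to a set counts each issue at most once per instance.
        counts.update(set(issues))
    return {issue: count / total for issue, count in counts.items()}


# Hypothetical judge output for four instances.
example = [
    {"unsupported claim"},
    {"unsupported claim", "arithmetic slip"},
    set(),
    {"arithmetic slip"},
]
prevalence = issue_prevalence(example)
```

Here each of the two hypothetical issues appears in two of four instances, so both have a prevalence of 0.5. Instances with no issues still count toward the denominator.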

AAAI 2026

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

llm as a judge

key point analysis

error analysis


demo

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


The floods caused by the Isolated High-Level Depression (DANA) in the Valencian Community in October 2024 destroyed and damaged hundreds of thousands of personal photographs, erasing key pieces of collective and emotional memory. Within the project Recuperar las Memorias, we present an AI-based system for automated photo reconstruction, designed to support the recovery of more than 200,000 affected images. The system integrates YOLOv8 and SAM2 for automatic detection of damaged regions, followed by context-aware inpainting to restore visual coherence. Special modules are included for facial restoration, preserving identity in one of the most emotionally critical aspects of personal photographs. The tool is deployed as a web application that enables both single-image and batch restoration, making it accessible to non-expert users. Preliminary evaluation, combining human perceptual studies and automatic metrics (LPIPS), shows consistent alignment between subjective and objective assessments of quality. This demonstration highlights how advances in computer vision can be mobilised in real-world crisis contexts, placing AI at the service of cultural heritage, dignity, and memory preservation.

AI for Memory Preservation: Automated Restoration of Photographs Damaged by Floods

Metalenses have been widely recognized as a key building block of next-generation optical systems, offering unprecedented advantages in compactness, lightweight design, and scalable manufacturing compared to traditional refractive optics. Despite this promise, practical use is limited by optical aberrations, blur, and illumination sensitivity, which degrade both visual quality and machine perception. In this demonstration, we present an end-to-end metalens vision system—from hardware sensing with a custom-built RGB metalens camera, to physics-informed imaging and real-time restoration, and finally to downstream vision applications such as object detection and depth estimation. By integrating spatially-aware attention enhancement and reinforcement learning-based illumination control into a real-time system, our solution transforms degraded raw captures into high-fidelity images that are both visually interpretable and functionally reliable for machine vision. This AI-powered pipeline highlights metalenses as a cornerstone for next-generation imaging, where advances in optics and machine intelligence jointly drive the future of visual perception.

Next-Generation Metalens Vision System: Powered by AI and Applied to AI

We present AgentSeer, an interactive observability framework for agentic AI systems. Unlike conventional tracing tools that expose raw spans or model-centric metrics, AgentSeer introduces a dual graph decomposition constructed through a deterministic rule-based parser: a temporal action graph, where each prompt or tool invocation is represented as a distinct action, and a component graph capturing architectural relations among agents, tools, and memory modules. Beyond visualization, AgentSeer enables action-level red teaming, where jailbreak payloads are systematically attached to every action node (including agent messages, tool calls, and memory retrievals) to uncover vulnerabilities invisible to model-level testing. Our demonstration features a six-agent hierarchical testbed with interactive visualization and deployment-oriented safety evaluation applied directly on the same prompts and contexts, systematically revealing high-risk interactions, context-dependent vulnerabilities, and emergent behaviors. By combining structured decomposition, automated red teaming, and rule-based reliability, AgentSeer establishes a safety-first methodology for observability in multi-agent AI.

AgentSeer: Visualizing and Evaluating Temporal Actions in Agentic AI Systems
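The dual graph decomposition in the AgentSeer abstract above can be illustrated with a small Python sketch. It assumes a simplified trace format (the `source`, `kind`, and `target` keys are hypothetical) and links actions in a plain linear order; the actual rule-based parser is not described in the abstract and may differ.

```python
def build_graphs(trace):
    """Build (action_edges, component_edges) from an ordered trace.

    Each trace entry is a dict with hypothetical keys: "source" (the
    component that acts), "kind" (for example "tool_call"), and "target"
    (the component acted upon).
    """
    # Temporal action graph: one node per action, linked in execution order.
    action_edges = [(i, i + 1) for i in range(len(trace) - 1)]
    # Component graph: deduplicated links between architectural components.
    component_edges = sorted({(e["source"], e["target"]) for e in trace})
    return action_edges, component_edges


# Hypothetical four-action trace.
trace = [
    {"source": "planner", "kind": "agent_message", "target": "researcher"},
    {"source": "researcher", "kind": "tool_call", "target": "web_search"},
    {"source": "researcher", "kind": "memory_retrieval", "target": "memory"},
    {"source": "researcher", "kind": "tool_call", "target": "web_search"},
]
action_edges, component_edges = build_graphs(trace)
```

The two tool calls to `web_search` stay distinct nodes in the action graph but collapse into a single edge in the component graph, which is the distinction between the two views.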

We develop KnowThyself, an agentic assistant that advances large language model (LLM) interpretability. Existing tools provide useful insights but remain fragmented and code-intensive. KnowThyself consolidates these capabilities into a chat-based interface, where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router further directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the whole process into a conversational workflow, KnowThyself offers a robust foundation for accessible LLM interpretability.

KnowThyself: An Agentic Assistant for LLM Interpretability

We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations (including schema details, intermediate results, and execution feedback) and receives actions that represent database exploration (e.g., previewing tables, sampling column values, retrieving unique values) as well as relational algebra operations (e.g., filter, project, join). We detail the motivation and the design of the environment. In the demo, we showcase the utility of the environment by contrasting it with contemporary LLMs that query databases. QueryGym serves as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation.

QueryGym: Step-by-Step Interaction with Relational Databases
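To make the idea of an explicit sequence of relational algebra operations concrete, the following Python sketch runs a filter, join, and project plan over in-memory tables. It is an assumption-laden illustration: the helper names, the plan format, and the sample tables are hypothetical, and it does not use or reproduce the QueryGym or Gymnasium APIs.

```python
def filter_rows(rows, predicate):
    """Selection: keep rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


def join(left, right, key):
    """Equi-join on a shared key column, using a hash index on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def run_plan(rows, plan):
    """Apply each (name, operation) step in order, logging row counts per step."""
    log = []
    for name, op in plan:
        rows = op(rows)
        log.append((name, len(rows)))
    return rows, log


# Hypothetical tables.
orders = [
    {"order_id": 1, "customer_id": 10, "total": 50},
    {"order_id": 2, "customer_id": 11, "total": 200},
    {"order_id": 3, "customer_id": 10, "total": 120},
]
customers = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Lin"},
]

plan = [
    ("filter", lambda rows: filter_rows(rows, lambda r: r["total"] > 100)),
    ("join", lambda rows: join(rows, customers, "customer_id")),
    ("project", lambda rows: project(rows, ["name", "total"])),
]
result, log = run_plan(orders, plan)
```

Because every step is a separate operation with its own intermediate result, the per-step log plays a role loosely similar to the intermediate results and execution feedback mentioned in the abstract.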

Modern manufacturing systems demand real-time, trustworthy, and interpretable insights into anomalies and their underlying causes. However, conventional pipelines treat anomaly detection, causal inference, and decision-making as siloed tasks, lacking integration, explainability, and adaptability. We present CausalPulse, an intelligent, multi-agent copilot for automated Root Cause Analysis (RCA) in industrial settings. Built on a modular and extensible architecture, the system leverages standard agentic protocols, including Model Context Protocol (MCP), Agent2Agent (A2A), and LangGraph for dynamic tool and agent discovery and seamless orchestration of tasks. Agents dynamically interact to perform data preprocessing, anomaly detection, causal discovery, and root cause analysis through a neurosymbolic workflow that combines symbolic reasoning with neural methods. Intelligent postprocessing pipelines enable automatic chaining of agent tasks, enhancing contextual awareness and adaptability. CausalPulse is evaluated on both a public academic dataset (Future Factories) and a proprietary industrial dataset (Planar Oxygen Sensor Element); the evaluation shows that the system outperforms traditional baselines in interpretability, trustworthiness, and operational utility.
Demo Video: https://tinyurl.com/nhat89bd

CausalPulse: Agentic Copilot for Root Cause Analysis in Smart Manufacturing

Although LLMs can generate tools for generic domains and tasks, they struggle with enterprise-related domains that involve proprietary APIs and data schemas. We present ToolSmith, a framework for autonomously generating and validating agent-compatible tools. Given an API specification and a Tool Specification Requirement (TSR), ToolSmith produces a tool function and verifies it through a closed-loop process: it creates natural language (NL) tests and executes the tool in a secure agent sandbox for validation. For state-changing tools, ToolSmith confirms outcomes by querying the API with parameters derived from the NL tests. If the tool fails to produce the desired output, ToolSmith generates diagnostic feedback to iteratively regenerate it. By ensuring both functional correctness and agent compatibility, ToolSmith enables reliable automation of enterprise workflows.

ToolSmith: A Multi-Agent Framework for Enterprise Tool Creation
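The closed-loop process in the ToolSmith abstract above (generate, validate, feed diagnostics back, regenerate) follows a general pattern that can be sketched in Python. The `closed_loop` function and the toy `generate` and `validate` callables are hypothetical stand-ins for the LLM-driven generation and sandboxed test execution; they are not ToolSmith's actual interfaces.

```python
def closed_loop(generate, validate, max_rounds=3):
    """Regenerate a candidate until validation passes or rounds run out.

    `generate(feedback)` returns a candidate tool; `validate(candidate)`
    returns a (passed, feedback) pair. Returns (candidate, rounds_used),
    or (None, max_rounds) if no candidate passed.
    """
    feedback = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        passed, feedback = validate(candidate)
        if passed:
            return candidate, round_number
    return None, max_rounds


def toy_generate(feedback):
    # The first draft is deliberately wrong; any feedback triggers the fix.
    if feedback is None:
        return lambda a, b: a - b
    return lambda a, b: a + b


def toy_validate(tool):
    # A single test case standing in for generated natural-language tests.
    result = tool(2, 3)
    if result == 5:
        return True, None
    return False, f"expected 5, got {result}"


tool, rounds = closed_loop(toy_generate, toy_validate)
```

In this toy run the first candidate fails validation, the diagnostic message is passed back to the generator, and the second candidate passes, so `rounds` is 2. Capping the loop with `max_rounds` is a design choice of this sketch to keep a persistently failing generator from looping forever.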

Developing new portfolio-management algorithms typically demands substantial programming effort, limiting rapid experimentation and excluding finance professionals without coding skills. Current robo-advisory tools offer pre-built but rigid strategies, restricting customization and experimentation. We introduce PortfolioPilot, an open-source, agentic platform that enables users to generate bespoke portfolio algorithms through natural-language descriptions. Leveraging the Anthropic Claude API, PortfolioPilot dynamically synthesizes executable TypeScript algorithms that run in the frontend with security validation. The system integrates real-time backtesting with historical market data, classical optimization algorithms (Markowitz, LSTM, ARIMA), and interactive performance visualizations.

PortfolioPilot: An Agentic Platform for Financial Portfolio Management Algorithm Development and Evaluation

The integration of Large Language Models (LLMs) into clinical applications presents transformative potential but is undermined by the critical risk of hallucination, the generation of plausible but factually incorrect information. Such failures pose a direct threat to patient safety and the integrity of clinical decision-making. To address this challenge, we introduce MHB, a novel and comprehensive benchmark framework designed to evaluate LLM reliability in two complex, high-stakes clinical contexts: multi-turn medical dialogues and clinical case report analysis. The core of our contribution is a systematic methodology for generating adversarial test cases by injecting "hallucination traps" into realistic medical data, guided by a fine-grained taxonomy of clinical errors.
MHB, comprising 4,695 samples and 20,288 evaluation rubrics, underwent a rigorous, two-stage validation by a panel of 60 licensed physicians from top-tier hospitals, ensuring high clinical realism and consistency. This comprehensive assessment of leading LLMs revealed significant, clinically relevant shortcomings across the board. Even the best-performing model, Claude-4-Sonnet, exhibited a hallucination rate of 29.1%, with some open-source models exceeding 57.0%. All models struggled with specific traps, like fabricated medical data or non-existent guidelines, highlighting prevalent systemic weaknesses.

MHB: Medical Hallucination Benchmark for Large Language Models in Complex Clinical Tasks

Accurate detection of offensive content on social media demands high-quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low-resource challenge, we propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling. Starting with a lightweight classifier trained on limited labeled data, our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models (MA-VLMs). Unlabeled data on which the classifier and MA-VLMs agree are designated as the Agreed-Unknown set, while conflicting samples form the Disagreed-Unknown set. To enhance label reliability, MA-VLMs simulate dual perspectives, moderator and user, capturing both regulatory and subjective viewpoints. The classifier is optimized using a novel Positive-Negative-Unlabeled (PNU) loss, which jointly exploits labeled, Agreed-Unknown, and Disagreed-Unknown data while mitigating pseudo-label noise. Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision and approaches the performance of large-scale models.
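The split into Agreed-Unknown and Disagreed-Unknown sets described above can be sketched in Python as follows. The sketch assumes that the MA-VLM perspectives have already been combined into one label per sample, which the abstract does not specify; the function name and the label format are hypothetical, and the PNU loss itself is not reproduced here.

```python
def split_by_agreement(classifier_labels, vlm_labels):
    """Partition unlabeled samples by classifier/VLM agreement.

    Both arguments map a sample id to a predicted label. Returns
    (agreed, disagreed): `agreed` maps sample ids to the shared
    pseudo-label, `disagreed` lists the ids with conflicting predictions.
    Samples without a VLM label are treated as disagreements.
    """
    agreed = {}
    disagreed = []
    for sample_id, label in classifier_labels.items():
        if sample_id in vlm_labels and vlm_labels[sample_id] == label:
            agreed[sample_id] = label
        else:
            disagreed.append(sample_id)
    return agreed, disagreed


# Hypothetical predictions for four unlabeled posts.
classifier_labels = {"p1": "offensive", "p2": "benign", "p3": "benign", "p4": "offensive"}
vlm_labels = {"p1": "offensive", "p2": "offensive", "p3": "benign"}
agreed, disagreed = split_by_agreement(classifier_labels, vlm_labels)
```

In an iterative self-training loop, the agreed pseudo-labels would feed the next training round, while the disagreed samples would be handled separately by the loss.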
