Singapore

We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.

AAAI 2026

Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation

Keywords: regulation & governance, HAI: human-computer interaction, humans and AI (HAI), PEAI: AI & law, justice, demo

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Large language models (LLMs) often produce factually inaccurate content, or hallucinations, which undermines their reliability. Existing factuality evaluation systems usually rely on a single predefined fact source, making them task-specific and hard to extend. We present UFO, a unified framework for factuality evaluation that supports multiple plug-and-play fact sources. UFO integrates human-written evidence, web search results, and LLM knowledge within a single evaluation pipeline, and allows users to select, reorder, and define custom sources. The system is accessible through both a Python interface and a web-based demo, offering interactive claim-level verification and visualization. Experiments show that UFO achieves moderate consistency with human annotations. Overall, UFO serves as a transparent and extensible platform for benchmarking fact sources, comparing LLMs, and enabling real-world fact-checking applications across diverse domains.

Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources
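The idea of ordered, plug-and-play fact sources can be illustrated with a minimal Python sketch. This is not UFO's actual interface: the names `FactSource`, `Verdict`, `dict_source`, `verify_claims`, and `factuality_score` are hypothetical, and the sketch assumes that sources are consulted in the user's chosen order and that the first source with evidence decides the claim.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: a fact source maps a claim to True/False,
# or to None when it has no evidence about the claim.
FactSource = Callable[[str], Optional[bool]]


@dataclass
class Verdict:
    claim: str
    supported: Optional[bool]  # None means no source could decide
    source: Optional[str]      # name of the source that decided


def dict_source(facts: Dict[str, bool]) -> FactSource:
    """Toy source backed by a lookup table (stand-in for real evidence)."""
    return lambda claim: facts.get(claim)


def verify_claims(
    claims: List[str], sources: List[Tuple[str, FactSource]]
) -> List[Verdict]:
    """Check each claim against the sources in order (assumed fallback policy)."""
    results = []
    for claim in claims:
        verdict = Verdict(claim, None, None)
        for name, source in sources:
            answer = source(claim)
            if answer is not None:
                verdict = Verdict(claim, answer, name)
                break  # the first source with evidence wins
        results.append(verdict)
    return results


def factuality_score(verdicts: List[Verdict]) -> float:
    """Fraction of decided claims that are supported (0.0 if none decided)."""
    decided = [v for v in verdicts if v.supported is not None]
    if not decided:
        return 0.0
    return sum(1 for v in decided if v.supported) / len(decided)
```

Under these assumptions, reordering the `sources` list changes which source is credited with each verdict, and a custom source is any callable with the same signature.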

In this work, we introduce CAT-V (Caption Anything in Video), a training-free framework for fine-grained object-centric video captioning of user-selected instances. CAT-V combines (i) a SAMURAI-based Segmenter for precise object masks across frames, (ii) a TRACE-Uni Temporal Analyzer for event boundary detection and coarse event descriptions, and (iii) an InternVL-2.5 Captioner that, conditioned on spatiotemporal visual prompts and chain-of-thought (CoT) guidance, produces detailed, temporally coherent captions about object attributes, actions, states, interactions, and context. The system supports point, box, and region prompts and maintains temporal sensitivity by tracking object states across segments. In contrast to vanilla video captioning, which is overly abstract, and dense video captioning, which is often terse, CAT-V enables object-level specificity with spatial accuracy and temporal coherence, without additional training data.

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

We present Risk Atlas Nexus, an open source system for governing AI risks. The system unifies several risk classification frameworks through a common ontology. Given an AI application use case (called an intent), the system estimates the risks involved and links each identified risk to associated mitigations. The tool is designed to be incorporated in AI governance workflows, where its recommendations can be translated into business controls that cover risks arising from AI use in firms.

Risk Atlas Nexus: A System for Managing AI Risks

We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text.

IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection
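The graph structure described in the abstract (claims as nodes, evidence as node properties, scored support or attack edges) can be sketched in Python as follows. This is an illustrative assumption, not IntelliProof's implementation: the class and method names are hypothetical, the edge scores stand in for LLM-assigned values, and `net_support` is a toy aggregate rather than the system's coherence measure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Claim:
    text: str
    evidence: List[str] = field(default_factory=list)  # stored as a node property


@dataclass
class Relation:
    source: str   # id of the claim that supports or attacks
    target: str   # id of the claim being supported or attacked
    kind: str     # "support" or "attack"
    score: float  # assumed strength in [0, 1], e.g. assigned by an LLM


class ArgumentGraph:
    def __init__(self) -> None:
        self.claims: Dict[str, Claim] = {}
        self.relations: List[Relation] = []

    def add_claim(self, claim_id: str, text: str, evidence=()) -> None:
        self.claims[claim_id] = Claim(text, list(evidence))

    def add_relation(self, source: str, target: str, kind: str, score: float) -> None:
        if kind not in ("support", "attack"):
            raise ValueError(f"unknown relation kind: {kind!r}")
        if source not in self.claims or target not in self.claims:
            raise KeyError("both endpoints must be existing claims")
        self.relations.append(Relation(source, target, kind, score))

    def net_support(self, claim_id: str) -> float:
        """Incoming support scores minus incoming attack scores (toy measure)."""
        total = 0.0
        for rel in self.relations:
            if rel.target == claim_id:
                total += rel.score if rel.kind == "support" else -rel.score
        return total
```

A claim with one supporting edge of strength 0.8 and one attacking edge of strength 0.3 would have a net support of about 0.5 under this toy measure.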

Retrieval-Augmented Generation (RAG) has become the standard approach for integrating domain knowledge into Large Language Models (LLMs). However, fair comparison of RAG pipelines remains difficult: data preparation is often ad hoc, subsampling methods are opaque, parameters vary across implementations, and evaluation is fragmented. We present In-Situ Eval, a unified and reproducible framework that operationalizes the full RAG pipeline with configurable subsampling strategies and both RAG-specific and generic evaluation metrics. The platform supports two execution modes: an offline Dataset mode for evaluating precomputed outputs, and a live Retrieval mode for benchmarking RAG variants with state-of-the-art LLMs. Users can flexibly select datasets, retrieval techniques, models, and metrics, enabling side-by-side comparisons, ablations, and targeted analyses. This holistic approach reduces computational costs, clarifies the impact of subsampling techniques, and provides actionable insights for real-world deployments. By facilitating transparent, customizable, and interactive benchmarking, In-Situ Eval empowers both researchers and practitioners to make informed decisions in adapting RAG pipelines to domain-specific needs. The demo video is available at https://youtu.be/DoKKwQbclIg. The code is available at https://github.com/Ritvik-G/in-situ_eval/ .

In-Situ Eval: A Modular Framework for Custom and Real-Time RAG Benchmarking

Multi-Agent Path Finding (MAPF) algorithms provide highly optimized solutions for coordinating multiple agents in shared environments, yet their outputs are hard for human stakeholders to interpret. Existing explanation approaches, such as visual trace segmentation or logic-based reasoning, remain fragmented. In this demo, we present OMEGA, an interactive explanation platform that generates Natural Language (NL) explanations using the novel Multi-Agent Planning Ontology (maPO). Our framework transforms raw MAPF planner execution logs into a semantic knowledge graph, enabling SPARQL-based explanations of collision events, replanning strategies, and efficiency trade-offs. A lightweight web interface allows users to query, visualize, and interpret planner decisions, thereby making MAPF solutions transparent and auditable. A user study we conducted confirms that participants found the ontology-driven explanations significantly clearer than raw logs and preferred them, underscoring the potential of semantic technologies for explainable multi-agent systems. Demo video link: https://shorturl.at/298j5

OMEGA: An Ontology-Driven Tool for Explaining Multi-Agent Path Finding

We present MemoVision, a digital catalog system that captures semantic, spatial, temporal and interaction information as users move around physical environments using client devices such as smart glasses. The system utilizes open-vocabulary semantic segmentation and 3D scans to store objects-of-interest with comprehensive semantic, spatial, temporal and interaction labels. Our demonstration shows multimodal information query and retrieval capabilities, supporting specific queries about object locations, temporal events and user interactions including eye gaze and hand poses, enabling more contextualized responses compared to current multimodal large language models.

MemoVision: A Digital Catalog for Everyday Interactions

Industrial automation increasingly relies on multi-agent AI, yet evaluation remains difficult due to task complexity and data confidentiality. We present AssetOpsBench-Live, a demo of a competition-ready platform for real-time, privacy-preserving evaluation of multi-agent AI in industrial contexts. The platform integrates AssetOpsBench, which measures six dimensions of multi-agent performance and performs automated failure-mode discovery, with Codabench, which supports reproducible, code-oriented competitions. End users first validate agents locally, then submit containerized code for execution on hidden industrial scenarios. Instead of raw trajectories, the system provides quantitative scores and clustered failure modes (e.g., reasoning-action mismatch, step repetition), enabling participants to identify failures, apply targeted improvements, and iteratively resubmit. By combining competition-based engagement with actionable diagnostics, AssetOpsBench-Live delivers reproducible, real-time insights reflecting real-world industrial constraints.

AssetOpsBench-Live: Privacy-Aware Online Evaluation of Multi-Agent Performance in Industrial Operations

Capturing expertise and enabling efficient information retrieval are critical in the energy sector, where high staff turnover can lead to significant knowledge loss. Retrieval-Augmented Generation (RAG) offers a solution by grounding Large Language Model (LLM) outputs in documented sources, but its effectiveness is limited by reliance on general-purpose embeddings. We present Wikatoni, an agentic AI system for energy engineering workflows that integrates a novel domain-specific embedding model. Wikatoni combines fine-tuned embeddings with agentic RAG, metadata filtering, and hybrid retrieval to improve document search, automated reporting, and workflow efficiency. Evaluation on internal offshore energy data shows that the domain-adapted embedding improves recall by 10%, and Wikatoni Agentic further increases answer accuracy by 14% compared to vanilla RAG with the base embedding model, achieving the best overall performance in context recall, faithfulness, and answer accuracy.

Wikatoni: An Agentic AI System for Energy Engineering Workflows

Deploying AI on microcontrollers is challenging. We introduce MIMaaS, a Microcontroller-as-a-Service platform that enables users to upload a model, select a target device, and receive a detailed performance report. A key innovation is our measurement of real-world power consumption, alongside latency and memory usage, directly from the physical hardware. MIMaaS empowers researchers and developers to easily create and validate hardware-aware AI models without needing physical hardware access.
