Retrieval-Augmented Generation (RAG) has become the standard approach for integrating domain knowledge into Large Language Models (LLMs). However, fair comparison of RAG pipelines remains difficult: data preparation is often ad hoc, subsampling methods are opaque, parameters vary across implementations, and evaluation is fragmented. We present In-Situ Eval, a unified and reproducible framework that operationalizes the full RAG pipeline with configurable subsampling strategies and both RAG-specific and generic evaluation metrics. The platform supports two execution modes: an offline Dataset mode for evaluating precomputed outputs, and a live Retrieval mode for benchmarking RAG variants with state-of-the-art LLMs. Users can flexibly select datasets, retrieval techniques, models, and metrics, enabling side-by-side comparisons, ablations, and targeted analyses. This holistic approach reduces computational costs, clarifies the impact of subsampling techniques, and provides actionable insights for real-world deployments. By facilitating transparent, customizable, and interactive benchmarking, In-Situ Eval empowers both researchers and practitioners to make informed decisions in adapting RAG pipelines to domain-specific needs. The demo video is available at https://youtu.be/DoKKwQbclIg. The code is available at https://github.com/Ritvik-G/in-situ_eval/.
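To make the configurable-comparison idea concrete, the sketch below shows how a user might declare several evaluation runs, each selecting a mode (offline Dataset vs. live Retrieval), a subsampling strategy, a retriever, a model, and a metric set, and then compare them side by side. All names here (`EvalConfig`, `compare`, the dataset/retriever/model strings) are hypothetical illustrations of the workflow described in the abstract, not the actual In-Situ Eval API.

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    # Hypothetical configuration object; field names are illustrative only.
    mode: str        # "dataset" (offline, precomputed outputs) or "retrieval" (live)
    dataset: str     # benchmark corpus identifier
    subsample: str   # subsampling strategy, e.g. "random" or "stratified"
    retriever: str   # retrieval technique; unused in "dataset" mode
    model: str       # LLM used for generation
    metrics: tuple   # RAG-specific and generic metrics to report

def compare(configs):
    """Return one summary row per configuration for side-by-side comparison."""
    rows = []
    for cfg in configs:
        # A real framework would execute the pipeline and attach metric scores;
        # this sketch only records each run's setup.
        rows.append({"mode": cfg.mode, "retriever": cfg.retriever,
                     "model": cfg.model, "metrics": list(cfg.metrics)})
    return rows

# Example: an ablation pairing an offline run with a live BM25-retrieval run.
runs = [
    EvalConfig("dataset", "example-qa", "stratified", "-", "example-llm",
               ("faithfulness", "rougeL")),
    EvalConfig("retrieval", "example-qa", "stratified", "bm25", "example-llm",
               ("faithfulness", "rougeL")),
]
print(compare(runs))
```

Keeping runs as declarative configuration objects like this is what enables reproducible ablations: two runs that differ in exactly one field isolate that component's effect.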