Singapore

Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60% (single run) to 25% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose **CLEAR** (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation $\rho=0.83$) compared to accuracy-only evaluation ($\rho=0.41$).
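The gap between single-run accuracy and multi-run consistency mentioned in this abstract can be illustrated with a small calculation. The sketch below assumes each task is run several times and logged as pass/fail, and it reads "k-run consistency" as the fraction of tasks that pass on every run. That reading is an assumption and may differ from the paper's exact definition; the function names and the toy data are hypothetical.

```python
from typing import Sequence


def single_run_accuracy(runs: Sequence[Sequence[bool]]) -> float:
    """Mean pass rate over all individual runs of all tasks."""
    total = sum(len(task_runs) for task_runs in runs)
    passed = sum(sum(task_runs) for task_runs in runs)
    return passed / total


def all_runs_consistency(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks that pass on every one of their runs."""
    return sum(all(task_runs) for task_runs in runs) / len(runs)


# Toy data (hypothetical): 4 tasks, 8 runs each.
toy_runs = [
    [True] * 8,                # always passes
    [True] * 6 + [False] * 2,  # flaky
    [True] * 4 + [False] * 4,  # flaky
    [False] * 8,               # always fails
]

print(single_run_accuracy(toy_runs))   # 0.5625
print(all_runs_consistency(toy_runs))  # 0.25
```

Flaky tasks still contribute to the per-run average but count as failures under the all-runs criterion, which is why the second number is lower.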

AAAI 2026

Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




In today’s fast-paced environment, the ability to swiftly access, understand, and act on data is no longer optional—it is essential. Yet most organizations remain data-rich but insight-poor, constrained by the complexity of querying, interpreting, and explaining enterprise-scale information. We present POLARIS, a supervisor-led multi-agent framework for conversational enterprise analytics that bridges this gap. POLARIS introduces Dynamic Task Coordination (DTC), a decision-theoretic orchestration layer that models agent–task assignment as adaptive bipartite matching, enabling real-time coordination, recovery, and optimization across specialized agents for querying, visualization, and reasoning. By coupling DTC with reason-first, ReAct-style agents, POLARIS transforms natural language queries into coherent analytical workflows that not only retrieve and visualize data but also explain the underlying “why.” Evaluation on structured enterprise datasets demonstrates high semantic fidelity and answer relevancy, underscoring the potential of multi-agent orchestration to deliver trustworthy, end-to-end business intelligence at scale.
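To make the idea of agent-task assignment as bipartite matching concrete, here is a minimal sketch. It is not the POLARIS implementation: the cost matrix, the agent and task names, and the brute-force solver are all assumptions chosen for illustration. It finds the minimum-cost one-to-one assignment for a small square cost matrix.

```python
from itertools import permutations
from typing import List, Tuple


def best_assignment(cost: List[List[float]]) -> Tuple[Tuple[int, ...], float]:
    """Return (assignment, total_cost), where assignment[t] is the agent for task t.

    Brute force over all permutations, so only suitable for a handful of agents.
    Assumes a square matrix (as many agents as tasks).
    """
    n = len(cost)
    best_perm: Tuple[int, ...] = ()
    best_total = float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[task][agent] for task, agent in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical names and costs: cost[t][a] is the estimated cost of agent a doing task t.
tasks = ["query", "visualize", "explain"]
agents = ["sql_agent", "viz_agent", "reasoning_agent"]
cost = [
    [1, 5, 4],
    [6, 2, 5],
    [4, 6, 1],
]

assignment, total = best_assignment(cost)
for task_index, agent_index in enumerate(assignment):
    print(tasks[task_index], "->", agents[agent_index])
print("total cost:", total)  # total cost: 4
```

An adaptive variant would re-solve the matching whenever the costs change, for example after an agent fails a step. For larger problems, a polynomial-time solver such as the Hungarian algorithm would replace the brute-force search.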

Polaris : Multi Agentic System for Conversational Enterprise Analytics

Tool calling enables Large Language Models (LLMs) to interact with external environments via tool invocation, providing a practical mechanism to overcome the inherent limitations of pretraining. However, the effectiveness of tool use depends critically on the quality of associated documentation and knowledge base context, which are typically authored for human users and often misaligned with LLMs' interpretive needs. This issue is further amplified in industrial settings, where hundreds of tools with overlapping functionalities introduce challenges of scalability, variability, and ambiguity. 

We propose Verification-Guided Context Optimization (VGCO), a framework that employs LLMs-as-editors to automatically refine tool-related documentation and knowledge base context. VGCO operates in two stages: (1) Evaluation, which collects real-world failure cases and identifies tool-context mismatches; and (2) Optimization, which performs hierarchical editing through offline learning with structure-aware, in-context optimization. 

The novelty of our LLM editors lies in three key aspects: (1) a hierarchical structure that integrates naturally into the tool-calling workflow; (2) a state-aware, action-specific, and verification-guided design that constrains the search space for efficient, targeted improvements; and (3) the potential for cost-efficient sub-task specialization through either prompt engineering of large editor models or post-training of smaller editor models. Unlike prior work emphasizing multi-turn reasoning, VGCO targets the single-turn, large-scale tool-calling problem, achieving significant gains in accuracy, robustness, and generalization across LLMs.

Verification-Guided Context Optimization for Tool Calling via Hierarchical LLMs-as-editors

As multimodal LLMs become tool-using agents, the field still lacks a standardized metric for translating visual inputs into correct tool invocations. We introduce MFCL Vision, the first large-scale benchmark for vision-based function calling, comprising 250 expert-verified tasks across five image domains (Places, Events, Media, Sports, Shopping) and five query types (Locate, Temporal, Select, Identify, Quantify). Each task comprises (1) a textual user query, (2) an accompanying image, (3) a ground-truth answer obtained from the web, and (4) a human-produced reasoning trace for comparative error analysis. To constrain the task, we expose a single web-search tool to each model. To examine the robustness of LLMs' perception-to-tool-use pipeline, we introduce controlled visual perturbations, including crops, resizes, and color channel removal. Our automatic grader computes exact-match scores on the models' final answers, removing dependence on brittle and potentially biased LLM judges. We evaluate leading models and present a taxonomy of failure modes, including visual reasoning, assumption bias, keyword selection, and tool avoidance errors. By releasing MFCL Vision’s dataset, taxonomy, and diagnostics, we aim to accelerate progress towards versatile multimodal agents capable of intelligent tool usage in complex visual contexts.
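An exact-match grader of the kind this abstract describes can be sketched in a few lines. The abstract does not state the benchmark's normalization rules, so the ones below (lowercasing, stripping ASCII punctuation, collapsing whitespace) are assumptions, and the function names are hypothetical.

```python
import re
import string
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(ground_truth)


def exact_match_score(predictions: Sequence[str], ground_truths: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their ground truth."""
    pairs = list(zip(predictions, ground_truths))
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)


print(exact_match("  The Eiffel Tower. ", "the eiffel tower"))  # True
print(exact_match_score(["Paris", "42"], ["paris.", "43"]))     # 0.5
```

Because the comparison is deterministic, the same model output always receives the same score, which is the property that removes the dependence on an LLM judge.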

MFCL Vision: Benchmarking Tool Use in Multimodal Large Language Models for Visual Reasoning Tasks

Evaluating generative AI (GenAI) products for compliance in high-stakes domains such as healthcare is difficult across multi-turn conversations and diverse contexts. This is especially true and pressing for enterprises evaluating human-facing AI chatbots, which can output life-threatening content such as responses supporting self-harm. Existing approaches for automated compliance focus on single-agent methods (e.g., prompt tuning). These are difficult to adopt in practice because they diverge from established enterprise compliance workflows and lack mechanisms for uncertainty quantification (UQ). This paper introduces a system for converting any compliance workflow into a human-in-the-loop, multi-agent framework. The approach maps process roles to specialized agents (e.g., content, business, legal), enables escalation between agents analogous to human review procedures, and allows human input to be incorporated for uncertain classifications. Methodologically, this system differs from more conventional single-orchestrator/supervisor designs in multi-agent settings by formulating the process as a constrained Markov Decision Process (MDP) and using Monte Carlo simulations to estimate empirical agent uncertainty. We demonstrate this framework on an open-source safety benchmark for self-harm detection. Results show improvements over a single-agent baseline in accuracy (up to 19%), reduction in required human review (up to 85×), and, in some configurations, less processing time. The framework also identified several mislabeled items in the open-source benchmarks. We reported the suspected label errors to the maintainers; several were acknowledged as likely errors and candidates for correction, underscoring the timeliness and promise of this approach.
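The Monte Carlo uncertainty estimate and the human escalation step can be illustrated with a toy example. This sketch does not reproduce the paper's constrained MDP formulation. It only shows the simpler idea of sampling a stochastic classifier many times, using the empirical frequency of the majority label as a confidence estimate, and escalating low-confidence items. The agent, the threshold, and all names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def estimate_label(
    agent: Callable[[str, random.Random], str],
    text: str,
    n_samples: int = 200,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample the agent repeatedly; return (majority label, its empirical frequency)."""
    rng = random.Random(seed)
    counts = Counter(agent(text, rng) for _ in range(n_samples))
    label, count = counts.most_common(1)[0]
    return label, count / n_samples


def route(label: str, confidence: float, threshold: float = 0.9) -> str:
    """Accept confident decisions automatically; escalate the rest to a human."""
    return f"auto:{label}" if confidence >= threshold else "escalate:human_review"


def toy_agent(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM classifier with sampling noise."""
    p_unsafe = 0.99 if "explicit" in text else 0.4
    return "unsafe" if rng.random() < p_unsafe else "safe"


for message in ["explicit policy violation", "ambiguous remark"]:
    label, confidence = estimate_label(toy_agent, message)
    print(message, "->", route(label, confidence))
```

The threshold trades automation against review load: raising it sends more items to human reviewers, lowering it automates more decisions at higher risk.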

Auditing Generative AI Benchmarks with a Multi-Agent Compliance System

As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. 
We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal enterprise-grade datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications.

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in fields like law or medicine. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models by visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, whether or not the model is hallucinating, tend to trace out a distinct horn-like shape. Our work underscores FalseCite’s potential as a foundation for evaluating and mitigating hallucinations in future LLM research.

Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

Recent advancements in foundation models have catalyzed research in Embodied AI to develop interactive agents capable of environmental reasoning and interaction. Developing such agents requires diverse, large-scale datasets. Prior frameworks generate synthetic data for long-term human-robot interactions and 3D environments but fail to model the bidirectional influence between human behavior and household environments. Our proposed generative framework creates household datasets at scale through loosely coupled generation of long-term human-robot interactions and environments. Human personas influence environment generation, while environment schematics and semantics shape human-robot interactions. 

The generated 3D data includes rich static context such as object and environment semantics, as well as temporal context capturing human and agent behaviors over extended periods. Our flexible tool allows users to define dataset characteristics via natural language prompts, covering both environment and human activity data. The tool can create variations of user-defined configurations, thus enabling scalable data generation. 

We validate our framework through comprehensive statistical evaluation using multi-modal embeddings and the following analyses: cosine similarity analysis, mutual information gain, intervention analysis, and iterative improvement validation. Statistical comparisons demonstrate good alignment with real-world datasets (HOMER), with high cosine similarity (0.60), while comparisons with synthetic datasets (Wang et al.) show moderate alignment (0.27). Intervention analysis across age, organization, and sleep pattern modifications shows statistically significant effects ($p < 0.001$) with large effect sizes (Cohen's d = 0.51-1.12), confirming that bidirectional coupling successfully translates persona characteristics into measurable differences in both environmental configurations and behavioral patterns. All these contributions will enable the development and testing of household smart devices at scale.
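Two of the statistics reported here, cosine similarity and Cohen's d, have standard definitions that a short sketch can make concrete. The code below uses the textbook formulas (Cohen's d with the pooled sample standard deviation). Whether the authors used exactly these variants is an assumption, and the input values are made up.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up numbers, for illustration only.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))       # 0.0
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))      # 0.5
```

In the second call both groups have sample variance 4, so the pooled standard deviation is 2 and the mean difference of 1 gives d = 0.5.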

Realistic Synthetic Household Data Generation at Scale

Enterprise AI workloads often exhibit strong long-tail distributions, characterized by a large number of infrequent, high-complexity tasks occurring alongside routine ones. Conventional large language model (LLM) deployments typically process all inputs uniformly, leading to brittle, costly, and computationally inefficient systems. In this work, we introduce a set of complementary runtime techniques—Dynamic Prompting, Dynamic Context Control (DCC), and Dynamic Model Selection—that enable agentic AI systems to adapt dynamically to task heterogeneity and effectively address the long-tail distribution inherent in real-world agentic workloads.

Scalable Strategies for Agentic-AI to Handle Long-Tail Enterprise Use Cases

Agentic AI systems depend on accurate perception and grounding of enterprise data to plan and act reliably. Yet most organizational datasets remain locked in schema-level representations that lack semantic linkage to business meaning, limiting transparency, trust, and regulatory compliance. This paper introduces a semantic correlation framework that serves as the perceptual grounding layer for agentic AI in enterprise environments. Built on efficient small language models, the framework aligns database columns with business terms and textual descriptions to establish vertical data lineage from structured metadata. A human-in-the-loop pipeline is proposed for real-world deployment, where high-confidence correlations are automated while the remaining mappings are routed for expert review, ensuring traceability, explainability, and continuous refinement. The approach enhances data visibility, auditability, and governance, supports regulatory mandates such as GDPR, and enables trustworthy, small-model deployment within agentic enterprise AI systems.

Grounding Enterprise Data for Agentic AI: A Semantic Approach to Vertical Data Lineage using Small Language Models

The growing demand for multimodal retrieval-augmented generation (RAG) has exposed critical limitations in existing frameworks, which often rely on static routing or heuristic decision rules, leading to inefficiency and poor adaptability to query-specific modality needs. To address these challenges, we present FusionMind, a modular multimodal RAG architecture that combines differentiable modality selection and agent-based iterative reasoning. At the core of FusionMind is a Gumbel–Softmax selector that learns discrete, query-adaptive modality routing across text, layout, chart, and table evidence. This selector is integrated into a three-agent reasoning pipeline (Seeker, Inspector, and Synthesizer) that progressively filters, validates, and synthesizes multimodal evidence. Unlike heuristic or Gaussian mixture-based baselines, FusionMind enables supervised training of routing policies and transparent reasoning dynamics. We evaluate FusionMind on the ViDoSeek benchmark, a challenging dataset requiring cross-modal retrieval and multi-hop reasoning. Compared to the ViDoRAG baseline, FusionMind achieves substantial improvements across standard retrieval metrics, including Recall@1 (+67.9 pp), nDCG@all (+37.5 pp), and MRR@all (+49.1 pp). Category-wise results show strong absolute performance on harder query types such as multi-hop, chart, and table; because the baseline does not report distinct per-category scores, we report absolute levels without deltas. These results demonstrate that unifying differentiable routing with iterative agent reasoning offers a scalable and interpretable path forward for multimodal RAG, significantly advancing performance on visually rich document understanding tasks.
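The Gumbel-Softmax trick behind the modality selector can be shown with a forward-pass sketch. This is not FusionMind's selector: the logits are made-up numbers, the function name is hypothetical, and the code only draws a relaxed sample. Training such a selector end to end would additionally need an automatic differentiation framework.

```python
import math
import random
from typing import List, Optional, Sequence


def gumbel_softmax(
    logits: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw one relaxed categorical sample: softmax((logits + Gumbel noise) / temperature)."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1); clamp U to keep logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    shift = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


modalities = ["text", "layout", "chart", "table"]
logits = [1.2, 0.1, 2.0, 0.3]  # hypothetical per-query routing scores

weights = gumbel_softmax(logits, temperature=0.5, rng=random.Random(0))
chosen = modalities[weights.index(max(weights))]
print(dict(zip(modalities, weights)), "->", chosen)
```

Lower temperatures push the sample toward a one-hot choice, while higher temperatures spread the weight more evenly across modalities.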
