Singapore

Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses, but relying on human expertise is costly and not scalable. 
We propose \texttt{The Forecast Critic}, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong ``reasoning'' capabilities. 
As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions. 
(1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts? 
(2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like? 
(3) How does performance vary across model sizes and reasoning capabilities, measured across five state-of-the-art LLMs? 
We present three experiments, including both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors. 
The best-performing model we evaluated achieves an F1 score of $0.88$, somewhat below human-level performance (F1 score: $0.97$). We demonstrate that multi-modal LLMs can effectively incorporate unstructured contextual signals to refine their assessment of the forecast. Models correctly identify missing or spurious promotional spikes when provided with historical context about past promotions (F1 score: $0.84$). Lastly, we demonstrate that these techniques succeed in identifying significantly inaccurate forecasts on the real-world M5 time series dataset, with unreasonable forecasts having an sCRPS at least 10\% higher than that of reasonable forecasts. 
These findings suggest that LLMs, even without domain-specific fine-tuning, may provide a viable and scalable option for automated forecast monitoring and evaluation.
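The F1 scores quoted in the abstract combine precision and recall on the binary decision "is this forecast unreasonable?". The sketch below illustrates only that arithmetic; the labels and verdicts are made-up toy values, not data from the paper, and the function name `f1_score` is our own.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for a binary task; `positive` marks an unreasonable forecast."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth (1 = unreasonable forecast) and model verdicts.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
verdicts = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"F1 = {f1_score(labels, verdicts):.2f}")
```

Here one unreasonable forecast is missed and one reasonable forecast is flagged, so precision and recall are both 3/4.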

AAAI 2026

The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

As multimodal LLMs become tool-using agents, the field still lacks a standardized metric for translating visual inputs into correct tool invocations. We introduce MFCL Vision, the first large-scale benchmark for vision-based function calling, comprising 250 expert-verified tasks across five image domains (Places, Events, Media, Sports, Shopping) and five query types (Locate, Temporal, Select, Identify, Quantify). Each task comprises (1) a textual user query, (2) an accompanying image, (3) a ground-truth answer obtained from the web, and (4) a human-produced reasoning trace for comparative error analysis. To constrain the task, we expose a single web-search tool to each model. To examine the robustness of LLMs’ perception-to-tool-use pipeline, we introduce controlled visual perturbations, including crops, resizes, and color channel removal. Our automatic grader computes exact-match scores on the models' final answers, removing dependence on brittle and potentially biased LLM judges. We evaluate leading models and present a taxonomy of failure modes, including visual reasoning, assumption bias, keyword selection, and tool avoidance errors. By releasing MFCL Vision’s dataset, taxonomy, and diagnostics, we aim to accelerate progress towards versatile multimodal agents capable of intelligent tool usage in complex visual contexts.

MFCL Vision: Benchmarking Tool Use in Multimodal Large Language Models for Visual Reasoning Tasks

Evaluating generative AI (GenAI) products for compliance in high-stakes domains such as healthcare is difficult across multi-turn conversations and diverse contexts. This is especially pressing for enterprises evaluating human-facing AI chatbots, which can output life-threatening content such as responses supporting self-harm. Existing approaches for automated compliance focus on single-agent approaches (e.g., prompt tuning). These are difficult to adopt in practice because they diverge from established enterprise compliance workflows and lack mechanisms for uncertainty quantification (UQ). This paper introduces a system for converting any compliance workflow into a human-in-the-loop, multi-agent framework. The approach maps process roles to specialized agents (e.g., content, business, legal), enables escalation between agents analogous to human review procedures, and incorporates human input for uncertain classifications. Methodologically, this system differs from more conventional single-orchestrator/supervisor designs in multi-agent settings by formulating the process as a constrained Markov Decision Process (MDP) and using Monte Carlo simulations to estimate empirical agent uncertainty. We demonstrate this framework on an open-source safety benchmark for self-harm detection. Results show improvements over a single-agent baseline in accuracy (up to 19%), reduction in required human review (up to 85×), and, in some configurations, less processing time. The framework also identified several mislabeled items in the open-source benchmark. We reported suspected label errors to the maintainers; several were acknowledged as likely errors and candidates for correction, emphasizing the timeliness and promise of this approach.

Auditing Generative AI Benchmarks with a Multi-Agent Compliance System

As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. 
We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal enterprise-grade datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications.

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in fields like law or medicine. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral 7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models by visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, whether or not the model is hallucinating, tend to trace out a distinct horn-like shape. Our work underscores FalseCite’s potential as a foundation for evaluating and mitigating hallucinations in future LLM research.

Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

Recent advancements in foundation models have catalyzed research in Embodied AI to develop interactive agents capable of environmental reasoning and interaction. Developing such agents requires diverse, large-scale datasets. Prior frameworks generate synthetic data for long-term human-robot interactions and 3D environments but fail to model the bidirectional influence between human behavior and household environments. Our proposed generative framework creates household datasets at scale through loosely coupled generation of long-term human-robot interactions and environments. Human personas influence environment generation, while environment schematics and semantics shape human-robot interactions. 

The generated 3D data includes rich static context such as object and environment semantics, as well as temporal context capturing human and agent behaviors over extended periods. Our flexible tool lets users configure both environment and human activity data through natural language prompts. The tool can also create variations of user-defined configurations, thus enabling scalable data generation. 

We validate our framework through comprehensive statistical evaluation using multi-modal embeddings and several complementary analyses: cosine similarity analysis, mutual information gain, intervention analysis, and iterative improvement validation. Statistical comparisons demonstrate good alignment with real-world datasets (HOMER), with high cosine similarity (0.60), while comparisons with synthetic datasets (Wang et al.) show moderate alignment (0.27). Intervention analysis across age, organization, and sleep pattern modifications shows statistically significant effects (p $<$ 0.001) with large effect sizes (Cohen's d = 0.51-1.12), confirming that bidirectional coupling successfully translates persona characteristics into measurable differences in both environmental configurations and behavioral patterns. Together, these contributions will enable the development and testing of household smart devices at scale.

Realistic Synthetic Household Data Generation at Scale

Enterprise AI workloads often exhibit strong long-tail distributions, characterized by a large number of infrequent, high-complexity tasks occurring alongside routine ones. Conventional large language model (LLM) deployments typically process all inputs uniformly, leading to brittle, costly, and computationally inefficient systems. In this work, we introduce a set of complementary runtime techniques—Dynamic Prompting, Dynamic Context Control (DCC), and Dynamic Model Selection—that enable agentic AI systems to adapt dynamically to task heterogeneity and effectively address the long-tail distribution inherent in real-world agentic workloads.

Scalable Strategies for Agentic-AI to Handle Long-Tail Enterprise Use Cases

Agentic AI systems depend on accurate perception and grounding of enterprise data to plan and act reliably. Yet most organizational datasets remain locked in schema-level representations that lack semantic linkage to business meaning, limiting transparency, trust, and regulatory compliance. This paper introduces a semantic correlation framework that serves as the perceptual grounding layer for agentic AI in enterprise environments. Built on efficient small language models, the framework aligns database columns with business terms and textual descriptions to establish vertical data lineage from structured metadata. A human-in-the-loop pipeline is proposed for real-world deployment, in which high-confidence correlations are automated while the remaining mappings are routed for expert review, ensuring traceability, explainability, and continuous refinement. The approach enhances data visibility, auditability, and governance, supports regulatory mandates such as GDPR, and enables trustworthy, small-model deployment within agentic enterprise AI systems.

Grounding Enterprise Data for Agentic AI: A Semantic Approach to Vertical Data Lineage using Small Language Models

The growing demand for multimodal retrieval-augmented generation (RAG) has exposed critical limitations in existing frameworks, which often rely on static routing or heuristic decision rules, leading to inefficiency and poor adaptability to query-specific modality needs. To address these challenges, we present FusionMind, a modular multimodal RAG architecture that combines differentiable modality selection and agent-based iterative reasoning. At the core of FusionMind is a Gumbel–Softmax selector that learns discrete, query-adaptive modality routing across text, layout, chart, and table evidence. This selector is integrated into a three-agent reasoning pipeline (Seeker, Inspector, and Synthesizer) that progressively filters, validates, and synthesizes multimodal evidence. Unlike heuristic or Gaussian mixture-based baselines, FusionMind enables supervised training of routing policies and transparent reasoning dynamics. We evaluate FusionMind on the ViDoSeek benchmark, a challenging dataset requiring cross-modal retrieval and multi-hop reasoning. Compared to the ViDoRAG baseline, FusionMind achieves substantial improvements across standard retrieval metrics, including Recall@1 (+67.9 pp), nDCG@all (+37.5 pp), and MRR@all (+49.1 pp). Category-wise results show strong absolute performance on harder query types such as multi-hop, chart, and table; because distinct per-category baseline scores are unavailable, we report absolute levels without deltas. These results demonstrate that unifying differentiable routing with iterative agent reasoning offers a scalable and interpretable path forward for multimodal RAG, significantly advancing performance on visually rich document understanding tasks.
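For readers unfamiliar with the Gumbel-Softmax trick named in this abstract, the following is a minimal forward-pass sketch in plain Python. It is not FusionMind's implementation: the logits, the temperature `tau`, and the helper names are hypothetical, and only the four modality names come from the abstract. The sketch shows sampling only; in training, the relaxed weights would be produced inside an autodiff framework so that gradients can reach the selector.

```python
import math
import random

MODALITIES = ["text", "layout", "chart", "table"]  # modalities named in the abstract


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample relaxed one-hot weights over the entries of `logits`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1);
        # u is clamped away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_choice(weights):
    """Index of the largest weight, i.e. the discrete routing decision."""
    return max(range(len(weights)), key=weights.__getitem__)


rng = random.Random(0)
logits = [0.2, -1.0, 2.5, 0.7]  # hypothetical per-modality selector scores
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
chosen = MODALITIES[hard_choice(weights)]
print(dict(zip(MODALITIES, weights)), "->", chosen)
```

Lower values of `tau` push the weights toward a one-hot vector, while higher values spread them more evenly across modalities.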

FusionMind: A Differentiable and Efficient Multi-Modal Retrieval-Augmented Generation Framework

As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not the most efficient, models hallucinate domain knowledge despite tool access, and pass$^k$ results show a 20% drop in performance. The results from UNDERWRITE demonstrate that expert involvement in benchmark design is essential for realistic agent evaluation, that common agentic frameworks exhibit brittleness that skews performance reporting, and that hallucination detection in specialized domains demands compositional approaches. Our work provides insights for developing benchmarks that better align with enterprise deployment requirements.
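The abstract does not define its pass^k metric. One common reading, assumed here, is the probability that an agent solves a task in all of k independent attempts, which can be estimated from n recorded trials with c successes as C(c, k) / C(n, k). The sketch below implements that estimator; the function names and the example counts are illustrative and are not taken from UNDERWRITE.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the chance that k trials drawn without replacement all succeed.

    Assumes the definition pass^k = C(c, k) / C(n, k) for one task with
    n = num_trials recorded attempts and c = num_successes successes.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task, num_trials, k):
    """Average the per-task estimate over all tasks in a benchmark."""
    scores = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(scores) / len(scores)


# Hypothetical results: three tasks, 8 trials each.
successes = [8, 6, 3]
for k in (1, 2, 4):
    print(k, round(benchmark_pass_hat_k(successes, num_trials=8, k=k), 3))
```

With k = 1 the estimate reduces to the mean success rate, and it can only stay flat or fall as k grows, which is why the metric is used as a consistency check rather than a best-case score.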

Benchmarking Agents in Insurance Underwriting Environments

Reliable Agentic AI in enterprise environments requires both effective agent orchestration and strong domain grounding, as general-purpose models often fail to handle specialised terminology and complex workflows. This work presents an enterprise framework for offshore energy operations that integrates a fine-tuned domain-specific embedding model with a multi-agent architecture for reliable retrieval-augmented reasoning. The system introduces the first open benchmark for evaluating retrieval and reasoning reliability in offshore energy technical workflows. Evaluation on industrial data shows a 32% improvement in retrieval accuracy, an 85% reduction in numerical failures, and near-perfect faithfulness of 0.97 in deployed RAG systems, while routing accuracy increases from 81% to 92% with the intent classification agent. These results demonstrate that combining domain-adapted embeddings with coordinated multi-agent reasoning enables transparent, trustworthy, and reproducible Agentic AI for safety-critical enterprise applications.
