Singapore

Although LLMs can generate tools for generic domains and tasks, they struggle with enterprise-related domains that involve proprietary APIs and data schemas. We present ToolSmith, a framework for autonomously generating and validating agent-compatible tools. Given an API specification and a Tool Specification Requirement (TSR), ToolSmith produces a tool function and verifies it through a closed-loop process: it creates natural language (NL) tests and executes the tool in a secure agent sandbox for validation. For state-changing tools, ToolSmith confirms outcomes by querying the API with parameters derived from the NL tests. If the tool fails to produce the desired output, ToolSmith generates diagnostic feedback to iteratively regenerate it. By ensuring both functional correctness and agent compatibility, ToolSmith enables reliable automation of enterprise workflows.
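The closed-loop process described above can be illustrated with a minimal sketch. It is not ToolSmith's implementation: the names `run_tests`, `closed_loop`, and `toy_generator` are hypothetical, the tests are plain (arguments, expected value) pairs rather than natural-language tests, and the sandbox and API queries are not modeled. It only shows the generate, validate, and regenerate-with-feedback cycle.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins: a "tool" is any callable, a test is (args, expected).
Tool = Callable[..., object]
TestCase = Tuple[tuple, object]


def run_tests(tool: Tool, tests: List[TestCase]) -> List[str]:
    """Return one diagnostic string per failing test (empty list = all pass)."""
    feedback = []
    for args, expected in tests:
        try:
            actual = tool(*args)
        except Exception as exc:  # a crashing tool is also a failed test
            feedback.append(f"{args}: raised {exc!r}")
            continue
        if actual != expected:
            feedback.append(f"{args}: expected {expected!r}, got {actual!r}")
    return feedback


def closed_loop(
    generate: Callable[[List[str]], Tool],
    tests: List[TestCase],
    max_rounds: int = 3,
) -> Optional[Tool]:
    """Regenerate the tool until every test passes or the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        tool = generate(feedback)          # generator sees the last diagnostics
        feedback = run_tests(tool, tests)  # validate the candidate
        if not feedback:
            return tool
    return None


def toy_generator(feedback: List[str]) -> Tool:
    """Stand-in for an LLM: a wrong first draft, a fixed one after feedback."""
    if not feedback:
        return lambda a, b: a - b  # deliberately wrong
    return lambda a, b: a + b


tests = [((1, 2), 3), ((0, 0), 0)]
tool = closed_loop(toy_generator, tests)
```

In this toy run the first draft fails one test, the diagnostic is passed back to the generator, and the second draft is accepted.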

AAAI 2026

ToolSmith: A Multi-Agent Framework for Enterprise Tool Creation

mas: other foundations of multi agent systems

mas: applications

mas: coordination and collaboration

app: software engineering

demo

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Developing new portfolio-management algorithms typically demands substantial programming effort, limiting rapid experimentation and excluding finance professionals without coding skills. Current robo-advisory tools offer pre-built but rigid strategies, restricting customization and experimentation. We introduce PortfolioPilot, an open-source, agentic platform that enables users to generate bespoke portfolio algorithms through natural-language descriptions. Leveraging the Anthropic Claude API, PortfolioPilot dynamically synthesizes executable TypeScript algorithms that run in the frontend with security validation. The system integrates real-time backtesting with historical market data, classical optimization algorithms (Markowitz, LSTM, ARIMA), and interactive performance visualizations.

PortfolioPilot: An Agentic Platform for Financial Portfolio Management Algorithm Development and Evaluation

The integration of Large Language Models (LLMs) into clinical applications presents transformative potential but is undermined by the critical risk of hallucination, the generation of plausible but factually incorrect information. Such failures pose a direct threat to patient safety and the integrity of clinical decision-making. To address this challenge, we introduce MHB, a novel and comprehensive benchmark framework designed to evaluate LLM reliability in two complex, high-stakes clinical contexts: multi-turn medical dialogues and clinical case report analysis. The core of our contribution is a systematic methodology for generating adversarial test cases by injecting "hallucination traps" into realistic medical data, guided by a fine-grained taxonomy of clinical errors.
MHB, comprising 4,695 samples and 20,288 evaluation rubrics, underwent a rigorous, two-stage validation by a panel of 60 licensed physicians from top-tier hospitals, ensuring high clinical realism and consistency. This comprehensive assessment of leading LLMs revealed significant, clinically relevant shortcomings across the board. Even the best-performing model, Claude-4-Sonnet, exhibited a hallucination rate of 29.1%, with some open-source models exceeding 57.0%. All models struggled with specific traps, like fabricated medical data or non-existent guidelines, highlighting prevalent systemic weaknesses.

MHB: Medical Hallucination Benchmark for Large Language Models in Complex Clinical Tasks

Accurate detection of offensive content on social media demands high-quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low-resource challenge, we propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling. Starting with a lightweight classifier trained on limited labeled data, our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models (MA-VLMs). Unlabeled data on which the classifier and MA-VLMs agree are designated as the Agreed-Unknown set, while conflicting samples form the Disagreed-Unknown set. To enhance label reliability, MA-VLMs simulate dual perspectives, moderator and user, capturing both regulatory and subjective viewpoints. The classifier is optimized using a novel Positive-Negative-Unlabeled (PNU) loss, which jointly exploits labeled, Agreed-Unknown, and Disagreed-Unknown data while mitigating pseudo-label noise. Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision and approaches the performance of large-scale models.
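The agreement-based split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `split_by_agreement` is a hypothetical helper, labels are taken to be integers (1 = offensive, 0 = not offensive), and the MA-VLM output is assumed to have already been reduced to one label per sample. How the moderator and user perspectives are combined, and the PNU loss itself, are not modeled.

```python
from typing import Dict, List, Tuple


def split_by_agreement(
    classifier_labels: Dict[str, int],
    vlm_labels: Dict[str, int],
) -> Tuple[Dict[str, int], List[str]]:
    """Partition unlabeled samples into Agreed-Unknown and Disagreed-Unknown.

    Agreed samples keep the shared pseudo-label; disagreed samples are
    returned as bare ids because neither label is trusted.
    """
    agreed: Dict[str, int] = {}
    disagreed: List[str] = []
    for sample_id, clf_label in classifier_labels.items():
        # A sample with no VLM label is treated as a disagreement (assumption).
        if vlm_labels.get(sample_id) == clf_label:
            agreed[sample_id] = clf_label
        else:
            disagreed.append(sample_id)
    return agreed, disagreed


classifier_out = {"img1": 1, "img2": 0, "img3": 1, "img4": 0}
vlm_out = {"img1": 1, "img2": 1, "img3": 1}
agreed, disagreed = split_by_agreement(classifier_out, vlm_out)
```

Here `img1` and `img3` land in the agreed set with pseudo-label 1, while `img2` (conflicting labels) and `img4` (no VLM label) land in the disagreed set.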

Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection

Clear visual information is indispensable for tasks such as autonomous navigation, ecological monitoring, and inspection in marine environments, yet underwater images are notoriously marred by colour casts, haze, and loss of detail. We present DRM-Net, a practical enhancement framework that turns this challenge into an explicit recovery problem. Instead of guessing the whole clean image, DRM-Net first predicts a Degradation Residual Map (DRM) that pinpoints, pixel by pixel, how much colour, contrast, and texture have been lost. Adding this residual back to the raw frame produces the restored result in a single, transparent step. A lightweight Subaquatic Multi-Scale Context Fusion module further enhances robustness by allowing the network to view the scene through multiple “water layers”, adaptively selecting the most relevant scale for each image. Guided jointly by a DRM residual L1-loss and a perceptual loss, DRM-Net delivers sharper edges and truer colours while adding only negligible computational overhead. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with state-of-the-art approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments across diverse underwater scenarios.
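The residual step described above, adding the predicted map back to the raw frame, can be sketched in a few lines. This is a sketch under assumptions rather than DRM-Net's code: `apply_residual` is a hypothetical name, the image is a single-channel nested list with intensities in [0, 1], the residual is supplied by hand instead of predicted by a network, and clipping to the valid range is an added assumption.

```python
from typing import List

Image = List[List[float]]


def apply_residual(raw: Image, residual: Image) -> Image:
    """Restore an image as raw + residual, clipped to [0, 1] (assumed range)."""
    return [
        [min(1.0, max(0.0, pixel + delta)) for pixel, delta in zip(raw_row, res_row)]
        for raw_row, res_row in zip(raw, residual)
    ]


raw_frame = [[0.25, 0.75], [0.5, 0.5]]
predicted_residual = [[0.5, 0.5], [-0.75, 0.0]]  # hand-written stand-in
restored = apply_residual(raw_frame, predicted_residual)
```

Because the correction is an explicit per-pixel map, it can be inspected directly to see where and by how much the frame was changed.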

DRM-Net: Explicit Residual Modelling with Subaquatic Multi-Scale Context Fusion for Underwater Image Enhancement

Biodiversity is declining globally at an unprecedented rate. Managers urgently need to allocate limited resources to control pest species where interventions have the highest ecological impact. However, many species are hard to detect, and data collection is often expensive, irregular, and incomplete, thus posing significant challenges for machine learning models that traditionally require large and regular datasets. We present a novel deep learning architecture that estimates the spatiotemporal abundance of hard-to-detect species from sparse, zero-inflated, and irregular observational data. Our method combines Graph Convolutional Networks (GCNs) to model spatial dependencies across monitoring sites with Recurrent Neural Networks (RNNs) to capture long-range temporal dynamics. This architecture explicitly addresses the challenges of ecological data sparsity, heterogeneity, and irregular sampling. We apply our model to the Crown-of-Thorns Starfish (COTS) on Australia's Great Barrier Reef, a species with devastating impact on coral reefs and a major target of pest control programs. Our method significantly outperforms baseline approaches and the current resource-intensive approach, manta-tow surveillance, in both accuracy and detectability. Simulations indicate a 20% increase in starfish removal efficiency over a year, enabling more effective coral protection. This work demonstrates how tailored deep learning methods can overcome ecological data limitations and substantially improve conservation outcomes. The code is available at https://github.com/XXX.

Leveraging Sparse Observations to Predict Species Abundance Across Space and Time

Augmented Reality (AR) and Multimodal Large Language Models (LLMs) are rapidly evolving, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for social engineering. In this paper, we present the first systematic investigation of the feasibility of orchestrating AR-driven Social Engineering attacks using Multimodal LLMs, via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses Multimodal inputs (visual, auditory, and environmental cues); (2) role-based Multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates contextual data while preserving character differentiation; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference interaction loops. To verify SEAR, we conducted an IRB-approved study with 60 participants in three experimental configurations (unassisted, AR+LLM, and full SEAR pipeline), compiling a new dataset of 180 annotated conversations in simulated social scenarios. Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants were susceptible to email phishing). The framework was particularly effective in building trust, with 85% of targets willing to accept an attacker’s call after an interaction. We also identified notable limitations, such as interactions being perceived as “occasionally artificial” due to authenticity gaps. This work provides a proof of concept for AR-LLM-driven social engineering attacks and insights for developing defensive countermeasures against next-generation augmented reality threats. The SEAR code and dataset are available at: https://github.com/2192537130/searsystem/tree/master.

On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks

Digital Twins (DTs) offer powerful tools for managing complex infrastructure systems, but their effectiveness is often limited by challenges in integrating unstructured knowledge. Recent advances in Large Language Models (LLMs) bring new potential to address this gap, with strong abilities in extracting and organizing diverse textual information. We therefore propose LSDTs (LLM-Augmented Semantic Digital Twins), a framework that helps LLMs extract planning knowledge from unstructured documents like environmental regulations and technical guidelines, and organize it into a formal ontology. This ontology forms a semantic layer that powers a digital twin—a virtual model of the physical system—allowing it to simulate realistic, regulation-aware planning scenarios. We evaluate LSDTs through a case study of offshore wind farm planning in Maryland, including its application during Hurricane Sandy. Results demonstrate that LSDTs support interpretable, regulation-aware layout optimization, enable high-fidelity simulation, and enhance adaptability in infrastructure planning. This work shows the potential of combining generative AI with digital twins to support complex, knowledge-driven planning tasks.

LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce **CARE-Bench**, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models. We intend to make the CARE-Bench suite publicly available upon acceptance of this paper.

CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement – semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
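The third stage above, turning pairwise affiliation decisions into group regions, could look roughly like the sketch below. The abstract does not describe MINGLE's aggregation algorithm, so connected components via union-find is used here as a stand-in. The names `group_people` and `enclosing_box`, the `(x1, y1, x2, y2)` box format, and the minimum group size of two are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def group_people(
    num_people: int,
    affiliated_pairs: Sequence[Tuple[int, int]],
    min_size: int = 2,
) -> List[List[int]]:
    """Merge people linked by pairwise affiliation into groups (union-find)."""
    parent = list(range(num_people))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in affiliated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for person in range(num_people):
        groups.setdefault(find(person), []).append(person)
    # Singletons are not social groups, so they are dropped.
    return sorted(g for g in groups.values() if len(g) >= min_size)


def enclosing_box(boxes: Sequence[Box]) -> Box:
    """Smallest axis-aligned box that contains every member box."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


person_boxes = [(0, 0, 10, 20), (8, 2, 18, 22), (15, 1, 25, 21),
                (50, 0, 60, 20), (58, 3, 68, 23), (90, 0, 100, 20)]
pairs = [(0, 1), (1, 2), (3, 4)]  # stand-in for VLM pairwise decisions
groups = group_people(len(person_boxes), pairs)
regions = [enclosing_box([person_boxes[i] for i in g]) for g in groups]
```

With these inputs, people 0 to 2 form one group, people 3 and 4 form another, and person 5 is left out as an unaffiliated individual.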

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Accelerating research in renewable energy policy is critical for addressing climate change and enabling informed decision-making. Question answering (QA) over public policy documents presents unique challenges due to their legal structure, conditional dependencies, and domain-specific vocabulary. In this paper, we introduce EvalQAG, a framework for generating high-quality QA pairs from renewable energy policy texts. EvalQAG combines structured prompts, retrieval-augmented inputs, and multi-stage evaluation using large language models (LLMs) to support accurate and diverse QA generation. Using this framework, we construct REPolicyQA, a domain-specific QA dataset comprising around 160,000 QA pairs from more than 1,000 U.S. renewable energy policy documents. The dataset covers five policy-relevant question types—Yes/No, Yes/No with Conditions, Factual, Legal Obligation, and Descriptive—capturing a wide range of reasoning patterns grounded in regulatory texts. We evaluate multiple QA models and uncover significant performance gaps, especially in legal reasoning and conditional inference—highlighting major shortcomings in current systems. Our results establish EvalQAG as a generalizable QA generation pipeline for policy texts and position REPolicyQA as a new benchmark for advancing QA research in policy and regulatory domains. We believe this work can foster impactful research in the renewable energy sector, particularly by enabling more robust and explainable QA systems for legal and condition-heavy regulatory documents.
