Large Multimodal Models (LMMs) often hallucinate objects and struggle with compositional reasoning in complex visual scenes. Structured Scene Graph (SG) representations, which explicitly encode objects, attributes, and relations, can mitigate these issues; however, fine-tuning on them risks catastrophic forgetting. Recent zero-shot approaches prompt LMMs with scene graphs, yet they typically rely on a single SG generated in one step, limiting their ability to capture both holistic context and question-specific details. We introduce a Dual-Layer Scene Graph Chain-of-Thought (DLSG-CoT) framework that enriches reasoning by combining two structured SGs: a Global Scene Graph (G-SG) that offers comprehensive image context, and a Query-Specific Scene Graph (Q-SG) produced through a two-step process that targets information relevant to the input query. Extensive experiments demonstrate that DLSG-CoT substantially improves LMM performance on compositional and context-sensitive tasks.
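To make the two-layer design concrete, the sketch below outlines how such a zero-shot prompting pipeline could be wired up: a one-step G-SG, a two-step Q-SG, and a final chain-of-thought prompt that combines both graphs. The `query_lmm` helper and all prompt wordings are illustrative assumptions, not the paper's actual prompts or interface.

```python
# Minimal sketch of a dual-layer scene-graph CoT pipeline, assuming a
# generic text-in/text-out multimodal model behind `query_lmm`. Prompt
# wording is hypothetical; only the control flow mirrors the abstract.

def query_lmm(image, prompt: str) -> str:
    """Placeholder for a call to a large multimodal model."""
    raise NotImplementedError

def generate_global_sg(image) -> str:
    # One-step G-SG: a comprehensive graph over the whole image.
    return query_lmm(
        image,
        "List all objects in the image, with their attributes and "
        "pairwise relations, as scene-graph triples.",
    )

def generate_query_sg(image, question: str) -> str:
    # Q-SG step 1: identify the entities and relations the question needs.
    targets = query_lmm(
        image,
        "Which objects, attributes, and relations are needed to answer "
        f"this question? Question: {question}. List them only.",
    )
    # Q-SG step 2: build a focused scene graph restricted to those targets.
    return query_lmm(
        image,
        f"Describe only these elements as scene-graph triples: {targets}",
    )

def answer_with_dlsg_cot(image, question: str) -> str:
    g_sg = generate_global_sg(image)
    q_sg = generate_query_sg(image, question)
    # Final chain-of-thought answer grounded in both graphs.
    prompt = (
        f"Global scene graph:\n{g_sg}\n\n"
        f"Query-specific scene graph:\n{q_sg}\n\n"
        f"Question: {question}\n"
        "Reason step by step using both graphs, then give the answer."
    )
    return query_lmm(image, prompt)
```

Note that this is purely zero-shot prompting: the model's weights are never updated, which is how the approach sidesteps the catastrophic forgetting associated with fine-tuning.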
