Vision-Language Retrieval (VLR) aims to retrieve relevant visual or textual information from multimodal data using language or image queries. However, traditional VLR methods often rely on data-driven shallow semantic alignment and fail to capture the deeper structural and fine-grained entity features of queries, resulting in poor performance on multi-entity layouts and challenging entities. In this paper, we propose the Layout-Aware and Sketch-Enhanced (LASE) VLR framework, which refines query representations by incorporating multimodal layout and sketch knowledge. Specifically, layout knowledge encodes the spatial arrangement of entities, while sketch knowledge refines entity perception by capturing essential structural details. To extract these knowledge representations, we leverage the strong semantic understanding of Large Language Models (LLMs) for layout generation and the fine-grained cross-modal generative capabilities of Diffusion Models (DMs) for sketch generation. However, integrating knowledge into queries may introduce biases and query-specific preferences due to varying visual content and knowledge demands. To address this, we propose the Gated Dual-Stream Knowledge Module (GDKM), which consists of a multi-instance fusion network and a sample-aware gating network. The fusion network aggregates diverse knowledge using multi-head attention to reduce bias, while the gating network adjusts knowledge weights based on query characteristics. Extensive experiments demonstrate that LASE significantly enhances VLR performance across multiple benchmarks, with superior generalization and transferability.
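
To make the GDKM description concrete, below is a minimal PyTorch sketch of a gated dual-stream knowledge module: each knowledge stream (layout and sketch) is fused via multi-head attention over multiple knowledge instances, and a sample-aware gate predicts per-stream weights from the query embedding. All class and variable names, dimensions, and the exact fusion and gating formulations are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a Gated Dual-Stream Knowledge Module (GDKM).
# Names, shapes, and the fusion/gating details are assumptions for clarity.
import torch
import torch.nn as nn


class GatedDualStreamKnowledgeModule(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Multi-instance fusion: the query attends over several knowledge
        # instances per stream, so no single (possibly biased) instance dominates.
        self.layout_fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sketch_fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Sample-aware gating: predicts how much layout vs. sketch knowledge
        # each individual query should receive.
        self.gate = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 2), nn.Softmax(dim=-1),
        )

    def forward(self, query, layout_knowledge, sketch_knowledge):
        # query:            (B, D)    query embedding
        # layout_knowledge: (B, N, D) N layout-knowledge instances
        # sketch_knowledge: (B, M, D) M sketch-knowledge instances
        q = query.unsqueeze(1)                                    # (B, 1, D)
        layout_fused, _ = self.layout_fusion(q, layout_knowledge, layout_knowledge)
        sketch_fused, _ = self.sketch_fusion(q, sketch_knowledge, sketch_knowledge)
        layout_fused = layout_fused.squeeze(1)                    # (B, D)
        sketch_fused = sketch_fused.squeeze(1)                    # (B, D)

        # Query-dependent weights decide each stream's contribution
        # to the refined query representation.
        w = self.gate(query)                                      # (B, 2)
        refined = query + w[:, :1] * layout_fused + w[:, 1:] * sketch_fused
        return refined


# Example usage with random tensors (batch of 4 queries, 3 layout and
# 2 sketch knowledge instances each).
if __name__ == "__main__":
    gdkm = GatedDualStreamKnowledgeModule(dim=512)
    q = torch.randn(4, 512)
    layout_k = torch.randn(4, 3, 512)
    sketch_k = torch.randn(4, 2, 512)
    print(gdkm(q, layout_k, sketch_k).shape)  # torch.Size([4, 512])
```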
