Singapore

Vision-and-Language navigation on websites requires agents to navigate target webpages and answer questions based on human instructions. Current web agents primarily leverage Large Language Models (LLMs) for semantic understanding and reasoning, but still suffer from limited navigation performance and slow inference speed. Constructing a global map across webpages can effectively enhance both navigation accuracy and efficiency, however, this is challenged by the open structure of web navigation graphs and the dynamic nature of web layouts. In this paper, we propose ATLAS: Adaptive Topological Layout And Semantic mapping, a framework that adaptively constructs a time-varying, unbounded topological map across webpages and unifies heterogeneous elements through semantic representation. This enables both global path planning and local element selection for web-based navigation and question answering. As a lightweight approach, ATLAS significantly outperforms existing state-of-the-art methods on the WebVLN benchmark with a 10% improvement in success rate, and achieves the highest average task success rate on both the Mind2Web and WebArena benchmarks.

AAAI 2026

Lightweight Adaptive Topological Layout and Semantic Mapping in Vision-and-Language Navigation on Websites

dmkm: web

dmkm: knowledge acquisition from the web

and navigation

rob: localization

rob: motion and path planning

mapping

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Event linking aims to associate event mentions in text with their corresponding entries in a knowledge base (KB). This task can help text understanding to benefit downstream tasks (e.g., question answering and recommendation systems) and expand the KB through new event knowledge mentioned in the text. Existing event linking approaches usually adopt a retrieve-and-rank framework, which suffers from high computational costs and relies on hand-crafted rules, thereby limiting generalization. Additionally, it is found that some entity linking methods can be used to solve this task directly. However, they also perform not well. In this paper, we propose SEFEL, an end-to-end, argument-aware event representation-based event linking framework to unify the modeling of both in-KB and out-of-KB scenarios. To further enhance the linking performance, we propose a contrastive learning module to refine the learned embeddings of events and event mentions. Experimental results demonstrate that SEFEL improves accuracy by at least $3.59$ (in-KB) and $21.5$ (out-of-KB) compared with baselines, while its inference speed is more than $38$ times faster than baselines, showcasing its accuracy and efficiency.

SEFEL: A Simple Yet Effective Framework for Fast Event Linking

Fine-Grained Image Retrieval~(FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, such two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adapted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA performs well on three fine-grained datasets.

Fine-Grained Image Retrieval via Dual-Vision Adaptation

In multi-instance partial label learning (MIPL), each sample is a bag of multiple instances linked to a candidate label set containing one true and multiple false labels, yielding inexact supervision in both instance features and label space. However, existing works adopt decoupled approaches that focus exclusively on either instance-level feature fusion or label-level disambiguation, failing to fully exploit the intrinsic dependencies between these two spaces. To overcome this limitation, graph-based methods are widely recognized as a powerful paradigm in weakly supervised learning, yet their success hinges on reliable features—precisely what MIPL lacks due to instance-level noise.
To bridge this gap, we propose DualG, a novel framework that simultaneously addresses feature learning and label disambiguation through dual-level graph propagation.
Specifically, we construct dual relevance graphs at both the bag and instance levels. At the bag level, we build a similarity graph based on fused feature representations; at the instance level, we employ attention scores to filter out irrelevant instances and construct a reliable instance-level relevance graph. These complementary graphs enable our joint label disambiguation framework to simultaneously address inexact supervision signals in both instance space and label space.
Experimental results on five benchmark datasets demonstrate that DualG outperforms existing MIPL and partial label learning methods, validating its effectiveness and superiority.

Dual Graph Disambiguation for Multi-Instance Partial-Label Learning

Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation. The codes, data and pre-trained weights are available at https://anonymous.4open.science/r/TLV-CoRe.

Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities

Spatial transcriptomics (ST) enables joint profiling of gene expression and spatial positions, thereby revealing spatially resolved biological functions. However, many existing ST analysis methods often fail to explicitly quantify the belief and uncertainty in decisions caused by noisy ST data, making it difficult to handle spots of varying quality in a fine-grained manner. In addition, domain identification is a fundamental and critical task in ST, but commonly used models that separate expression learning and clustering often struggle to learn cluster-friendly latent representations effectively. To address these issues, we propose PREST, a prototype-based evidence-aware integration framework for ST data. PREST performs multi-scale representation learning with fine-grained attention fusion and introduces learnable class prototypes to quantify belief and uncertainty in model decisions. We aim to align overall belief scores with latent semantic information to enhance uncertainty quantification and prototype learning, thereby promoting the learning of clustering-friendly representations. PREST further integrates an uncertainty-aware reconstruction module and spatial regularization to reduce overfitting to unreliable spots and promote denoised, discriminative representations. Extensive experiments on several benchmark datasets validate the effectiveness and superiority of our proposed PREST across various downstream tasks.

Evidence-aware Integration and Domain Identification of Spatial Transcriptomics Data

Open-world object detection (OWOD) aims to detect known and unknown objects in dynamic environments. However, only known classes are labeled during training, making it challenging for detectors to recognize unknown objects during inference. Existing methods typically rely on supervision from known categories, leading models to overconfidently misclassify visually similar unknowns as known, and dissimilar ones as background. This known-class prior bias limits the model’s ability to detect unknown objects. In this paper, we propose a novel method, OW-DAR, which enhances foreground–background separability through collaborative fine-grained and coarse-grained modeling. At the fine-grained level, we propose Fine-grained Masked Reconstruction (FMR), which randomly masks regions of the feature map to guide the reconstruction toward semantic structures, rather than memorizing low-level patterns. At the coarse-grained level, we propose Adaptive Region-based Error Aggregation (AREA), which operates on object proposals to aggregate reconstruction errors. This enables the model to attend to semantically ambiguous foreground–background boundaries while suppressing the influence of local outliers during optimization. Finally, we leverage robust reconstruction errors to perform unsupervised foreground–background modeling, enabling probabilistic estimation for potential unknown objects. We validate the effectiveness of OW-DAR on standard OWOD benchmark. Experimental results demonstrate that OW-DAR consistently outperforms existing state-of-the-art methods, achieving a +18.8 improvement in unknown object recall (U-Recall). Our code will be released.

OW-DAR: Dual-Granularity Adaptive Reconstruction-Error Modeling for Open-World Object Detection

Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73\%, underscoring the value of modeling detailed reviewer–author debate structures.

Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates

With the rapid rise of large models, copyright protection for generated image content has become a critical security challenge. Although deep learning watermarking techniques offer an effective solution for digital image copyright protection, they still face limitations in terms of visual quality, robustness and generalization. To address these issues, this paper proposes an adaptive robust iterative watermarking framework (ARIW-Framework) that achieves high-quality watermarked images while maintaining exceptional robustness and generalization performance. Specifically, we introduce an iterative approach to optimize the encoder for generating robust residuals. The encoder incorporates noise layers and a decoder to compute robustness weights for residuals under various noise attacks. By employing a parallel optimization strategy, the framework enhances robustness against multiple types of noise attacks. Furthermore, we leverage image gradients to determine the embedding strength at each pixel location, significantly improving the visual quality of the watermarked images. Extensive experiments demonstrate that the proposed method achieves superior visual quality while exhibiting remarkable robustness and generalization against noise attacks.

ARIW-Framework: Adaptive Robust Iterative Watermarking Framework

Incomplete cross-modal retrieval (ICMR) requires models to recover missing modalities and robustly align heterogeneous ones for effective retrieval. Existing methods, however, fall short in both aspects. They often rely on limited semantic cues, such as single samples or coarse category prototypes, which compromises reconstruction quality. Moreover, these approaches are vulnerable to learning spurious cross-modal correlations, thereby impairing accurate alignment and hindering retrieval performance. To address these challenges, we propose Causality-Aligned Semantic Recovery (CASR), a novel method designed to both comprehensively restore missing modalities and mitigate spurious associations between vision and language. Our CASR involves two essential components: i) the Missing Modality Imagination (MMI) module, which combines category semantic priors with relevant contextual information to achieve high-quality semantic reconstruction; ii) the Explicit Causal Alignment (ECA) module, which explicitly learns environment-invariant attention, effectively eliminating the interference of spurious correlations and improving retrieval performance. Furthermore, we extend CASR to the challenging task of Partially Aligned Cross-Modal Retrieval, where we treat unlabeled unpaired data as a form of incomplete data. By leveraging MMI and ECA modules, we are able to learn robust representations in this setting. Extensive experiments on benchmark datasets under various missing rates demonstrate that CASR achieves superior robustness and retrieval performance. Code is provided anonymously in the supplementary material.

Causality-Aligned Semantic Recovery for Incomplete Cross-Modal Retrieval

Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially increases the task difficulty for current state-of-the-art models to at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. Our Dataset is in https://  github.com/Ffunkytao/LogicCat. 

Downloads

Next from AAAI 2026

SEFEL: A Simple Yet Effective Framework for Fast Event Linking

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

SEFEL: A Simple Yet Effective Framework for Fast Event Linking

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads