Singapore

For industrial-scale Text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to balance recall with noise, or scale poorly to large databases. We present \textbf{AutoLink}, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink's superior performance, achieving state-of-the-art strict schema linking recall of \textbf{97.4\%} on Bird-Dev and \textbf{91.2\%} on Spider-2.0-Lite, with competitive execution accuracy, i.e., \textbf{68.7\%} EX on Bird-Dev (better than CHESS) and \textbf{34.9\%} EX on Spider-2.0-Lite (ranked 2nd on the official leaderboard). Crucially, AutoLink exhibits \textbf{exceptional scalability}, \textbf{maintaining high recall}, \textbf{efficient token consumption} and \textbf{robust execution accuracy} on large schemas (e.g., over 3,000 columns) where existing methods severely degrade. Extensive experimental results validate AutoLink as a robust, highly scalable, and high-recall schema linking solution for industrial Text-to-SQL systems.
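The abstract describes schema linking as a loop in which the linked subset grows step by step and the full schema is never placed in the prompt. The sketch below is only a toy illustration of that general idea, not AutoLink's actual algorithm: the schema, the `search_columns` retrieval action, and the `keyword_policy` stub that stands in for the LLM are all hypothetical names and assumptions introduced here.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy schema (table -> columns), for illustration only.
SCHEMA: Dict[str, List[str]] = {
    "customers": ["customer_id", "name", "country"],
    "orders": ["order_id", "customer_id", "order_date", "total_amount"],
    "products": ["product_id", "title", "price"],
}


def search_columns(keyword: str) -> Set[str]:
    """Assumed retrieval action: columns whose table or column name contains the keyword."""
    kw = keyword.lower()
    return {
        f"{table}.{col}"
        for table, cols in SCHEMA.items()
        for col in cols
        if kw in col.lower() or kw in table.lower()
    }


def keyword_policy(question: str, linked: Set[str]) -> List[str]:
    """Stand-in for the LLM: propose one question word not yet covered by the linked subset."""
    words = [w.strip("?,.").lower() for w in question.split()]
    candidates = [w for w in words if len(w) > 4]
    uncovered = [w for w in candidates if not any(w in c for c in linked)]
    return uncovered[:1]


def explore(
    question: str,
    propose: Callable[[str, Set[str]], List[str]],
    max_steps: int = 5,
) -> Set[str]:
    """Iteratively expand the linked subset; only search results are ever inspected."""
    linked: Set[str] = set()
    for _ in range(max_steps):
        keywords = propose(question, linked)
        if not keywords:
            break  # the policy considers the subset sufficient
        found = set().union(*(search_columns(k) for k in keywords))
        new = found - linked
        if not new:
            break  # no progress, stop exploring
        linked |= new
    return linked


if __name__ == "__main__":
    print(sorted(explore("Total amount per customer country?", keyword_policy)))
```

In a real system the policy would be an LLM call and the retrieval action would query an index over the database; the point of the sketch is only that each step sees the current subset plus a small search result, so prompt size does not grow with the number of columns in the database.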

AAAI 2026

AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

nlp: code generation / program synthesis from natural language

nlp: prompt engineering / prompting

nlp: (large) language models

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples, and recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model's ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we also propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.

Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation

The rapid expansion of the Internet of Things (IoT) has created a growing demand for large-scale sensor deployment. However, the high cost of physical sensors limits the scalability and coverage of sensor networks, making fine-grained sensing difficult. Inductive Spatio-Temporal Kriging (ISK) addresses this challenge by introducing virtual sensors that infer measurements from physical sensors, typically using graph neural networks (GNNs) to model their relationships. Despite its promise, current ISK methods often rely on standard message-passing and generic architectures that fail to effectively capture spatio-temporal features or represent virtual nodes accurately. Additionally, existing graph construction techniques suffer from sparse and noisy connections, further hindering performance. To address these limitations, we propose DarkFarseer, a novel ISK framework with three key innovations. First, the Style-enhanced Temporal-Spatial architecture adopts a temporal-then-spatial processing scheme with a temporal style transfer mechanism to enhance virtual node representations. Second, Regional-semantic Contrastive Learning improves representation learning by aligning virtual nodes with regional component patterns. Third, the Similarity-Based Graph Denoising Strategy mitigates the influence of noisy edges by leveraging temporal similarity and regional structure. Extensive experiments on real-world datasets demonstrate that DarkFarseer significantly outperforms state-of-the-art ISK methods.

DarkFarseer: Robust Spatio-Temporal Kriging Under Graph Sparsity and Noise

An interesting phenomenon arises: Empirical Risk Minimization (ERM) sometimes outperforms methods specifically designed for out-of-distribution tasks. This motivates an investigation into the reasons behind such behavior beyond algorithmic design. In this study, we find that one such reason lies in the distribution shift across training domains. A large degree of distribution shift can lead to better performance even under ERM. Specifically, we derive several theoretical and empirical findings demonstrating that distribution shift plays a crucial role in model learning and benefits learning invariant prediction. First, the proposed upper bounds indicate that the degree of distribution shift directly affects the generalization ability of the learned models. If the shift is large, the generalization ability of the learned models can increase, approximating invariant prediction models that make stable predictions under arbitrary known or unseen domains; if it is small, the opposite holds. Moreover, we prove that under certain data conditions, ERM solutions can exhibit performance comparable to that of invariant prediction models. Second, empirical validation shows that the predictions of the trained models approach the ground-truth labels as the degree of distribution shift in the training data increases.

Distribution Shift Is Key to Learning Invariant Prediction

Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to \textbf{reasoning overconfidence}: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce \textit{MuSoBench}, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the \textbf{cognitive-rigidity hypothesis}, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.

Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks

In the domain of moment retrieval, accurately identifying temporal segments within videos based on natural language queries remains challenging. Traditional methods often employ pre-trained models that struggle with fine-grained information and deterministic reasoning, leading to difficulties in aligning with complex or ambiguous moments. To overcome these limitations, we explore Deep Evidential Regression (DER) to construct a vanilla Evidential baseline. However, this approach encounters two major issues: the inability to effectively handle modality imbalance and the structural differences in DER's heuristic uncertainty regularizer, which adversely affect uncertainty estimation. This misalignment results in high uncertainty being incorrectly associated with accurate samples rather than challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a novel framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task to enhance text sensitivity, thereby reducing bias in uncertainty estimation. Additionally, we introduce a Geom-regularizer to refine uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive testing on standard datasets and debiased datasets ActivityNet-CD and Charades-CD demonstrates significant enhancements in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval.

Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval

Stickers are widely used in online communication to convey emotions and implicit intentions. The Sticker Response Selection (SRS) task aims to select the most contextually appropriate sticker based on the dialogue. However, existing methods typically rely on semantic matching and model emotional and intentional cues separately, which can lead to mismatches when emotions and intentions are misaligned. To address this issue, we propose **E**motion and **I**ntention **G**uided **M**ulti-Modal **L**earning (**EIGML**). This framework is the first to jointly model emotion and intention, effectively reducing the bias caused by isolated modeling and significantly improving selection accuracy. Specifically, we introduce a Dual-Level Contrastive Framework to perform both intra-modality and inter-modality alignment, ensuring consistent representation of emotional and intentional features within and across modalities. In addition, we design an Intention-Emotion Guided Multi-Modal Fusion module that integrates emotional and intentional information progressively through three components: Emotion-Guided Intention Knowledge Selection, Intention-Emotion Guided Attention Fusion, and a Similarity-Adjusted Matching Mechanism. This design injects rich, effective information into the model and enables a deeper understanding of the dialogue, ultimately enhancing sticker selection performance. Experimental results on two public SRS datasets show that EIGML consistently outperforms state-of-the-art baselines, achieving higher accuracy and a better understanding of emotional and intentional features. Code is provided in the supplementary materials.

Emotion and Intention Guided Multi-Modal Learning for Sticker Response Selection

In medical image classification, data privacy constraints and the high cost of expert annotations pose significant challenges to building generalizable models. Federated semi-supervised learning (FSSL), which combines the privacy-preserving nature of federated learning with the label efficiency of semi-supervised learning, offers a promising direction. However, in real-world deployments, client data often exhibits highly non-independent and identically distributed (Non-IID) characteristics. This distributional heterogeneity undermines the reliability of pseudo-labels generated by global models, ultimately limiting model generalization. A key limitation of existing FSSL approaches lies in their reliance on a static labeled set fixed prior to training. Such strategies lack the ability to adaptively correct pseudo-label noise or address class imbalance throughout training, particularly under Non-IID settings. To address this, we propose FSSAL, a novel framework that introduces an active learning component into the FSSL pipeline. By continuously identifying informative and representative samples during training, our method adaptively refines the labeled set and enhances the model’s robustness to distribution shifts. FSSAL employs client-private models for pseudo-label generation to reduce global bias, applies a class-aware dynamic thresholding mechanism to ensure more reliable and balanced label selection, and incorporates a sample selection strategy guided by both feature diversity and model uncertainty. Extensive experiments on four public medical image classification datasets demonstrate that FSSAL consistently outperforms competitive FSSL methods in accuracy and F1-score, especially under highly Non-IID conditions, highlighting its robustness and practical potential.

Class-Aware Active Annotation in Federated Semi-Supervised Learning for Medical Image Classification

Asynchronous distributed learning is crucial for training large-scale deep models, especially when the computing capabilities of the workers in the cluster are heterogeneous. 
To reduce communication frequency, local updates are widely adopted in distributed learning. Meanwhile, momentum SGD (MSGD) serves as a foundational optimizer due to momentum's key role in accelerating convergence and enhancing generalization. However, how to implement asynchronous distributed MSGD with local updates remains unexplored.
To solve this problem, we propose a novel method, called \underline{or}dered \underline{lo}cal \underline{mo}mentum (OrLoMo), for asynchronous distributed learning. 
In OrLoMo, each worker runs MSGD locally. Then the local momentum from each worker will be aggregated by the server in order based on its global iteration index. To the best of our knowledge, OrLoMo is the first method to implement asynchronous distributed MSGD with local updates. We prove the convergence of OrLoMo for non-convex problems under arbitrary delays. Experiments validate that OrLoMo can outperform its synchronous counterpart and other asynchronous methods.
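The abstract states that each worker runs MSGD locally but does not give the server-side aggregation rule, so the sketch below covers only the worker-side part: a few steps of standard heavy-ball momentum SGD on a scalar parameter. The function name `local_msgd`, the scalar setting, and the hyperparameter values are assumptions made here for illustration; how OrLoMo orders and aggregates the returned momentum is not shown.

```python
from typing import Callable, Tuple


def local_msgd(
    w: float,
    grad_fn: Callable[[float], float],
    lr: float = 0.1,
    beta: float = 0.9,
    steps: int = 5,
) -> Tuple[float, float]:
    """Run `steps` local momentum-SGD updates; return the parameter and the local momentum."""
    u = 0.0  # local momentum buffer, reset at the start of the local round
    for _ in range(steps):
        u = beta * u + grad_fn(w)  # accumulate momentum from the current gradient
        w = w - lr * u             # parameter step along the momentum direction
    return w, u


if __name__ == "__main__":
    # Toy objective f(w) = 0.5 * w**2, whose gradient is w.
    w, u = local_msgd(1.0, lambda x: x, steps=2)
    print(w, u)
```

In a distributed setting, the pair returned here is what a worker would send to the server after its local round, which is why the function exposes the momentum buffer rather than only the updated parameter.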

Ordered Local Momentum for Asynchronous Distributed Learning Under Arbitrary Delays

Auto-regressive (AR)-based decoders, owing to their flexibility in handling variable-length outputs and their strong capability in modeling character-level dependencies, have emerged as the predominant decoding paradigm in the field of scene text recognition (STR). However, AR-based decoders suffer from attention drift, slow decoding speed, and difficulty capturing global dependencies, restricting their performance in various scenarios. In this paper, we propose a novel paradigm for AR-based decoding, called One-Token to Sequence (One2Seq), to address the above issues. Unlike existing methods, we encode the semantic features into a single context token and design a One-Token Wise Decoder to perform the decoding, which alleviates the attention drift caused by the accumulation of semantic information. Moreover, we propose Positional-aware Hash Embedding to embed the decoded characters, ensuring that order information is retained in the context token. By continuously updating this token, One2Seq fully leverages the decoded semantic information while avoiding the computational overhead associated with the growing query sequence. Furthermore, to leverage global information for decoding, we propose Dynamic Global Infusion to dynamically integrate global visual features into the context token. Equipped with the enriched context token, the model has an enhanced ability to extract discriminative local features under the guidance of global context, thereby enhancing recognition accuracy. Extensive experiments reveal that, with its ingenious design, One2Seq exhibits marked superiority in both accuracy and decoding speed compared to existing STR models.

One2Seq: One-Token Wise Decoder for Efficient Scene Text Recognition

Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in addressing open-world segmentation tasks. However, the substantial computational cost of the LLM components presents a significant challenge, especially in segmentation tasks, where efficiency has long been a central concern. Existing efficient MLLM approaches typically reduce computation cost by pruning visual tokens in the early layers, as they account for the majority of the input sequence. Despite its efficiency, this pruning is incompatible with dense prediction tasks such as segmentation, since removing visual tokens leads to the loss of essential object parts and spatial details. To better understand the roles of visual tokens in segmentation, we analyze the attention weights of both image and mask tokens within the LLM. We find that image tokens are important throughout all layers, whereas mask tokens only attend to image tokens at deeper layers. Based on this observation, we build an efficient MLLM-based segmentation framework by introducing a token routing strategy. This strategy dynamically determines when and how different tokens participate in computation. Mask tokens are inserted only at deeper layers of the LLM to reduce redundant computation, since they rarely attend to image tokens in early layers. For image tokens, only a small number of them, named proxies, are updated via full feedforward network (FFN) computation, while the update of the remaining tokens is guided by these proxies, i.e., efficiently computed through a lightweight projector applied to the difference of the proxies during their update. Our method achieves a 1.5$\times$ acceleration over the original LLM process by reducing its FLOPs to 56\%, while maintaining the same segmentation performance.
