Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each step supports both Agent and Pipeline execution modes, so its modular design allows efficiency to be balanced against performance. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The code will be open-sourced after the review period.
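To make the two challenges concrete, here is a minimal sketch of schema linking as a two-stage problem: first pick a database from a pool, then keep only the schema items that look relevant to the question. This is not LinkAlign's method. It swaps in plain token overlap for the retrieval and extraction steps described above, and every name in it (`retrieve_database`, `ground_schema_items`, the toy `pool`) is a hypothetical chosen for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical schema representation: table name -> list of column names.
Schema = Dict[str, List[str]]


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; snake_case names split on underscores."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_database(question: str, schema: Schema) -> int:
    """Count question tokens that also appear in any table or column name."""
    names: Set[str] = set()
    for table, columns in schema.items():
        names |= tokens(table)
        for column in columns:
            names |= tokens(column)
    return len(tokens(question) & names)


def retrieve_database(question: str, pool: Dict[str, Schema]) -> str:
    """Challenge 1 (toy version): pick the database with the highest overlap."""
    return max(pool, key=lambda name: score_database(question, pool[name]))


def ground_schema_items(question: str, schema: Schema) -> List[Tuple[str, str]]:
    """Challenge 2 (toy version): keep columns sharing a token with the question."""
    q = tokens(question)
    return [
        (table, column)
        for table, columns in schema.items()
        for column in columns
        if tokens(column) & q
    ]


# Toy schema pool, invented for this example.
pool: Dict[str, Schema] = {
    "concerts": {
        "singer": ["singer_id", "name", "country"],
        "concert": ["concert_id", "venue", "year"],
    },
    "flights": {
        "airline": ["airline_id", "name"],
        "flight": ["flight_id", "origin", "destination"],
    },
}

question = "List the name and country of every singer"
target = retrieve_database(question, pool)
linked = ground_schema_items(question, pool[target])
print(target, linked)
```

Lexical overlap of this kind breaks down quickly on the ambiguous, redundant schemas the abstract describes (both toy databases have a `name` column, for instance), which is the gap that semantic retrieval and dedicated schema extraction aim to close.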
