China

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce textbfAgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink&#39;s core innovations include: textbf(i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; textbf(ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and textbf(iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model&#39;s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by textbf53.91\% and enhances answer accuracy by textbf33.54\%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

EMNLP 2025

AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

tool-augmented

chain-of-thought

autonomous driving

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce textbfAgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: textbf(i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; textbf(ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and textbf(iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by textbf53.91\% and enhances answer accuracy by textbf33.54\%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts.

On the Perception Bottleneck of VLMs for Chart Understanding

Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual scenarios. However, applying these methods to multilingual retrieval still encounters two primary challenges, cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compresses the identifier space, and we propose a dynamic multi-step constrained decoding strategy during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 5.01% on mNQ320k, while reducing document identifiers length by 74.51% and 78.2%, respectively. We publicly release our dataset and code at https://anonymous.4open.science/r/MGR-CSC-D01C

Multilingual Generative Retrieval via Cross-lingual Semantic Compression

Lexical semantic change has been investigated extensively through corpus studies across many languages, but its underlying principles remain hard to unravel through experimental paradigms due to their extended diachronic processes. This work introduces NeLLCom-Lex, a neural-agent framework to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. We use a well-established color naming task to study which factors lead agents to modulate their naming choices according to the situational context and maintain efficient and understandable lexical systems. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to `speak' an existing language can reproduce human-like patterns in color naming, supporting their use for simulating and explaining semantic change in real lexical systems.

NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use

Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, the capacity of reasoning has not been successfully transferred to non-high-resource languages due to resource constraints, which struggles with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves the performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling the models to understand the query in different languages more sophisticated. Besides, SoT effectively guides LLMs toward more concentrated reasoning to maintain consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks when adapting to various backbones of LLMs. It can also be integrated with other training-free strategies for further improvements. Our code is available at https://anonymous.4open.science/r/SoT-E461/.

SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models

The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self-reflective capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://anonymous.4open.science/r/HiMATE-Anony.

HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

Children learn how to speak with a low amount of data and can be taught new words on a few-shot basis, which makes them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test-set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We build two such test sets associated with the 10M words and 100M words BabyLM training sets respectively and evaluate 16 models from the BabyLM leaderboard. Our results show that: 1) model performances on LT-Swap exhibit sharp declines for rare words, 2) differences across models are more visible on rare words than on frequent words, and 3) increasing training data size while fixing the number of parameters improves performance for rare words. Finally, we also demonstrate that simple RAG-like methods can enhance rare word understanding, highlighting in-context learning capabilities even in small LMs. These findings underscore the importance of evaluating language models on the long tail of lexical distribution and open new directions for improving their robustness to rare linguistic phenomena. We open source the code that automatically build new instances of LT-Swap over other training datasets

LongTail-Swap: benchmarking language models' abilities on rare words

In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.

Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs

Multi-stage information retrieval (IR) has become a widely-adopted paradigm in search. While Large Language Models (LLMs) have been extensively evaluated as second-stage reranking models for monolingual IR, a systematic large-scale comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior work shows that LLM-based rerankers improve CLIR performance, their evaluation setup relies on machine translation (MT) for the first stage. This is not only prohibitively expensive but also prone to error propagation across stages. Our evaluation on passage-level and document-level CLIR reveals that this setup, which we term \textit{noisy monolingual IR}, is favorable for LLMs. However, LLMs still fail to improve the first-stage ranking if instead produced by multilingual bi-encoders. We further show that pairwise rerankers based on instruction-tuned LLMs perform competitively with listwise rerankers. To the best of our knowledge, we are the first to study the interaction between retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that, without MT, current state-of-the-art rerankers fall severely short when directly applied in CLIR.

Evaluating Large Language Models for Cross-Lingual Retrieval

Limited low-resource language corpora in professional domains like medicine hinder cross-lingual domain adaptation of pre-trained large language models (PLMs). While abundant English medical corpora could complement this scarcity, the effective mixture of English and target language, including machine-translated content, remains underexplored. We examined how linguistic features (e.g., token sizes and language proportions) affect performance on a Japanese–English medical knowledge benchmark. Through continued pre-training of a bilingual PLM on multilingual corpora with varying proportions of English and Japanese texts (both original and machine-translated), we analyzed correlations between linguistic features and fine-grained task performance. Our findings suggest a practical approach to optimizing multilingual corpora for cross-lingual domain adaptation, which requires leveraging specialized knowledge from English corpora while ensuring sufficient coverage of language-specific expressions in a target language (Japanese). Such insights will contribute to the development of multilingual models that effectively leverage English-language resources in various professional domains with low-resource languages.

Leveraging High-Resource English Corpora for Cross-lingual Domain Adaptation in Low-Resource Japanese Medicine via Continued Pre-training

As domain-specific datasets continue to expand, Large Language Models (LLMs) have achieved significant improvements across various fields through supervised fine-tuning (SFT). However, is more data always better for model fine-tuning? Through a series of controlled experiments, we discover that dataset structure—rather than mere size—plays a decisive role in enhancing LLM reasoning capabilities. While existing methods acknowledge that good data quality can make training more efficient, they primarily rely on simple heuristic strategies and lack systematic, quantitative frameworks for evaluating data quality. To address this gap, we introduce MCSQ—the first multi-dimensional quantitative framework for reasoning data management. MCSQ rigorously evaluates and optimizes datasets along six orthogonal dimensions. Through comprehensive controlled experiments, we find that selectively incorporating "distorted" (model-disagreed) or "mismatched" (low-relevance) samples—which are typically discarded in traditional approaches—can outperform conventional "clean" data on certain advanced reasoning benchmarks. Our findings challenge traditional assumptions about data "quality" in LLM fine-tuning and provide actionable, quantitative guidance for efficient, structure-aware dataset management.

Downloads

Next from EMNLP 2025

On the Perception Bottleneck of VLMs for Chart Understanding

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES