
We present a new benchmark for evaluating Deep Search, a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of only 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.

EMNLP 2025

Benchmarking Deep Search over Heterogeneous Enterprise Data

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Trends on microblogs often transcend linguistic boundaries, evolving into global phenomena with significant societal and economic impact. We address the novel task of predicting "cross-lingual trends" on microblogs: trends that originate in one language and subsequently become popular in others. While crucial for global monitoring and marketing, this area has been under-explored due to the challenge of cross-lingual trend identification. We introduce a methodology to overcome this by automatically constructing a dataset using Wikipedia's inter-language links to reconcile trend names. We propose a prediction model that leverages a rich feature set, including not only temporal frequency but also microblog content and external knowledge signals from Wikipedia (e.g., article content, pageviews). Our approach achieves higher accuracy than existing trend prediction methods and LLM-based approaches, enabling effective early detection of cross-lingual trends.
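The reconciliation step described above can be illustrated with a minimal sketch. It assumes inter-language links have already been collected into a lookup table from (language, trend name) to a shared concept id. The table contents, the `concept-…` ids, and the function name `find_cross_lingual_trends` are hypothetical and are not taken from the paper. The sketch covers only the identification of trends that appear in more than one language, not the prediction model.

```python
from typing import Dict, List, Tuple

# Toy stand-in for Wikipedia inter-language links:
# (language, trend name) -> shared concept id. Entries and ids are invented.
INTERLANGUAGE_LINKS: Dict[Tuple[str, str], str] = {
    ("en", "Cherry blossom"): "concept-1",
    ("ja", "サクラ"): "concept-1",
    ("en", "Typhoon"): "concept-2",
}


def find_cross_lingual_trends(
    observations: List[Tuple[int, str, str]],
) -> Dict[str, Tuple[str, List[str]]]:
    """Map each concept seen in 2+ languages to (origin language, later languages).

    Each observation is (day, language, trend name).
    """
    first_seen: Dict[str, Dict[str, int]] = {}
    for day, lang, name in observations:
        concept = INTERLANGUAGE_LINKS.get((lang, name))
        if concept is None:
            continue  # trend name cannot be reconciled, so skip it
        per_lang = first_seen.setdefault(concept, {})
        per_lang[lang] = min(day, per_lang.get(lang, day))

    result: Dict[str, Tuple[str, List[str]]] = {}
    for concept, per_lang in first_seen.items():
        if len(per_lang) < 2:
            continue  # trended in a single language only
        ordered = sorted(per_lang.items(), key=lambda item: (item[1], item[0]))
        origin = ordered[0][0]
        result[concept] = (origin, [lang for lang, _ in ordered[1:]])
    return result


observations = [
    (1, "ja", "サクラ"),
    (3, "en", "Cherry blossom"),
    (2, "en", "Typhoon"),
    (4, "en", "Unlinked topic"),
]
cross_lingual = find_cross_lingual_trends(observations)
```

In this toy input, only the concept that trends in both Japanese and English qualifies, with Japanese recorded as the origin because it appears there first.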

Predicting Cross-lingual Trends in Microblogs

The increasing complexity of modern driving systems demands efficient collection and analysis of specific driving scenarios that are crucial for system development and validation. Current approaches rely either on massive data collection followed by manual filtering, or on rigid threshold-based recording systems that often miss important edge cases. In this paper, we present Distributed Adaptive Scene Recognition (DASR), a novel multi-agent cloud-edge framework for language-guided scene detection in connected vehicles. Our system leverages the complementary strengths of cloud-based large language models and edge-deployed vision language models to intelligently identify and preserve relevant driving scenarios while optimizing limited on-vehicle buffer storage. The cloud-based LLM serves as an intelligent coordinator that analyzes developer prompts to determine which specialized tools and sensor data streams should be incorporated, while the edge-deployed VLM efficiently processes video streams in real time to make relevant decisions. Extensive experiments across multiple driving datasets demonstrate that our framework achieves superior performance compared to larger baseline models, with exceptional performance on complex driving tasks requiring sophisticated reasoning. DASR also shows strong generalization capabilities on out-of-distribution datasets and significantly reduces storage requirements (28.73%) compared to baseline methods.

DASR: Distributed Adaptive Scene Recognition - A Multi-Agent Cloud-Edge Framework for Language-Guided Scene Detection

For an e-commerce domain, the customer address is the single most important piece of customer data for ensuring accurate and reliable deliveries. In this two-part study, we first outline the construction of a language model to assist customers with address standardization and in the latter part, we detail a novel Pareto-ensemble multi-task prediction algorithm that derives critical insights from customer addresses to minimize operational losses arising from a given geographical area. Finally, we demonstrate the potential benefits of the proposed address intelligence system for a large e-commerce domain through large scale experiments on a commercial system.

An Address Intelligence Framework for E-commerce Deliveries

Adapting language models to learn continuously from data streams while retaining previous knowledge is a key challenge in artificial intelligence (AI), particularly in lifelong language learning. Existing distillation methods are based on offline techniques, limiting their ability to update in real time and adapt to dynamic environments. To address this, we propose online dynamic mutual distillation, a novel framework that enables continuous mutual learning from task streams without relying on domain-specific teachers. To our knowledge, this is the first application of mutual learning in lifelong language learning. Moreover, our extensive experiments demonstrate that the proposed method reduces catastrophic forgetting while improving task performance on various benchmark datasets, making it suitable for real-world, dynamic natural language processing (NLP) applications such as adaptive chatbots and personalized language systems. We will make our code publicly available upon acceptance.

L4: Mutual Learning Helps Lifelong Language Learning

The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model’s generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.

DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter VLM on synthetic judgments in a chart dataset to create **ChartJudge**. Experiments show that multi-criteria prompting exposes robustness gaps, leading to a large drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another, making it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
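Multi-criteria prompting, as described above, folds several evaluation criteria into one judge query instead of issuing one query per criterion. The following is a minimal sketch of that idea only. The criterion names, the prompt wording, the 1-to-5 scale, and the helper names `build_multi_criteria_prompt` and `parse_scores` are assumptions made for illustration, not the paper's actual prompts or code.

```python
from typing import Dict

# Hypothetical criteria; the abstract does not list the criteria actually used.
CRITERIA: Dict[str, str] = {
    "factual_correctness": "Does the answer agree with the values shown in the chart?",
    "relevance": "Does the answer address the question that was asked?",
    "clarity": "Is the answer clearly written?",
}


def build_multi_criteria_prompt(question: str, answer: str, criteria: Dict[str, str]) -> str:
    """Combine every criterion into a single judge query."""
    lines = [
        "You are judging an answer to a question about a chart.",
        f"Question: {question}",
        f"Answer: {answer}",
        "Rate the answer from 1 to 5 on each criterion below.",
    ]
    for index, (name, description) in enumerate(criteria.items(), start=1):
        lines.append(f"{index}. {name}: {description}")
    lines.append("Reply with one line per criterion in the form 'name: score'.")
    return "\n".join(lines)


def parse_scores(reply: str, criteria: Dict[str, str]) -> Dict[str, int]:
    """Read 'name: score' lines from a judge reply, skipping anything malformed."""
    scores: Dict[str, int] = {}
    for line in reply.splitlines():
        name, separator, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if separator and name in criteria and value.isdigit():
            scores[name] = int(value)
    return scores


prompt = build_multi_criteria_prompt("Which year had the highest sales?", "2021", CRITERIA)
scores = parse_scores("relevance: 4\nclarity: 5\nother: 3\nfactual_correctness: n/a", CRITERIA)
```

The single combined query is cheaper than one call per criterion, but, as the abstract notes, judges can be less robust when asked to score several criteria at once, so the parser deliberately tolerates missing or malformed lines.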

Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Cyberbullying (CB) involves complex relational dynamics that are often oversimplified as a binary classification task. Existing youth-focused CB datasets rely on scripted role-play, lacking conversational realism and ethical youth involvement, with little or no evaluation of their social plausibility. To address this, we introduce a **youth-in-the-loop** dataset, "**BullyBench**", developed by adolescents (ages 15–16) through an ethical co-research framework. We introduce a structured **intrinsic** quality evaluation with **experts-in-the-loop** (social scientists, psychologists, and content moderators) for assessing realism, relevance, and coherence in youth CB data. Additionally, we perform **extrinsic** baseline evaluation of this dataset by benchmarking encoder- and decoder-only language models for multi-class CB role classification for future research. A three-stage annotation process by young adults refines the dataset into a gold-standard test benchmark, a high-quality resource grounded in minors’ lived experiences of CB detection. Code and data are available for review at https://github.com/youthcodesign/emnl-industry-track-submission (please note, labels for BullyBench will be made available after review).

BullyBench: Youth & Experts-in-the-loop Framework for Intrinsic and Extrinsic Cyberbullying NLP Benchmarking

Rapid developments of large language models have revolutionized many NLP tasks on English data. Unfortunately, the models and their evaluations for low-resource languages are being overlooked, especially for languages in South Asia. Although there are more than 650 languages in South Asia, many of them either have very limited computational resources or are missing from existing language models. Thus, a concrete question to be answered is: *Can we assess the current stage and challenges to inform our NLP community and facilitate model developments for South Asian languages?* In this survey, we have comprehensively examined current efforts and challenges of NLP models for South Asian languages by retrieving studies since 2020 with a focus on transformer-based models, such as BERT, T5, and GPT. We present advances and gaps across three essential aspects (data, model, and tasks), such as available data sources, fine-tuning strategies, and domain applications. Our findings highlight substantial issues, such as missing data in critical domains (e.g., health), code-mixing, and missing standardized evaluation. Our survey will raise awareness within the NLP community for more targeted data curation, unify benchmarks tailored to cultural and linguistic nuances of South Asia, and encourage an equitable representation of South Asian languages.

Bhaasha, Bhāṣā, Zaban: A Survey for Low-Resourced Languages in South Asia – Current Stage and Challenges

Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the **first** parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria (Cultural Proximity, Cultural Neutrality, and Cultural Genuineness) to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.

Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews

Large Language Models (LLMs) have shown strong capabilities in zero-shot reasoning and generalization to new tasks. However, the zero-shot performance of general LLMs on complex tasks, such as multi-hop reasoning, remains suboptimal, while reasoning LLMs suffer from hallucinations and unfaithfulness. In this paper, to handle these limitations, we introduce a novel structure analysis method that helps LLMs better understand the question structure and guide the problem-solving process. We demonstrate that existing reasoning strategies, such as Chain-of-Thought and ReAct, significantly benefit from the LLM’s inherent understanding of semantic structure. We further ground our method in the theory of probabilistic graphical models to support its effectiveness. To enhance the reasoning process, we augment the structure analysis with refinement and retrieval capabilities, forming a multi-agent reasoning system called Structure-oriented Autonomous Reasoning Agents (SARA). Extensive experiments show that SARA significantly improves zero-shot performance on knowledge-intensive and mathematical tasks. Remarkably, our approach makes a general LLM competitive with dedicated reasoning models in several benchmarks and demonstrates strong robustness against corrupted reasoning paths.
