China

Training robust retriever and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection, which reduces the training set size by 2.35×, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B embedding models by 0.7 to 1.4 nDCG@10 on BEIR and by 1.7 to 1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o mini. All datasets and code will be released upon publication.
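The abstract does not spell out how the cascade is wired, so the following is only a minimal sketch under one plausible reading: a cheaper judge screens every hard negative, and only the passages it flags are passed to a stronger judge for confirmation. The names `relabel_hard_negatives`, `cheap_judge`, and `strong_judge` are hypothetical, and the keyword-overlap judges are toy stand-ins for LLM prompts, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (query, passage) -> True if the passage looks relevant.
Judge = Callable[[str, str], bool]


def relabel_hard_negatives(
    query: str,
    negatives: List[str],
    cheap_judge: Judge,
    strong_judge: Judge,
) -> Tuple[List[str], List[str]]:
    """Split hard negatives into (kept_negatives, relabeled_positives)."""
    kept: List[str] = []
    relabeled: List[str] = []
    for passage in negatives:
        # Stage 1: the cheap judge screens every hard negative.
        if not cheap_judge(query, passage):
            kept.append(passage)
            continue
        # Stage 2: only flagged passages reach the stronger, costlier judge.
        if strong_judge(query, passage):
            relabeled.append(passage)  # treated as a false negative
        else:
            kept.append(passage)
    return kept, relabeled


# Toy stand-ins for the two LLM judges (assumptions for illustration only).
def cheap_judge(query: str, passage: str) -> bool:
    # Flags a passage if it shares any query word.
    return any(word in passage.lower() for word in query.lower().split())


def strong_judge(query: str, passage: str) -> bool:
    # Confirms a passage only if it contains every query word.
    return all(word in passage.lower() for word in query.lower().split())
```

In this arrangement the stronger judge is called only on the subset the cheaper judge flags, which is where a cascade would save cost relative to running the stronger judge on every negative.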

EMNLP 2025

Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs


poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Event argument extraction identifies arguments for predefined event roles in text. Traditional evaluations rely on exact match (EM), requiring predicted arguments to match annotated spans exactly. However, this approach fails for generative models like large language models (LLMs), which produce diverse yet semantically accurate responses. EM underestimates performance by disregarding valid variations, implicit arguments (unstated but inferable), and scattered arguments (distributed across a document). To bridge this gap, we introduce Reliable Evaluation framework for Generative event argument extraction (REGen), a framework that better aligns with human judgment. Across six datasets, REGen improves performance by an average of 23.93 F1 points over EM. Human validation further confirms REGen’s effectiveness, achieving 87.67% alignment with human assessments of argument correctness.

REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction

The accurate trust assessment of large language models (LLMs), which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), an input sampling technique for multimodal models, which generates an uncertainty measure based on the equivalent and complementary input sampling. The sampling approach expands the input space to measure the consistency (through equivalent samples) and sensitivity (through complementary samples) properties of the model. These two uncertainty measures are combined to form the final FESTA estimate. Our approach only requires black-box access, and is unsupervised. The experiments are conducted with various off-the-shelf multi-modal LLMs, on visual and audio reasoning tasks. The proposed FESTA approach is shown to significantly improve (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) the area-under-receiver-operating-curve (AUROC) metric on these reasoning tasks.

FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs

There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances is still underexplored and requires more focused attention. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluate five LLMs on their ability to assess toxicity, measuring multilingual, dialectal, and LLM-human consistency. Our findings show that LLMs are sensitive to both dialectal shifts and low-resource multilingual variation, though the most persistent challenge remains aligning their predictions with human judgments.

Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties

This paper introduces a novel method for testing the components of theories of (dialogue) coherence through utterance substitution. The method is described and then applied to Inference Anchoring Theory (IAT) in a large scale experimental study with 933 dialogue snippets and 87 annotators. IAT has been used for substantial corpus annotation and practical applications. To address the aim of finding out if and to what extent two aspects of IAT -- illocutionary acts and propositional relations -- contribute to dialogue coherence, we designed an experiment for systematically comparing the coherence ratings for several variants of short debate snippets. The comparison is between original human-human debate snippets, snippets generated with an IAT-compliant algorithm and snippets produced with ablated versions of the algorithm. This allows us to systematically compare snippets that have identical underlying structures as well as IAT-deficient structures with each other. We found that propositional relations do impact on dialogue coherence (at a statistically highly significant level) whereas we found no such effect for illocutionary act expression. This result suggests that fine-grained inferential relations impact on dialogue coherence, complementing the higher-level coherence structures of, for instance, Rhetorical Structure Theory.

Coherence of Argumentative Dialogue Snippets: A New Method for Large Scale Evaluation with an Application to Inference Anchoring Theory

Negation reasoning remains a challenge for large language models (LLMs), often causing incorrect interpretations of negated statements. In this study, we analyze various LLMs for their handling of negation and propose two genres of prompts, Warning-based and Persona-based, which improve overall accuracy by up to 3.17% and distractor negation accuracy by up to 25.14% over the most competitive baselines. Next, we assess the robustness of LLMs by reordering prompts while preserving meaning, observing instability linked to positional encoding schemes. Further, we introduce a negative token attention score (NTAS) to quantify attention to negation words. From the comprehensive analysis, we point out that within a specific LLM family, the performance of a model (measured using accuracy) correlates more with NTAS than with model size.

This is not a Disimprovement: Improving Negation Reasoning in Large Language Models via Prompt Engineering

Developing more data-efficient training approaches depends on a better understanding of inductive biases. In this work, we hypothesize that the structural information encoded in a transformer's attention matrices is key to acquiring syntax because attention captures relationships between words -- a crucial part of syntax. Under this hypothesis, we would expect that inductive biases targeting attention should selectively improve data-efficiency on syntactic benchmarks. We use knowledge distillation (KD) as a methodological lens to test this hypothesis, comparing conventional KD through output logits against KD through attention matrices. Using GPT-2 as our teacher model, we train student models on datasets ranging from 10K to 5M sentences and evaluate them on both syntactic benchmarks and general language modeling tasks. Surprisingly, we find that while logit-based KD drastically improves data-efficiency across all metrics, attention-based KD offers minimal benefits even for syntactic tasks. This suggests that logits already effectively supervise syntactic information, challenging assumptions about how syntax is represented in transformers and informing more targeted approaches to data-efficient training.

Evaluating distillation methods for data-efficient syntax learning

Large language models (LLMs) have been used to synthesize persuasive dialogues for studying persuasive behavior. However, existing approaches often suffer from issues such as stance oscillation and low informativeness. To address these challenges, we propose reinforced instructional prompting, a method that ensures speaker characteristics consistently guide all stages of dialogue generation. We further introduce multilingual prompting, which aligns language use with speakers’ native languages to better capture cultural nuances. Our experiments involving speakers from eight countries show that continually reinforcing speaker profiles and cultural context improves argument diversity, enhances informativeness, and stabilizes speaker stances. Moreover, our analysis of inter-group versus intra-group persuasion reveals that speakers engaging within their own cultural groups employ more varied persuasive strategies than in cross-cultural interactions. These findings underscore the importance of speaker and cultural awareness in LLM-based persuasion modeling and suggest new directions for developing more personalized, ethically grounded, and culturally adaptive LLM-generated dialogues.

Enhancing LLM-Based Persuasion Simulations with Cultural and Speaker-Specific Information

Large Language Models (LLMs) have shown impressive capabilities across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they struggle to categorize nuanced texts. One such example is text revision, which involves nuanced changes between pairs of texts. While simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are expensive and scarce. To address this issue, we introduce a plug-and-play parameter-efficient fine-tuning (PEFT) framework, named IR-Tuning, which fine-tunes only a subset of important LLM layers while freezing the redundant ones. IR-Tuning improves fine-tuning convergence, reduces memory consumption, and is effective for small corpora. Experiments suggest that our proposed method can surpass multiple PEFT baselines over diverse revisions.

Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction

Despite recent advances in Reasoning Language Models (RLMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: *Is English the most efficient language for reasoning?* We evaluate three open-source RLMs: DeepSeek R1, Qwen 2.5, and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages consistently reduces token usage, often without sacrificing accuracy. These gains persist after translation into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model’s multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations.

EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a *post-hoc* analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Meta (previously known as Facebook) advertisements. Our analysis focuses on two key aspects: **demographic targeting** and **fairness**. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that ***young adults*** are primarily targeted through messages emphasizing *activism and environmental consciousness*, while **women** are engaged through themes related to *caregiving roles and social advocacy*. Additionally, we conduct a comprehensive fairness analysis to uncover biases in model predictions. We assess disparities in accuracy and error rates across demographic groups using established fairness metrics such as Demographic Parity, Equal Opportunity, and Predictive Equality. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of **male** audiences. The analysis of thematic explanations uncovers recurring patterns in messaging strategies tailored to various demographic groups, while the fairness analysis underscores the need for more inclusive targeting methods. This study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.
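The three fairness metrics named in the abstract above have standard definitions that can be sketched briefly: Demographic Parity compares the rate of positive predictions across groups, Equal Opportunity compares true positive rates, and Predictive Equality compares false positive rates. The sketch below assumes binary labels and predictions (0 or 1); the helper names `group_rates` and `max_gap` are hypothetical and this is not the study's own evaluation code.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def group_rates(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    groups: Sequence[Hashable],
) -> Dict[Hashable, Dict[str, float]]:
    """Per-group positive prediction rate, true positive rate, and false positive rate."""
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0}
    )
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # Demographic Parity compares this rate across groups.
            "positive_rate": c["pred_pos"] / c["n"],
            # Equal Opportunity compares true positive rates (NaN if no positives).
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            # Predictive Equality compares false positive rates (NaN if no negatives).
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates


def max_gap(rates: Dict[Hashable, Dict[str, float]], key: str) -> float:
    """Largest difference in one rate between any two groups (0.0 means parity)."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)
```

A gap of 0.0 on a given rate means the groups are treated identically on that criterion; larger gaps indicate a disparity worth inspecting, for instance between male and female audiences.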
