Learning high-quality sentence embeddings from Natural Language Inference (NLI) data is often challenged by the tension between discrete labels and the continuous spectrum of semantic similarity, as well as by the information loss incurred when neutral pairs are discarded. To address this, we introduce Rank-Awareness and Angular Optimization Embeddings (RAOE), a framework that leverages the full NLI dataset (Entailment, Neutral, Contradiction) augmented with pre-computed continuous similarity scores (S). RAOE employs a novel composite objective that features: (1) a Rank Margin objective, which enforces rank consistency against S using an explicit margin, and (2) a Gated Angular objective, which conditionally refines embedding geometry based on agreement between the NLI label (L) and the S score. Extensive evaluations on STS tasks and the MTEB benchmark demonstrate RAOE's effectiveness. Our general-purpose RAOE-S1 model (BERT-base) significantly outperforms strong baselines, achieving an average Spearman's correlation of 85.11 (vs. SimCSE's 81.57 and AnglE's 82.43), and shows consistent improvements on MTEB. Further STS-specialized fine-tuning (RAOE-S2) establishes new state-of-the-art performance on STS (88.17 with BERT-base). These results confirm RAOE's ability to learn robust and nuanced sentence representations through the synergy of rank-awareness and conditional angular constraints. Code is available at https://anonymous.4open.science/r/RAOE-2B52.
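To make the two components of the composite objective concrete, the sketch below gives a minimal, illustrative interpretation of a rank-margin term and a label/score-gated term. This is an assumption-laden simplification, not the paper's implementation: the function names, the label encoding, the gating thresholds `hi`/`lo`, and the use of a squared-error penalty in place of the paper's angular formulation are all hypothetical choices made for clarity.

```python
# Illustrative sketch only -- not the official RAOE loss. All names,
# thresholds, and the simplified penalty are assumptions for exposition.

def rank_margin_loss(pred_sims, target_scores, margin=0.1):
    """Penalize pairs whose predicted-similarity ordering violates the
    pre-computed score ordering by less than `margin` (hinge-style)."""
    loss, count = 0.0, 0
    for i in range(len(pred_sims)):
        for j in range(len(pred_sims)):
            if target_scores[i] > target_scores[j]:
                # Require pred_sims[i] to exceed pred_sims[j] by >= margin.
                loss += max(0.0, margin - (pred_sims[i] - pred_sims[j]))
                count += 1
    return loss / max(count, 1)


def gated_angular_loss(pred_sims, labels, scores, hi=0.7, lo=0.3):
    """Apply a geometric refinement term only where the NLI label and the
    continuous score S agree. Labels: 0=entailment, 1=neutral,
    2=contradiction (illustrative encoding). A squared-error pull toward
    +1/-1 stands in for the angular objective here."""
    total, count = 0.0, 0
    for s_pred, lab, s in zip(pred_sims, labels, scores):
        agree = (lab == 0 and s >= hi) or (lab == 2 and s <= lo)
        if agree:
            target = 1.0 if lab == 0 else -1.0
            total += (s_pred - target) ** 2
            count += 1
    return total / max(count, 1)
```

Under this reading, the rank term shapes the global ordering of similarities while the gated term only sharpens the geometry of pairs whose discrete label and continuous score tell a consistent story, leaving ambiguous (e.g., neutral) pairs untouched by the second term.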