China

We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. This resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model&#39;s ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.

EMNLP 2025

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

scientific texts

contrastive learning

embeddings

unsupervised learning

We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. This resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

In exploratory search, users often submit vague queries to investigate unfamiliar topics, but receive limited feedback about how the search engine understood their input. This leads to a self-reinforcing cycle of mismatched results and trial-and-error reformulation. To address this, we study the task of generating user-facing natural language query intent descriptions that surface what the system likely inferred the query to mean, based on post-retrieval evidence. We propose QUIDS, a method that leverages dual-space contrastive learning to isolate intent-relevant information while suppressing irrelevant content. QUIDS combines a dual-encoder representation space with a disentangling decoder that works together to produce concise and accurate intent descriptions. Enhanced by intent-driven hard negative sampling, the model significantly outperforms state-of-the-art baselines across ROUGE, BERTScore, and human/LLM evaluations. Our qualitative analysis confirms QUIDS' effectiveness in generating accurate intent descriptions for exploratory search. Our work contributes to improving the interaction between users and search engines by providing feedback to the user in exploratory search settings.\footnote{Our code is available at \url{https://anonymous.4open.science/r/QID-6E00/}}

QUIDS: Query Intent Description for Exploratory Search via Dual Space Modeling

While virtually all existing work on Automated Essay Scoring (AES) models an essay as a word sequence, we put forward the novel view that an essay can be modeled as a graph and subsequently propose GAT-AES, a graph-attention network approach to AES. A key advantage of a graph-based approach to AES is that it allows us to easily capture the interactions among essay traits directly. Experimental results show that GAT-AES has achieved the best multi-trait scoring results to date on the ASAP++ dataset.

Graph-Based Multi-Trait Essay Scoring

Medical decision-making often involves integrating knowledge from multiple clinical specialties, typically achieved through multidisciplinary teams. Inspired by this collaborative process, recent work has leveraged large language models (LLMs) in multi-agent collaboration frameworks to emulate expert teamwork. While these approaches improve reasoning through agent interaction, they are limited by static, pre-assigned roles, which hinder adaptability and dynamic knowledge integration. To address these limitations, we propose KAMAC, a Knowledge-driven Adaptive Multi-Agent Collaboration framework that enables LLM agents to dynamically form and expand expert teams based on the evolving diagnostic context. KAMAC begins with one or more expert agents and then conducts a knowledge-driven discussion to identify and fill knowledge gaps by recruiting additional specialists as needed. This supports flexible, scalable collaboration in complex clinical scenarios, with decisions finalized through reviewing updated agent comments. Experiments on two real-world medical benchmarks demonstrate that KAMAC significantly outperforms both single-agent and advanced multi-agent methods, particularly in complex clinical scenarios (i.e., cancer prognosis) requiring dynamic, cross-specialty expertise.

A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making

This work introduces Castle, the first framework for schema-only cascade update generation using large language models (LLMs). Despite recent advances in LLMs for Text2SQL code generation, existing approaches focus primarily on SELECT queries, neglecting the challenges of SQL update operations and their ripple effects. Traditional CASCADE UPDATE constraints are static and unsuitable for modern, denormalized databases, which demand dynamic, context-aware updates. Castle enables natural language instructions to trigger multi-column, causally consistent SQL UPDATE statements, without revealing table content to the model. By framing UPDATE SQL generation as a divide-and-conquer task with LLMs’ reasoning capacity, Castle can determine not only which columns must be directly updated, but also how those updates propagate through the schema, causing cascading updates — all via nested queries and substructures that ensure data confidentiality. We evaluate it on real-world causal update scenarios, demonstrating its ability to produce accurate SQL updates, and thereby highlighting the reasoning ability of LLMs in automated DBMS.

Castle: Causal Cascade Updates in Relational Databases with Large Language Models

Image Difference Captioning (IDC) aims to generate natural language descriptions that highlight subtle differences between two visually similar images. While recent advances leverage pre-trained vision-language models to align fine-grained visual differences with textual semantics, existing supervised approaches often overfit to dataset-specific language patterns and fail to capture accurate preferences on IDC, which often indicates fine-grained and context-aware distinctions. To address these limitations, we propose an adversarial direct preference optimization (ADPO) framework for IDC, which formulates IDC as a preference optimization problem under the Bradley-Terry-Luce model, directly aligning the captioning policy with pairwise difference preferences via Direct Preference Optimization (DPO). To model more accurate and diverse IDC preferences, we introduce an adversarially trained hard negative retriever that selects counterfactual captions, This results in a minimax optimization problem, which we solve via policy-gradient reinforcement learning, enabling the policy and retriever to improve jointly. Experiments on benchmark IDC datasets show that our approach outperforms existing baselines, especially in generating fine-grained and accurate difference descriptions.

Image Difference Captioning via Adversarial Preference Optimization

We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.

SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Fact verification systems typically assess whether a claim is supported by retrieved evidence, assuming that truthfulness depends solely on what is stated. However, many real-world claims are half-truths, factually correct yet misleading due to the omission of critical context. Existing models struggle with such cases, as they are not designed to reason about what is left unsaid. We introduce the task of half-truth detection, and propose PolitiFact-Hidden, a new benchmark with 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. To address this challenge, we present TRACER, a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. TRACER can be integrated into existing fact-checking pipelines and consistently improves performance across multiple strong baselines. Notably, it boosts Half-True classification F1 by up to 16 points, highlighting the importance of modeling omissions for trustworthy fact verification. We will release both the benchmark and code upon publication.

The Missing Parts: Augmenting Fact Verification with Half Truth Detection

Continuous instruction following closely mirrors real-world tasks by requiring models to solve sequences of interdependent steps, yet existing multi-step instruction datasets suffer from three key limitations: (1) lack of logical coherence across turns, (2) narrow topical breadth and depth, and (3) reliance on rigid templates or heavy manual effort. We introduce LoCt-Pipeline, a novel pipeline that leverages modern LLMs’ reasoning capabilities to assemble rich, topic-related single-instruction data into multi-turn dialogues, producing chains that are logically coherent, progressively deepen in content, and span diverse domains without fixed templates or extensive human annotation. We employed this pipeline to construct LoCt-Instruct to assess models’ problem-solving abilities. The generated chains serve as a testbed for benchmarking a variety of models, including reasoning-oriented architectures, instruction-tuned variants, and state-of-the-art closed-source LLMs on their capacity to follow and correctly respond to each step. Our results reveal a substantial performance gap between current LLMs and human solvers. These findings highlight the need for more robust continuous instruction following. We will provide our dataset and pipeline upon acceptance.

LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions

Developing a general-purpose system that can extract events with massive types is a long-standing target in Event Extraction (EE). In doing so, the basic challenge comes from the absence of an efficient and effective annotation framework to construct the corresponding datasets. In this paper, we propose an LLM-based collaborative annotation framework. Through collaboration among multiple LLMs and a subsequent voting process, it refines annotations of triggers from distant supervision and then carries out argument annotation. Finally, we create EEMT, the largest EE dataset to date, featuring over **200,000** samples, **3,465** event types, and **6,297** role types. Evaluation on human-annotated test set demonstrates that the proposed framework achieves the F1 scores of **90.1%** and **85.3%** for event detection and argument extraction, strongly validating its effectiveness. Besides, to alleviate the excessively long prompts caused by massive types, we propose an LLM-based Partitioning method for EE called LLM-PEE. It first recalls candidate event types and then splits them into multiple partitions for LLMs to extract. After fine-tuning on the EEMT training set, the distilled LLM-PEE with 7B parameters outperforms state-of-the-art methods by **5.4%** and **6.1%** in event detection and argument extraction. Besides, it also surpasses mainstream LLMs by **12.9%** on the unseen datasets, which strongly demonstrates the generalization capabilities of both the EEMT dataset and the LLM-PEE method.

Towards Event Extraction with Massive Types: LLM-based Collaborative Annotation and Partitioning Extraction

Entity Linking and Entity Disambiguation systems aim to link entity mentions to their corresponding entries, typically represented by descriptions within a predefined, static knowledge base. Current models assume that these knowledge bases are complete and up-to-date, rendering them incapable of handling entities not yet included therein. However, in an ever-evolving world, new entities emerge regularly, making these static resources insufficient for practical applications. To address this limitation, we introduce RAED, a model that retrieves external knowledge to improve factual grounding in entity descriptions. Using sources such as Wikipedia, RAED effectively disambiguates entities and bases their descriptions on factual information, reducing the dependence on parametric knowledge. Our experiments show that retrieval not only enhances overall description quality metrics, but also reduces hallucinations. Moreover, despite not relying on fixed entity inventories, RAED outperforms systems that require predefined candidate sets at inference time on Entity Disambiguation. Finally, we show that descriptions generated by RAED provide useful entity representations for downstream Entity Linking models, leading to improved performance in the extremely challenging Emerging Entity Linking task.

Downloads

Next from EMNLP 2025

QUIDS: Query Intent Description for Exploratory Search via Dual Space Modeling

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES