Understanding word meanings in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence of truly grasping word meanings remains underexplored. In this paper, we address this gap by evaluating the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to that of state-of-the-art systems specifically designed for the task. Notably, we find that leading models such as GPT-4o and DeepSeek-V3 reach performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of ambiguity. We further assess the top-performing model, GPT-4o, across three generative settings: definition generation, free explanation, and example generation. Our results reveal that GPT-4o consistently achieves over 90% accuracy, with the highest performance observed when the model is allowed to freely explain the meaning of target words in context. We release our code and data at: anonimizedurl.