China

Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols (+, times, etc, as in &quot;twenty + three&quot;). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.

EMNLP 2025

Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles

variable binding

multilingual number systems

mathematical reasoning

artificial intelligence

large language models

Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols (+, times, etc, as in "twenty + three"). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Today, Large language models (LLMs) are reshaping the norms of human communication, sometimes decoupling words from genuine human thought. This transformation is deep, and undermines the trust and interpretive norms that were historically tied to authorship. We draw from linguistic philosophy and AI ethics to detail how large-scale text generation can induce semantic drift, erode accountability, and obfuscate intent and authorship. Our work here introduces conceptual frameworks including hybrid authorship graphs (modeling humans, LLMs, and texts in a provenance network), epistemic doppelgängers (LLM-generated texts that are indistinguishable from human-authored texts), and authorship entropy. We explore mechanisms such as “proof-of-interaction” authorship verification and educational reforms to restore confidence in language. While LLMs' benefits are undeniable (broader access, increased fluency, automation, etc.), the upheavals they introduce to the linguistic landscape demand reckoning. This paper provides a conceptual lens to chart these changes.

Large Language Models Threaten Language's Epistemic and Communicative Foundations

Variation is inherent in opinion-based annotation tasks, such as sentiment or hate speech analysis. It does not only arise from errors, fatigue, or sentence ambiguity, but also from genuine differences in opinion shaped by background, experience, and culture. In this paper, we show how annotators' confidence ratings can be of great use for disentangling subjective variation from uncertainty, and how they can be approximated by behavioral gaze features. We showcase the utilization of our approach through a hate speech detection task, showing that models are affected differently by instances of uncertainty and subjectivity. We demonstrate that human gaze patterns offer valuable indicators of subjective variation and uncertainty.

Disentangling Subjectivity and Uncertainty for Hate Speech Annotation and Modeling using Gaze

Transformers often struggle to generalize to longer sequences than those seen during training - a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing either fixed linear biases or globally learned biases, which lack the capacity to adapt to different input contexts. In this work, we propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE), a method that learns token-specific, context-aware biases for each attention head in transformers. By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs. When evaluated on sequences longer than originally trained with, GPT-2 Medium (334M parameters) with CABLE achieves lower perplexity than counterparts using other widely adopted positional encoding methods. Additionally, by applying CABLE to the BERT base model we improved performance in long-context retrieval tasks. Our method significantly enhances the extrapolation performance of existing RPE methods tested on the FineWeb-Edu-10B and WikiText-103 datasets.

Context-aware Biases for Length Extrapolation

Distinguishing between rhetorical questions and informational questions is a challenging task, as many rhetorical questions have similar surface forms to informational questions. Existing datasets, however, do not contain many questions that can be rhetorical or informational in different contexts. We introduce Studying Rhetorically Ambiguous Questions, SRAQ, a new dataset explicitly constructed to support the study of such rhetorical ambiguity. The questions in SRAQ can be interpreted as either rhetorical or informational depending on the context. We evaluate the performance of state-of-the-art language models on this dataset and find that they struggle to recognize many rhetorical questions.

Studying Rhetorically Ambiguous Questions

Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences but faces efficiency challenges. We explore two approaches leveraging human gaze prediction to enhance RLHF: (1) gaze-aware reward models and (2) gaze-based distribution of sparse rewards at token level. Our experiments show gaze-informed RLHF achieves faster convergence while maintaining or slightly improving performance, reducing computational requirements during policy optimization. Human visual attention patterns provide valuable signals for policy training, suggesting a promising direction for improving RLHF efficiency through human-like attention mechanisms.

Enhancing RLHF with Human Gaze Modeling

Editing documents and PDFs using natural language instructions is desirable for many reasons – ease of use, increasing accessibility to non-technical users, and for creativity. To do this automatically, a system needs to first understand the user’s intent and convert this to an executable plan or command, and then the system needs to identify or localize the elements that the user desires to edit. While there exist methods that can accomplish these tasks, a major bottleneck in these systems is the inability to ground the spatial edit location effectively. We address this gap through our proposed system, DELOC (Document Element LOCalizer). DELOC adapts the grounding capabilities of existing Multimodal Large Language Model (MLLM) from natural images to PDFs. This adaptation involves two novel contributions: 1) synthetically generating PDF-grounding instruction tuning data from partially annotated datasets; and 2) synthetic data cleaning via Code-NLI, an NLI-inspired process to clean data using generated Python code. The effectiveness of DELOC is apparent in the >3x zero-shot improvement it achieves over the next best Multimodal LLM, GPT-4o.

DELOC: Document Element Localizer

Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.

ComicScene154: A Scene Dataset for Comic Analysis

The Consensual Assessment Technique (CAT) evaluates creativity through holistic expert judgments. We investigate the use of two advanced Large Language Models (LLMs), Claude-3-Opus and GPT-4o, to evaluate poetry by a methodology inspired by the CAT. Using a dataset of 90 poems, we found that these LLMs can surpass the results achieved by non-expert human judges at matching a ground truth based on publication venue, particularly when assessing smaller subsets of poems. Claude-3-Opus exhibited slightly superior performance than GPT-4o. We show that LLMs are viable tools for accurately assessing poetry, paving the way for their broader application into other creative domains.

Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel, test-time approach to detecting model hallucination through systematic analysis of information flow across model layers. We target cases when LLMs process inputs with ambiguous or insufficient context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics (mathcalLI) provides robust indicators of model reliability, accounting for both information gain and loss during computation. I improves model reliability by immediately integrating with universal LLMs without additional training or architectural modifications.

Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions

Human moderators in online discussions face a heterogeneous range of tasks, which go beyond content moderation, or policing. They also support and improve discussion quality, which is challenging to model (and evaluate) in NLP due to its inherent subjectivity and the scarcity of annotated resources. We address this gap by introducing PerspectiveMod, a dataset of online comments annotated for the question: *"Does this comment require moderation, and why?"* Annotations were collected from both expert moderators and trained non-experts. **PerspectiveMod** is unique in its intentional variation across (a) the level of moderation experience embedded in the source data (professional vs. non-professional moderation environments), (b) the annotator profiles (experts vs. trained crowdworkers), and (c) the richness of each moderation judgment, both in terms on fine-grained comment properties (drawn from argumentation and deliberative theory) and in the representation of the individuality of the annotator (socio-demographics and attitudes towards the task). We advance understanding of the task's complexity by providing interpretation layers that account for its subjectivity. Our statistical analysis highlights the value of collecting annotator perspectives, including their experiences, attitudes, and views on AI, as a foundation for developing more context-aware and interpretively robust moderation tools.

Downloads

Next from EMNLP 2025

Large Language Models Threaten Language's Epistemic and Communicative Foundations

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES