China

The rapid advancements of Large Language models (LLMs) necessitate robust benchmarks. In this paper, we present AraEval, a pioneering and comprehensive evaluation suite specifically developed to assess the advanced knowledge, reasoning, truthfulness, and instruction following capabilities of foundation models within the Arabic context. AraEval includes a diverse set of evaluation tasks that test various dimensions of knowledge and reasoning, with a total of 24,378 samples. These tasks cover areas such as linguistic understanding, factual recall, logical inference, commonsense reasoning, mathematical problem-solving, and domain-specific expertise, ensuring that the evaluation goes beyond basic language comprehension. It covers multiple domains of knowledge, such as science, history, religion, and literature, ensuring that the LLMs are tested on a broad spectrum of topics relevant to Arabic-speaking contexts. AraEval is designed to facilitate comparisons across different foundation models, enabling LLM developers and users to benchmark performance effectively. In addition, it provides diagnostic insights to identify specific areas where models excel or struggle, guiding further development. Datasets and evaluation integration can be found at [https:028//redacted/for/anon/sub].

EMNLP 2025

AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models

generalization of nlp models

resources and evaluation

generation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Since the advent of large language models (LLMs), prompt engineering has been a crucial step for eliciting desired responses for various Natural Language Processing (NLP) tasks. However, prompt engineering remains an impediment for end users due to rapid advances in models, tasks, and associated best practices. To mitigate this, Automatic Prompt Optimization (APO) techniques have recently emerged that use various automated techniques to help improve the performance of LLMs on various tasks. In this paper, we present a comprehensive survey summarizing the current progress and remaining challenges in this field. We provide a formal definition of APO, a 5-part unifying framework, and then proceed to rigorously categorize all relevant works based on their salient features therein. We hope to spur further research guided by our framework.

A Systematic Survey of Automatic Prompt Optimization Techniques

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate evaluation protocols and reinforcement learning research.

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

The well-known rhetorical framework, ABT (And, But, Therefore), mirrors natural human cognition in structuring an argument's logical progression - apropos to academic communication. However, distilling the complexities of research into clear and concise prose requires careful sequencing of ideas and formulating clear connections between them. This presents a quiet inequitability for contributions from authors who struggle with English proficiency or academic writing conventions. We see this as impetus to introduce: Mondrian, a framework that identifies the key components of an abstract and reorients itself to properly reflect the ABT logical progression. The framework is composed of a deconstruction stage, reconstruction stage, and rephrasing. We introduce a novel metric for evaluating deviation from ABT structure, named EB-DTW, which accounts for both ordinality and a non-uniform distribution of importance in a sequence. Our overall approach aims to improve the comprehensibility of academic writing, particularly for non-native English speakers, along with a complementary metric. The effectiveness of Mondrian is tested with automatic metrics and extensive human evaluation, and demonstrated through impressive quantitative and qualitative results, with organization and overall coherence of an abstract improving by an average of 27.71% and 24.71%.

Mondrian: A Framework for Logical Abstract (Re)Structuring

Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations, however, no comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance. Adopting a "mechanisms vs. outcomes" framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.

Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations

As large language models (LLMs) become integral to code-related tasks, a central question emerges: do LLMs truly understand program execution semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model’s understanding of code execution semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. The transformations span syntactic edits, structural modifications, and algorithmic changes, covering a broad spectrum of semantic variation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning over execution semantics, highlighting fundamental limitations.

EquiBench: Benchmarking Large Language Models’ Reasoning about Program Semantics via Equivalence Checking

Large language models (LLMs) require continual knowledge updates to keep pace with the evolving world. While various model editing methods have been proposed, most face critical challenges in lifelong learning contexts due to two fundamental limitations: (1) Edit Overshooting - parameter updates intended for a specific fact spill over to unrelated regions, causing interference with previously retained knowledge; and (2) Knowledge Entanglement - polysemantic neurons' overlapping encoding of multiple concepts makes it difficult to isolate and edit a single fact. In this paper, we propose MicroEdit, a neuron-level editing method that performs minimal and controlled interventions within LLMs. By leveraging a sparse autoencoder (SAE), MicroEdit disentangles knowledge representations and activates only a minimal set of necessary neurons for precise parameter updates. This targeted design enables fine-grained control over the editing scope, effectively mitigating interference and preserving unrelated knowledge. Extensive experiments show that MicroEdit outperforms prior methods and robustly handles lifelong knowledge editing across QA and Hallucination settings on LLaMA and Mistral.

MicroEdit: Neuron-level Knowledge Disentanglement and Localization in Lifelong Model Editing

Understanding word meanings in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence that they truly grasp word meanings remains underexplored. In this paper, we address this gap by evaluating the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to state-of-the-art systems specifically designed for the task. Notably, we find that leading models such as GPT-4o and DeepSeek-V3 reach performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of ambiguity. We further assess the top-performing model, i.e. GPT-4o, across three generative settings: definition generation, free explanation and example generation. Our results reveal that GPT-4o consistently achieves over 90% accuracy, with the highest performance observed when the model is allowed to freely to explain the meaning of target words in context. We release our code and data at: anonimizedurl.

Do Large Language Models Understand Word Senses?

Question Answering (QA) on narrative text poses a unique challenge for current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to reveal that all \textit{n}-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weights models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://omitted.link.

LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

Relation extraction (RE) is a core task in natural language processing, crucial for semantic understanding, knowledge graph construction, and enhancing downstream applications. However, Arabic RE remains a challenging task due to the language’s rich morphology, orthographic ambiguity, syntactic complexity, and wide dialectal variation. To advance research in this area, we present the largest and most diverse Arabic RE corpus to date: over 33K sentences (approx550K tokens) annotated with approx15K relation triples using 40 relation types. All annotations were manually curated by expert annotators, achieving a 85.2\% Cohen's κ inter-annotator agreement, ensuring high reliability. We benchmark the dataset using both supervised models and in-context learning with LLMs. Supervised models obtain an F1 score of 92.89\% for relation extraction, while LLMs achieve 72.73\% F1 in joint entity and relation extraction. These results establish strong baselines and expose key challenges, paving the way for future work in Arabic RE.

mathrmWojoodRelatioⁿs: Arabic Relation Extraction Corpus and Modeling

Vision-language Models (VLMs), such as CLIP and SigLIP, have become the de facto standard for multimodal tasks, serving as essential building blocks for recent Multimodal Large Language Models, including LLaVA and PaliGemma. However, current evaluations for VLMs remain heavily anchored to ImageNet. In this paper, we question whether ImageNet’s coverage is still sufficiently challenging for modern VLMs, and investigate the impact of adding novel and varied concept categories, i.e., semantically grouped fine-grained synsets. To this end, we introduce Concept-pedia, a novel, large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts. Leveraging a language-agnostic, automatic annotation pipeline grounded in Wikipedia, Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated Visual Concept Recognition evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories. Our experiments show that current models, although excelling on ImageNet, struggle with Concept-10k. Not only do these findings highlight a persistent bias toward ImageNet-centric concepts, but they also underscore the urgent need for more representative benchmarks. By offering a broader and semantically richer testbed, Concept-10k aims to support the development of multimodal systems that better generalize to the complexities of real-world visual concepts.

Downloads

Next from EMNLP 2025

A Systematic Survey of Automatic Prompt Optimization Techniques

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

A Systematic Survey of Automatic Prompt Optimization Techniques

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads