China

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

EMNLP 2025

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

evaluation methodologies

multimodality

benchmarking

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Scientific fact-checking has largely focused on textual and tabular sources, neglecting scientific charts—a primary medium for conveying quantitative evidence and supporting statistical reasoning in research communication. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking grounded in real-world, expert-curated scientific charts. ClimateViz comprises 49,862 claims paired with 2,896 visualizations, each labeled as support, refute, or not enough information. To enable interpretable verification, each instance includes structured knowledge graph explanations that capture statistical patterns, temporal trends, spatial comparisons, and causal relations. We conduct a comprehensive evaluation of state-of-the-art multimodal large language models, including proprietary and open-source ones, under zero-shot and few-shot settings. Our results show that current models struggle to perform fact-checking when statistical reasoning over charts is required: even the best-performing systems, such as Gemini 2.5 and InternVL 2.5, achieve only 76.2–77.8% accuracy in label-only output settings, which is far below human performance (89.3% and 92.7%). While few-shot prompting yields limited improvements, explanation-augmented outputs significantly enhance performance in some closed-source models, notably o3 and Gemini 2.5.

ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Audio descriptions (ADs) are indispensable for blind or visually-impaired individuals (BVIs), enabling them to understand the narrative and appreciate the visual diversity of movies. There is an explosion of interest in automatic AD generation for trimmed clips and many new metrics have also been proposed. However, they typically compare single ground-truth ADs against their prediction. We posit that ADs should not be treated as independent captions and pivot to a video-level evaluation. We propose ADQA, a question-answering benchmark to evaluate whether the generated ADs would help BVIs appreciate and understand the story. We motivate the QA framework by quantifying the subjective nature of ADs through an alignment between two AD sources of the same movie. ADQA features visual appreciation (VA) questions about specific visual facts and narrative understanding (NU) questions created using plot sentences associated with videos. Evaluation of current AD generation methods show a large gap to human performance, estimated by using the second AD source. Based on our findings, we provide several recommendations for future work on AD generation.

What You See is What You Ask: Evaluating Audio Descriptions

Large Language Models (LLMs) are trained on massive web-crawled corpora, often containing personal information, copyrighted text, and benchmark datasets. This inadvertent inclusion in the training dataset, known as data leakage, poses significant risks and could compromise the safety of LLM outputs. Despite its criticality, existing studies do not examine how leaked instances in the pre-training data influence LLMs' output and detection capabilities. In this paper, we conduct an experimental survey to elucidate the relationship between data leakage in training datasets and its effects on the generation and detection by LLMs. Our experiments reveal that LLMs often generate outputs containing leaked information, even when there is little such data in the training dataset. Moreover, the fewer the leaked instances, the more difficult it becomes to detect such leakage. Finally, we demonstrate that enhancing leakage detection through few-shot learning can help mitigate the impact of the leakage rate in the training data on detection performance.

Investigating How Pre-training Data Leakage Affects Models' Reproduction and Detection Capabilities

Standardized benchmarks are central to evaluating and comparing model performance in Natural Language Processing (NLP). However, Large Language Models (LLMs) have exposed shortcomings in existing benchmarks, and so far there is no clear solution. In this paper, we survey a wide scope of benchmarking issues, and provide an overview of solutions as they are suggested in the literature. We observe that these solutions often tackle a limited number of issues, neglecting other facets. Therefore, we propose concrete checklists to cover all aspects of benchmarking issues, both for benchmark creation and usage. Additionally, we discuss the advantages of adding minimal-sized test-suites to benchmarking, ensuring downstream applicability on real-world use cases.

In Benchmarks We Trust ... Or Not?

We investigate the identification of idiomatic expressions—a semantically non-compositional subclass of multiword expressions (MWEs)—in running text using large language models (LLMs) without any fine-tuning. Instead, we adopt a prompt-based approach and evaluate a range of prompting strategies, including zero-shot, few-shot, and chain-of-thought variants, across multiple languages, datasets, and model types. Our experiments show that, with well-crafted prompts, LLMs can perform competitively with supervised models trained on annotated data. These findings highlight the potential of prompt-based LLMs as a flexible and effective alternative for idiomatic expression identification.

Easy as PIE? Identifying Multi-Word Expressions with LLMs

Alignment is the critical process of minimizing harmful outputs by teaching large language models (LLMs) to prefer safe, helpful and appropriate responses. While the majority of alignment research and datasets remain overwhelmingly English-centric, ensuring safety across diverse linguistic and cultural contexts requires localized resources. In this paper, we introduce the first Polish preference dataset PLLuM-Align, created entirely through human annotation to reflect Polish language and cultural nuances. The dataset includes response rating, ranking, and multi-turn dialog data. Designed to reflect the linguistic subtleties and cultural norms of Polish, this resource lays the groundwork for more aligned Polish LLMs and contributes to the broader goal of multilingual alignment in underrepresented languages.

PLLuM-Align: Polish Preference Dataset for Large Language Model Alignment

Large language models (LLMs) have achieved remarkable success across diverse domains. However, their potential as effective language teachers—particularly in complex pedagogical scenarios like teaching Chinese as a second language—remains inadequately assessed. To address this gap, we propose the first pedagogical competence benchmark for LLMs, rigorously evaluating their performance against international standards for Chinese language teachers. Our framework spans three core dimensions: (1) basic knowledge, covering 32 subtopics across five major categories (linguistics, Chinese culture, pedagogy, etc.); (2) international teacher examination, based on data collected from international Chinese teacher certification exams; and (3) teaching practice evaluation, where target LLMs summarize knowledge points and design instructional content for a student model, followed by testing the student model to assess the LLM’s ability to distill and teach key concepts. We conduct a comprehensive evaluation of 13 latest multilingual and Chinese LLMs. The results reveal that most existing models struggle to achieve a 60% overall score, highlighting significant room for improvement. This study contributes to the development of AI-assisted language education tools capable of rivaling human teaching excellence.

Can Large Language Models Be Good Language Teachers?

To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its own prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether models can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. We find a trade-off. When simply asked to generate counterfactual explanations, models typically produce SCEs that are valid, but far from minimal, despite this being a well-established property of good counterfactuals. Worryingly, when explicitly instructed to provide minimal counterfactual explanations, the resulting SCEs typically fail to change the models' predictions. No model is able to reliably satisfy both criteria. We examine why models are unable to do this task, arguing they do not engage in self-modelling, the ability to internally predict how they would behave in alternative situations. We argue this is unlikely to be incentivised by standard training techniques and suggest that new learning objectives are required for LLMs to reliably explain themselves counterfactually. Our code is available in the anonymous repository: https://anonymous.4open.science/r/SCEs-3747/README.md.

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks but remain prone to hallucinations. Detecting hallucinations is essential for safety-critical applications, and recent methods leverage attention map properties to this end, though their effectiveness remains limited. In this work, we investigate the spectral features of attention maps by interpreting them as adjacency matrices of graph structures. We propose the LapEigvals method, which utilises the top-k eigenvalues of the Laplacian matrix derived from the attention maps as an input to hallucination detection probes. Empirical evaluations demonstrate that our approach achieves state-of-the-art hallucination detection performance among attention-based methods. Extensive ablation studies further highlight the robustness and generalisation of LapEigvals, paving the way for future advancements in the hallucination detection domain.

Hallucination Detection in LLMs Using Spectral Features of Attention Maps

We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and proprietary models, as well as across model scales. A sample of the data is available at https://anonymous.4open.science/r/Eval-1B07/README.md. The full code and datasets will be released.

Downloads

Next from EMNLP 2025

ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES