China

Large language models (LLMs) have demonstrated impressive performance in machine translation, but still struggle with unseen low-resource languages, especially those written in underrepresented scripts. To investigate whether LLMs can translate such languages with the help of linguistic resources, we introduce Lotus, a benchmark designed to evaluate translation for Mongolian (in traditional script) and Yi. Our study shows that while linguistic resources can improve translation quality as measured by automatic metrics, LLMs remain limited in their ability to handle these languages effectively. We hope our work provides insights for the low-resource NLP community and fosters further progress in machine translation for underrepresented languages. Our code and data are available.

EMNLP 2025

Can Large Language Models Translate Unseen Languages in Underrepresented Scripts?

large language model

low-resource language

machine translation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Adversarial text defense is a significant strategy to protect modern NLP models from being attacked. Typical text defense methods usually enhance the model’s robustness by model retraining or equipping it with a data preprocessing step, aiming to eliminate the non-robust features and preserve the robust ones. Although some efforts have been made to recognize the robust features, e.g., by the information bottleneck (IB) technique, how to fully disentangle the robust and non-robust representation remains a big challenge. To alleviate this problem, we propose a novel text defense method, named Disentangled Information Bottleneck (DisIB), with two major merits. Firstly, we separate the robust features and non-robust features with a disentangled two-line framework rather than the one-line compression network in IB. This prevents the loss of robust features caused by information compression and produces complete robust features. Secondly, we design a discriminator network to approximate the minimum mutual information of the two lines, which sufficiently disentangles robust and non-robust features. To validate the effectiveness of our DisIB, we conduct a total of 96 defense experiments on four datasets by defending four popular attack methods. Experimental results elaborate that our method significantly outperforms six baselines, with accuracy improvements ranging from 3.8% to 20.7%.

Disentangled Information Bottleneck for Adversarial Text Defense

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Scientific fact-checking has largely focused on textual and tabular sources, neglecting scientific charts—a primary medium for conveying quantitative evidence and supporting statistical reasoning in research communication. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking grounded in real-world, expert-curated scientific charts. ClimateViz comprises 49,862 claims paired with 2,896 visualizations, each labeled as support, refute, or not enough information. To enable interpretable verification, each instance includes structured knowledge graph explanations that capture statistical patterns, temporal trends, spatial comparisons, and causal relations. We conduct a comprehensive evaluation of state-of-the-art multimodal large language models, including proprietary and open-source ones, under zero-shot and few-shot settings. Our results show that current models struggle to perform fact-checking when statistical reasoning over charts is required: even the best-performing systems, such as Gemini 2.5 and InternVL 2.5, achieve only 76.2–77.8% accuracy in label-only output settings, which is far below human performance (89.3% and 92.7%). While few-shot prompting yields limited improvements, explanation-augmented outputs significantly enhance performance in some closed-source models, notably o3 and Gemini 2.5.

ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Audio descriptions (ADs) are indispensable for blind or visually-impaired individuals (BVIs), enabling them to understand the narrative and appreciate the visual diversity of movies. There is an explosion of interest in automatic AD generation for trimmed clips and many new metrics have also been proposed. However, they typically compare single ground-truth ADs against their prediction. We posit that ADs should not be treated as independent captions and pivot to a video-level evaluation. We propose ADQA, a question-answering benchmark to evaluate whether the generated ADs would help BVIs appreciate and understand the story. We motivate the QA framework by quantifying the subjective nature of ADs through an alignment between two AD sources of the same movie. ADQA features visual appreciation (VA) questions about specific visual facts and narrative understanding (NU) questions created using plot sentences associated with videos. Evaluation of current AD generation methods show a large gap to human performance, estimated by using the second AD source. Based on our findings, we provide several recommendations for future work on AD generation.

What You See is What You Ask: Evaluating Audio Descriptions

Large Language Models (LLMs) are trained on massive web-crawled corpora, often containing personal information, copyrighted text, and benchmark datasets. This inadvertent inclusion in the training dataset, known as data leakage, poses significant risks and could compromise the safety of LLM outputs. Despite its criticality, existing studies do not examine how leaked instances in the pre-training data influence LLMs' output and detection capabilities. In this paper, we conduct an experimental survey to elucidate the relationship between data leakage in training datasets and its effects on the generation and detection by LLMs. Our experiments reveal that LLMs often generate outputs containing leaked information, even when there is little such data in the training dataset. Moreover, the fewer the leaked instances, the more difficult it becomes to detect such leakage. Finally, we demonstrate that enhancing leakage detection through few-shot learning can help mitigate the impact of the leakage rate in the training data on detection performance.

Investigating How Pre-training Data Leakage Affects Models' Reproduction and Detection Capabilities

Standardized benchmarks are central to evaluating and comparing model performance in Natural Language Processing (NLP). However, Large Language Models (LLMs) have exposed shortcomings in existing benchmarks, and so far there is no clear solution. In this paper, we survey a wide scope of benchmarking issues, and provide an overview of solutions as they are suggested in the literature. We observe that these solutions often tackle a limited number of issues, neglecting other facets. Therefore, we propose concrete checklists to cover all aspects of benchmarking issues, both for benchmark creation and usage. Additionally, we discuss the advantages of adding minimal-sized test-suites to benchmarking, ensuring downstream applicability on real-world use cases.

In Benchmarks We Trust ... Or Not?

We investigate the identification of idiomatic expressions—a semantically non-compositional subclass of multiword expressions (MWEs)—in running text using large language models (LLMs) without any fine-tuning. Instead, we adopt a prompt-based approach and evaluate a range of prompting strategies, including zero-shot, few-shot, and chain-of-thought variants, across multiple languages, datasets, and model types. Our experiments show that, with well-crafted prompts, LLMs can perform competitively with supervised models trained on annotated data. These findings highlight the potential of prompt-based LLMs as a flexible and effective alternative for idiomatic expression identification.

Easy as PIE? Identifying Multi-Word Expressions with LLMs

Alignment is the critical process of minimizing harmful outputs by teaching large language models (LLMs) to prefer safe, helpful and appropriate responses. While the majority of alignment research and datasets remain overwhelmingly English-centric, ensuring safety across diverse linguistic and cultural contexts requires localized resources. In this paper, we introduce the first Polish preference dataset PLLuM-Align, created entirely through human annotation to reflect Polish language and cultural nuances. The dataset includes response rating, ranking, and multi-turn dialog data. Designed to reflect the linguistic subtleties and cultural norms of Polish, this resource lays the groundwork for more aligned Polish LLMs and contributes to the broader goal of multilingual alignment in underrepresented languages.

PLLuM-Align: Polish Preference Dataset for Large Language Model Alignment

Large language models (LLMs) have achieved remarkable success across diverse domains. However, their potential as effective language teachers—particularly in complex pedagogical scenarios like teaching Chinese as a second language—remains inadequately assessed. To address this gap, we propose the first pedagogical competence benchmark for LLMs, rigorously evaluating their performance against international standards for Chinese language teachers. Our framework spans three core dimensions: (1) basic knowledge, covering 32 subtopics across five major categories (linguistics, Chinese culture, pedagogy, etc.); (2) international teacher examination, based on data collected from international Chinese teacher certification exams; and (3) teaching practice evaluation, where target LLMs summarize knowledge points and design instructional content for a student model, followed by testing the student model to assess the LLM’s ability to distill and teach key concepts. We conduct a comprehensive evaluation of 13 latest multilingual and Chinese LLMs. The results reveal that most existing models struggle to achieve a 60% overall score, highlighting significant room for improvement. This study contributes to the development of AI-assisted language education tools capable of rivaling human teaching excellence.

Can Large Language Models Be Good Language Teachers?

To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its own prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether models can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. We find a trade-off. When simply asked to generate counterfactual explanations, models typically produce SCEs that are valid, but far from minimal, despite this being a well-established property of good counterfactuals. Worryingly, when explicitly instructed to provide minimal counterfactual explanations, the resulting SCEs typically fail to change the models' predictions. No model is able to reliably satisfy both criteria. We examine why models are unable to do this task, arguing they do not engage in self-modelling, the ability to internally predict how they would behave in alternative situations. We argue this is unlikely to be incentivised by standard training techniques and suggest that new learning objectives are required for LLMs to reliably explain themselves counterfactually. Our code is available in the anonymous repository: https://anonymous.4open.science/r/SCEs-3747/README.md.

Downloads

Next from EMNLP 2025

Disentangled Information Bottleneck for Adversarial Text Defense

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES