When training large language models (LLMs), it is common practice to track downstream task performance throughout training and select the checkpoint with the highest score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores fluctuate frequently throughout training, at both the aggregate and the example level. To address this instability, we investigate two post-hoc checkpoint integration methods, checkpoint averaging and checkpoint ensembling, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve the stability of downstream performance without requiring any changes to the training procedure.
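To make the checkpoint-averaging idea concrete, the following is a minimal sketch of post-hoc weight averaging over neighboring checkpoints, assuming PyTorch `state_dict` files; the function name and file paths are hypothetical and do not come from the paper.

```python
import torch


def average_checkpoints(checkpoint_paths):
    """Average the parameters of several neighboring checkpoints (hypothetical helper)."""
    avg_state = None
    n = len(checkpoint_paths)
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # Initialize the running sum with a float copy of the first checkpoint.
            avg_state = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # Divide by the number of checkpoints to obtain the mean weights.
    return {k: v / n for k, v in avg_state.items()}


# Example usage with three neighboring checkpoints (hypothetical paths):
# averaged = average_checkpoints(["step_1000.pt", "step_1100.pt", "step_1200.pt"])
# model.load_state_dict(averaged)
```

Checkpoint ensembling, by contrast, would keep the neighboring checkpoints separate and combine their predictions (e.g., by averaging output probabilities) at evaluation time rather than merging their weights.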