China

Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multi**mo**dal **Ment**al **S**tates), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.

EMNLP 2025

MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind

social intelligence

theory of mind

video understanding

multimodality

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Knowledge editing is a promising way to improve factuality in large language models, but recent studies have shown significant model degradation during sequential editing. In this paper, we formalize the popular locate-then-edit methods as a two-step fine-tuning process, allowing us to precisely identify the root cause of this degradation. We show that model degradation occurs due to (1) over-optimization of internal activations and (2) continuous norm-growth of edited matrices. To mitigate these issues, we introduce two regularization techniques: (1) Most-Probable Early Stopping (MPES) and (2) explicit Frobenius norm-constraint. We demonstrate that applying these simple yet effective regularization techniques at key points in the editing process can substantially mitigate model degradation. Combining these regularization methods enables scaling locate-then-edit methods to 10,000 edits while reducing editing time by 42-61%. These results show that targeted regularization is essential for lifelong knowledge editing.

Lifelong Knowledge Editing requires Better Regularization
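As a rough illustration of what a Frobenius norm constraint on an edited matrix could look like, the sketch below rescales a matrix whenever its Frobenius norm exceeds a chosen bound. This is only one possible reading: the abstract does not say whether the constraint is a hard projection like this or a penalty term in the objective, and the names `clip_to_frobenius_ball` and `max_norm` are hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def clip_to_frobenius_ball(matrix, max_norm):
    """Return a copy of `matrix` whose Frobenius norm is at most `max_norm`.

    Assumption: a hard projection onto the norm ball. The paper's actual
    constraint may be formulated differently.
    """
    norm = frobenius_norm(matrix)
    if norm <= max_norm or norm == 0.0:
        return [row[:] for row in matrix]
    scale = max_norm / norm
    return [[x * scale for x in row] for row in matrix]


# Toy "edited matrix" with Frobenius norm 5 (a 3-4-5 triangle).
update = [[3.0, 0.0], [0.0, 4.0]]
clipped = clip_to_frobenius_ball(update, max_norm=2.5)
```

Applied after every edit, a bound of this kind would keep the norm from growing without limit across a long sequence of edits, which is the failure mode the abstract describes.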

When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.

Instability in Downstream Task Performance During LLM Pretraining
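Checkpoint averaging, as mentioned in the abstract above, can be sketched as an element-wise mean over the parameters of neighboring checkpoints. The example below is a minimal sketch that assumes each checkpoint is a plain dictionary mapping parameter names to flat lists of floats; the function name and the toy values are hypothetical, and the study's actual implementation and window size are not stated here.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share keys and shapes.

    Assumption: each checkpoint maps a parameter name to a flat list of
    floats. Real checkpoints would hold tensors instead.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Three hypothetical neighboring checkpoints.
neighbors = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
merged = average_checkpoints(neighbors)
```

Ensembling, the other method named in the abstract, would instead average the models' output predictions rather than their parameters.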

The rise of LoRA-sharing communities lets users enjoy powerful, efficient, and personalized LLMs by simply downloading small and pluggable LoRAs. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can distribute malicious LoRAs to a community eager to try out shared assets. Despite the high-risk potential, no prior art has comprehensively explored LoRA's attack surface under the downstream-enhancing share-and-play context. In this paper, we investigate how backdoors can be injected into task-enhancing LoRAs and examine the mechanisms of such infections. We find that with a simple, efficient, yet specific recipe, **a backdoor LoRA can be trained once and then seamlessly merged (in a transferable/training-free fashion) with multiple task-enhancing LoRAs, retaining both its malicious backdoor and benign downstream capabilities.** This allows attackers to scale the distribution of compromised LoRAs with minimal effort by leveraging the rich pool of existing shared LoRA assets. We note that such merged LoRAs are particularly infectious — because their malicious intent is cleverly concealed behind improved downstream capabilities, creating a strong incentive for voluntary download — and dangerous — because under local deployment, no safety measures exist to intervene when things go wrong. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs, highlighting the urgent need for heightened security awareness in the LoRA ecosystem. **Warning: This paper contains offensive content and involves a real-life tragedy.**

LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities in cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed in different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

Persona prompting is increasingly used in large language models (LLMs) to simulate the attitudes, values, and perspectives of various sociodemographic groups. However, different persona prompting strategies can significantly affect outcomes, raising concerns about the representativeness of such simulations. We systematically examine how different strategies for persona prompting, specifically role adoption formats and demographic priming strategies, influence LLM behavior across diverse identity groups. We evaluate five open-source LLMs for simulating 15 intersectional demographic groups across both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, exhibiting more stereotypes and lower alignment. However, prompting in an interview-style format and name-based priming consistently improve representativeness, and yield more diverse outputs. Surprisingly, larger models like Llama-3.3-70B perform worse than smaller ones, with OLMo-2-7B achieving the best results. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

The Spiral of Silence (SoS) theory posits that, in human societies, fear of social isolation drives individuals holding a minority opinion to quieten down, allowing the majority opinion to dominate public discourse. When agents are large language models (LLMs) rather than humans, the classic affective explanation no longer applies because language models do not have emotions or social anxiety. This raises a fundamental question: can purely statistical language generation mechanisms give rise to SoS dynamics in collectives of LLM agents? We introduce an evaluation framework based on rating sequences and design four controlled experimental conditions by varying the presence of persona configurations and historical interaction signals. To measure opinion dynamics, we employ concentration metrics, including interquartile range and kurtosis, along with trend analysis methods such as the Mann-Kendall test and the Spearman rank correlation coefficient. We experiment on six widely used open-source models: DeepSeek-V2-Lite-Chat, Llama-3.1-8B-Instruct, Mistral-8B-Instruct-2410, and the Qwen-2.5-Instruct series (1.5B, 3B, 7B), covering cross-family comparisons at a similar scale and within-family scaling analyses for Qwen, as well as one closed-source model, GPT-4o-mini. The results indicate that (i) most of the models show a strong default bias in the absence of social signals; (ii) persona introduces opinion heterogeneity, while history exerts an anchoring force; and (iii) when both signals are combined, self-reinforcing dominance of the majority opinion appears much more frequently than in the other test cases, despite the agents' lack of affect. These findings challenge traditional affect-based explanations of SoS, provide empirical evidence for understanding and mitigating opinion convergence in LLM-based agent systems, and offer a conceptual link between computational sociology and the design of responsible artificial intelligence systems.

Spiral of Silence in Large Language Model Agents
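Two of the measures named in the abstract above, the interquartile range and the Mann-Kendall trend statistic, can be computed on a rating sequence roughly as follows. This is a minimal sketch using only the Python standard library: it computes the raw Mann-Kendall S statistic without the variance correction or significance test a full analysis would need, it omits kurtosis and Spearman correlation, and the function names and example ratings are hypothetical.

```python
import statistics


def interquartile_range(ratings):
    """Q3 - Q1, using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(ratings, n=4)
    return q3 - q1


def mann_kendall_s(ratings):
    """Mann-Kendall S: sum of sign(x_j - x_i) over all pairs with i < j.

    A positive S suggests an upward trend, a negative S a downward trend.
    """
    s = 0
    for i in range(len(ratings)):
        for j in range(i + 1, len(ratings)):
            diff = ratings[j] - ratings[i]
            s += (diff > 0) - (diff < 0)
    return s


# Hypothetical sequence of ratings produced by successive agents.
ratings = [5, 4, 4, 3, 3, 2]
trend = mann_kendall_s(ratings)
spread = interquartile_range(ratings)
```

A narrowing interquartile range over time would indicate ratings concentrating around one value, while the sign of S indicates the direction in which the sequence drifts.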

Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have demonstrated impressive capabilities in understanding and interacting with operating system environments. However, despite their strong task performance, these models often exhibit hallucinations—systematic errors in action prediction that compromise reliability. In this study, we conduct a comprehensive analysis of the hallucinatory behaviors exhibited by GUI agent models in an icon localization task. We introduce a novel evaluation framework that moves beyond traditional accuracy-based metrics by categorizing model predictions into four distinct types: correct predictions, biased hallucinations, misleading hallucinations, and confusing hallucinations. This fine-grained classification provides deeper insights into model failure modes. Furthermore, we investigate the distribution of output logits corresponding to different response types and reveal key deviations from the behavior observed in traditional classification tasks. To support this analysis, we propose a new metric derived from the structural characteristics of the logits distribution, offering a fresh perspective on model confidence and uncertainty in GUI interaction tasks.

Understanding GUI Agent Localization Biases through Logit Sharpness

This paper is the first investigation of the connection between emotion, embodiment, and everyday language in a large sample of natural language data. We created corpora of body part mentions (BPMs) in online English text (blog posts and tweets). This includes a subset featuring human annotations for the emotions of the person whose body part is mentioned in the text. We show that BPMs are common in personal narratives and tweets (~5% to 10% of posts include BPMs) and that their usage patterns vary markedly by time and location. Using word-emotion association lexicons and our annotated data, we show that text containing BPMs tends to be more emotionally charged, even when the BPM is not explicitly used to describe a physical reaction to the emotion in the text. Finally, we discover a strong and statistically significant correlation between body-related language and a variety of poorer health outcomes. In sum, we argue that investigating the role of body-part related words in language can open up valuable avenues of future research at the intersection of NLP, the affective sciences, and the study of human wellbeing.

The Language of Interoception: Examining Embodiment and Emotion Through a Corpus of Body Part Mentions

Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, TONEBANK and DEBATEMIX, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Emotional reasoning is essential for improving human-AI interactions, particularly in mental health support and empathetic systems. However, current approaches, which primarily map sensory inputs to fixed emotion labels, fail to understand the intricate relationships between motivations, thoughts, and emotions, thereby limiting their ability to generalize across flexible emotional reasoning tasks. To address this, we propose a novel third-person appraisal agent that simulates human-like emotional reasoning through three phases: Primary Appraisal, Secondary Appraisal, and Reappraisal. In the Primary Appraisal phase, a third-person generator powered by a large language model (LLM) infers emotions based on cognitive appraisal theory. The Secondary Appraisal phase uses an evaluator LLM to provide feedback, guiding the generator in refining its predictions. The generator then uses counterfactual reasoning to adjust its process and explore alternative emotional responses. The Reappraisal phase utilizes reinforced fine-tuning (ReFT) by employing a reflective actor-critic framework to further enhance the model’s performance and generalization. This process uses reward signals and learns from appraisal trajectories without human annotations. Our approach outperforms baseline LLMs in various emotional reasoning tasks, demonstrating superior generalization and interpretability. To the best of our knowledge, this is the first cognition-based architecture designed to enhance emotional reasoning in LLMs, advancing AI towards human-like emotional understanding.
