China


EMNLP 2025

Explaining Sources of Uncertainty in Automated Fact-Checking

Understanding sources of a model's uncertainty regarding its predictions is crucial for effective human-AI collaboration. Prior work proposes to use numerical uncertainty or hedges ("I'm not sure, but..."), which do not explain uncertainty arising from conflicting evidence, leaving users unable to resolve disagreements or rely on the output. We introduce CLUE (**C**onflict-and-Agreement-aware **L**anguage-model **U**ncertainty **E**xplanations), the first framework to generate natural language explanations of model uncertainty by: (i) identifying relationships between spans of text that expose claim-evidence or inter-evidence conflicts/agreements driving the model's predictive uncertainty in an unsupervised way; and (ii) generating explanations via prompting and attention steering to verbalize these critical interactions. Across three language models and two fact-checking datasets, we demonstrate that CLUE generates explanations that are more faithful to model uncertainty and more consistent with fact-checking decisions than prompting for explanation of uncertainty without span-interaction guidance. Human evaluators find our explanations more helpful, more informative, less redundant, and better logically aligned with the input than this prompting baseline. CLUE requires no fine-tuning or architectural changes, making it plug-and-play for any white-box language model. By explicitly linking uncertainty to evidence conflicts, it offers practical support for fact-checking and readily generalizes to other tasks that require reasoning over complex information.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.

Domain Pre-training Impact on Representations

For language models to generalize correctly to novel expressions, it is critical that they exploit compositional meanings when this is justified. Even if we don’t know what a “pelp” is, we can use our knowledge of numbers to understand that “ten pelps” makes more pelps than “two pelps”. Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. State-of-the-art generative transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. We evaluate sentence, knowledge-graph, and word embeddings and track compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code will be publicly available on GitHub upon acceptance.

Quantifying Compositionality of Classic and State-of-the-Art Embeddings
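The additive-generalization step described in the abstract above can be illustrated with a minimal sketch. It assumes attribute vectors are already available and reconstructs an unseen combination by summing them, then scores the reconstruction with L2 distance, cosine similarity, and nearest-neighbor retrieval. The toy vectors, the function names, and the choice of cosine similarity for retrieval are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def add_vectors(vectors: Sequence[Vector]) -> Vector:
    """Additive composition: element-wise sum of the attribute vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Vector, candidates: Dict[str, Vector]) -> str:
    """Return the candidate whose embedding is most cosine-similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Hypothetical toy data: attribute vectors and "true" embeddings of combinations.
attribute_vectors = {"ten": [1.0, 0.0], "two": [0.2, 0.0], "pelps": [0.0, 1.0]}
true_embeddings = {"ten pelps": [0.9, 1.1], "two pelps": [0.3, 0.9]}

# Reconstruct the held-out combination "ten pelps" from its parts.
reconstruction = add_vectors([attribute_vectors["ten"], attribute_vectors["pelps"]])

l2 = l2_distance(reconstruction, true_embeddings["ten pelps"])
cos = cosine(reconstruction, true_embeddings["ten pelps"])
retrieved = retrieve(reconstruction, true_embeddings)
```

A low L2 distance, a high cosine similarity, and a correct retrieval together suggest that linear composition holds for this combination; a failure of any of them would flag a case where it breaks down.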

Transformer-based models are highly vulnerable to adversarial attacks, where even small perturbations can cause significant misclassifications. This paper introduces *I-Guard*, a defense framework to increase the robustness of transformer-based models against adversarial perturbations. *I-Guard* leverages model interpretability to identify influential parameters responsible for adversarial misclassifications. By selectively fine-tuning a small fraction of model parameters, our approach effectively balances performance on both original and adversarial test sets. We conduct extensive experiments on English and code-mixed Hinglish datasets and demonstrate that *I-Guard* significantly improves model robustness. Furthermore, we demonstrate the transferability of *I-Guard* in handling other character-based perturbations.

I-GUARD: Interpretability-Guided Parameter Optimization for Adversarial Defense

Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of substructure-level contexts, e.g., ring systems, in understanding molecule structures, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 shows notable training efficiency, i.e., outperforming the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models with different tokenizations to further boost the generation performance.

Training Text-to-Molecule Models with Context-Aware Tokenization

Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multi**mo**dal **Ment**al **S**tates), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.

MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind

Knowledge editing is a promising way to improve factuality in large language models, but recent studies have shown significant model degradation during sequential editing. In this paper, we formalize the popular locate-then-edit methods as a two-step fine-tuning process, allowing us to precisely identify the root cause of this degradation. We show that model degradation occurs due to (1) over-optimization of internal activations and (2) continuous norm-growth of edited matrices. To mitigate these issues, we introduce two regularization techniques: (1) Most-Probable Early Stopping (MPES) and (2) explicit Frobenius norm-constraint. We demonstrate that applying these simple yet effective regularization techniques at key points in the editing process can substantially mitigate model degradation. Combining these regularization methods enables scaling locate-then-edit methods to 10,000 edits while reducing editing time by 42-61%. These results show that targeted regularization is essential for lifelong knowledge editing.

Lifelong Knowledge Editing requires Better Regularization
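To make the norm-growth issue in the abstract above concrete, here is a minimal sketch of one way a Frobenius norm constraint could be enforced: after each edit, the matrix is rescaled so its Frobenius norm never exceeds a bound. This projection-style reading, the `max_norm` parameter, and the function names are assumptions for illustration; the paper may formulate its constraint differently (for example, as a penalty term in the editing objective).

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def clip_frobenius(m: Matrix, max_norm: float) -> Matrix:
    """Rescale m so that its Frobenius norm is at most max_norm (hypothetical bound)."""
    norm = frobenius_norm(m)
    if norm <= max_norm:
        # Already inside the norm ball: return an unchanged copy.
        return [row[:] for row in m]
    scale = max_norm / norm
    return [[x * scale for x in row] for row in m]


# Toy example: an "edited" matrix whose norm has grown to 5.0 is pulled back to 2.5.
edited = [[3.0, 4.0]]
constrained = clip_frobenius(edited, max_norm=2.5)
```

Rescaling preserves the direction of the edited weights while capping their magnitude, which is the intuition behind keeping sequential edits from inflating the matrix indefinitely.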

When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.

Instability in Downstream Task Performance During LLM Pretraining
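Checkpoint averaging, one of the two post-hoc methods named in the abstract above, can be sketched as an element-wise mean of parameters over neighboring checkpoints. The representation of a checkpoint as a dictionary of flat weight lists and the function name are assumptions for illustration; real checkpoints would hold tensors, and the paper's exact averaging window is not specified here.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of every parameter across the checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        tensors = [ckpt[name] for ckpt in checkpoints]
        # zip(*tensors) groups the same weight position across all checkpoints.
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*tensors)]
    return averaged


# Toy example: two neighboring checkpoints of a one-parameter model.
ckpt_a = {"w": [1.0, 2.0]}
ckpt_b = {"w": [3.0, 4.0]}
merged = average_checkpoints([ckpt_a, ckpt_b])
```

Averaging in parameter space yields a single model with no extra inference cost, whereas an ensemble would instead combine the predictions of the individual checkpoints at inference time.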

The rise of LoRA-sharing communities lets users enjoy powerful, efficient, and personalized LLMs by simply downloading small and pluggable LoRAs. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can distribute malicious LoRAs to a community eager to try out shared assets. Despite the high-risk potential, no prior art has comprehensively explored LoRA's attack surface under the downstream-enhancing share-and-play context. In this paper, we investigate how backdoors can be injected into task-enhancing LoRAs and examine the mechanisms of such infections. We find that with a simple, efficient, yet specific recipe, **a backdoor LoRA can be trained once and then seamlessly merged (in a transferable/training-free fashion) with multiple task-enhancing LoRAs, retaining both its malicious backdoor and benign downstream capabilities.** This allows attackers to scale the distribution of compromised LoRAs with minimal effort by leveraging the rich pool of existing shared LoRA assets. We note that such merged LoRAs are particularly infectious — because their malicious intent is cleverly concealed behind improved downstream capabilities, creating a strong incentive for voluntary download — and dangerous — because under local deployment, no safety measures exist to intervene when things go wrong. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs, highlighting the urgent need for heightened security awareness in the LoRA ecosystem. **Warning: This paper contains offensive content and involves a real-life tragedy.**

LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem

Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

Persona prompting is increasingly used in large language models (LLMs) to simulate the attitudes, values, and perspectives of various sociodemographic groups. However, different persona prompting strategies can significantly affect outcomes, raising concerns about the representativeness of such simulations. We systematically examine how different strategies for persona prompting, specifically role adoption formats and demographic priming strategies, influence LLM behavior across diverse identity groups. We evaluate five open-source LLMs for simulating 15 intersectional demographic groups across both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, exhibiting more stereotypes and lower alignment. However, prompting in an interview-style format and name-based priming consistently improve representativeness, and yield more diverse outputs. Surprisingly, larger models like Llama-3.3-70B perform worse than smaller ones, with OLMo-2-7B achieving the best results. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.
