Multi-modal Dialogue Summarization (MDS) is an important task with wide-ranging applications. Developing and improving MDS models calls for strong automatic evaluation methods, which can save substantial time and cost; building such methods, in turn, requires a meta-evaluation benchmark with human annotations. The lack of such a benchmark motivates us to introduce MDSEval, the first meta-evaluation benchmark for MDS, providing data-summary pairs together with human annotations of summary quality across eight aspects. Beyond the benchmark dataset itself, we propose a novel filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities, which we use to enhance data quality. Further, our work is the first to define key evaluation aspects for MDS. Our findings reveal that current multi-modal evaluation methods struggle to fairly rate summaries generated by advanced multimodal large language models (MLLMs). Our dataset, filtering method, defined evaluation aspects, and findings will benefit the development of future MDS evaluation methods.
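As a purely illustrative sketch (not the MDSEval implementation), the snippet below shows one plausible shape for a MEKI-style filter: it keeps a sample only when each modality contributes enough key information that the other modality does not cover. The helpers key_info and meki_score, the caption-based image surrogate, and the 0.3 threshold are all hypothetical placeholders.

```python
# Illustrative sketch only: one possible structure for filtering dialogue-image
# pairs by Mutually Exclusive Key Information (MEKI). Helper functions and the
# threshold are hypothetical, not the method described in the paper.
from dataclasses import dataclass


@dataclass
class Sample:
    dialogue: str
    image_caption: str  # assumed textual surrogate for the image modality


def key_info(text: str) -> set[str]:
    """Hypothetical key-information extractor (e.g., salient content words)."""
    return {tok.lower().strip(".,!?") for tok in text.split() if len(tok) > 4}


def meki_score(a: set[str], b: set[str]) -> float:
    """Fraction of key information in `a` that is NOT covered by `b`."""
    if not a:
        return 0.0
    return len(a - b) / len(a)


def keep_sample(s: Sample, threshold: float = 0.3) -> bool:
    """Keep a sample only if each modality contributes enough exclusive key
    information, so a good summary must draw on both dialogue and image."""
    text_ki = key_info(s.dialogue)
    image_ki = key_info(s.image_caption)
    return (
        meki_score(text_ki, image_ki) >= threshold
        and meki_score(image_ki, text_ki) >= threshold
    )
```

The intent of such a filter is to discard samples whose summary could be written from a single modality alone, which would make the benchmark unable to distinguish truly multi-modal summarizers from text-only ones.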