China

LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors—especially batch size—remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during backpropagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights. Our experiments validate these theoretical insights.

EMNLP 2025

Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding

convergece analysis

stochastic rounding

efficient training

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large language models (LLMs) have achieved remarkable success in generative tasks, yet they often fall short in ensuring the factual accuracy of their outputs this limiting their reliability in real-world applications where correctness is critical. In this paper, we present FactReasoner, a novel neuro-symbolic based factuality assessment framework that employs probabilistic reasoning to evaluate the truthfulness of long-form generated responses. FactReasoner decomposes a response into atomic units, retrieves relevant contextual information from external knowledge sources, and models the logical relationships (e.g., entailment, contradiction) between these units and their contexts using probabilistic encodings. It then estimates the posterior probability that each atomic unit is supported by the retrieved evidence. Our experiments on both labeled and unlabeled benchmark datasets demonstrate that FactReasoner often outperforms state-of-the-art prompt-based methods in terms of factual precision and recall.

FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present **RadialRouter**, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named **RadialFormer** to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2\% and 5.8\% in the *Balance* and *Cost First* scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.

RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

Multi-modal Dialogue Summarization (MDS) is an important task with great applications. To develop and improve the MDS model, a strong automatic evaluation method for MDS could save a lot of money and time. However, a meta-evaluation benchmark with human annotation is a foundation for developing and improving the automatic evaluation methods of MDS. The shortage of such a benchmark motivates us to introduce MDSEval, the first meta-evaluation benchmark for MDS, providing data-summary pairs and human annotation on summary quality across 8 aspects. Besides the novel benchmark dataset, we propose a novel filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities used in enhancing our data quality. Further, our work is the first to define key evaluation aspects for MDS tasks. Our findings reveal that current multi-modal evaluation methods struggle to fairly rate summaries generated by advanced MLLMs. Our datasets, filtering methods, defined evaluation aspects, and findings will greatly benefit the development of MDS evaluation methods

MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

While large language models (LLMs) have demonstrated impressive creative capabilities, most research has predominantly focused on English texts, often overlooking non-English literary traditions and lacking standardized methods for assessing creativity. In this paper, we investigate the ability of various LLMs to comprehend and generate Persian literary text with culturally relevant expressions. To this end, we construct a dataset of user-generated Persian literary content spanning 20 diverse topics. We evaluate model outputs based on four key dimensions: originality, fluency, flexibility, and elaboration by adapting the Torrance Tests of Creative Thinking to the Persian cultural context. Furthermore, we assess the models’ understanding of four fundamental literary devices: simile, metaphor, hyperbole, and antithesis. Combining human judgment with an LLM-as-judge approach, we evaluate the creativity of the generated texts. Our analysis reveals a strong agreement between human and model assessments, highlighting the potential of LLMs to meaningfully engage with Persian literary culture. However, these models still require further refinement to fully grasp and interpret literary devices.

Evaluating the Creativity of LLMs in Persian Literary Text Generation

Emotion context sensitivity—the ability to adjust emotional responses based on contexts—is a core component of human emotional intelligence. For example, being told, "You can come with me if you want," may elicit joy if the destination is a mall, but provoke fear if the destination is a trap house. As large language models (LLMs) are increasingly deployed in socially interactive settings, understanding this human ability becomes crucial for generating context-appropriate, emotion-aware responses. In this work, we introduce Trace, a novel benchmark for evaluating whether LLMs can understand emotion context sensitivity of humans. This benchmark consists of 1,626 social scenarios and comprises two complementary tests: a sensitivity test, which measures whether models can detect emotional shifts caused by context changes, and a robustness test, which evaluates whether models can maintain stable emotion predictions when context changes are emotionally irrelevant. Each scenario pair keeps the core event constant while systematically varying contextual details—time, place, or agent—based on insights from behavioral theory and emotion psychology. Experimental results show that even the best-performing LLMs lag behind human performance by 20% in the sensitivity test and 15% in the robustness test, indicating substantial room for improvement in emotion-aware reasoning.

Going to a trap house conveys more fear than "Going to a mall": Benchmarking Emotion Context Sensitivity for LLMs

Large language models (LLMs) have shown impressive performance by generating reasoning paths before final answers, but learning such a reasoning path requires costly human supervision. To address this issue, recent studies have explored self-training methods that improve reasoning capabilities using pseudo-labels generated by the LLMs themselves. Among these, confidence-based self-training fine-tunes LLMs to prefer reasoning paths with high-confidence answers, where confidence is estimated via majority voting. However, such methods exclusively focus on the quality of the final answer and may ignore the quality of the reasoning paths, as even an incorrect reasoning path leads to a correct answer by chance. Instead, we advocate the use of reasoning-level confidence to identify high-quality reasoning paths for self-training, supported by our empirical observations. We then propose a new self-training method, **CORE-PO**, that fine-tunes LLMs to prefer high-**CO**nfidence **RE**asoning paths through **P**olicy **O**ptimization. Our experiments show that CORE-PO improves the accuracy of outputs on four in-distribution and two out-of-distribution benchmarks, compared to existing self-training methods.

Self-Training Large Language Models with Confident Reasoning

Can Large Language Models~(LLMs) simulate humans in making important decisions? Recent research has unveiled the potential of using LLMs to develop role-playing language agents (RPLAs), mimicking mainly the knowledge and tones of various characters. However, imitative decision-making necessitates a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions provided by the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 2,512 characters' decision points from 470 books. Then, we conduct comprehensive experiments on LIFECHOICE with various LLMs and RPLA methodologies. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet substantial room for improvement remains. Hence, we further propose the CHARMAP method, which adopts persona-based memory retrieval and significantly advances RPLAs on this task.

Character is Destiny: Can Persona-assigned Language Models Make Personal Choices?

Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose **DERN** (**D**ropping **E**xperts, **R**ecombining **N**eurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.

Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs

Multi-modal intent recognition (MIR) requires integrating non-verbal cues from real-world contexts to enhance human intention understanding, which has attracted substantial research attention in recent years. Despite promising advancements, a comprehensive survey summarizing recent advances and new frontiers remains absent. To this end, we present a thorough and unified review of MIR, covering different aspects including (1) Extensive survey: we take the first step to present a thorough survey of this research field covering textual, visual (image/video), and acoustic signals. (2) Unified taxonomy: we provide a unified framework including evaluation protocol and advanced methods to summarize the current progress in MIR. (3) Emerging frontiers: We discuss some future directions such as multi-task, multi-domain, and multi-lingual MIR, and give our thoughts respectively. (4) Abundant resources: we collect abundant open-source resources including relevant papers, data corpora, and leaderboards. We hope this survey can shed light on future research in MIR.

A Survey on Multi-modal Intent Recognition: Recent Advances and New Frontiers

Legal research depends on headnotes—concise summaries that help lawyers quickly identify relevant cases. Yet, many court decisions lack them due to the high cost of manual annotation. To address this gap, we introduce the Swiss Landmark Decisions Summarization (SLDS) dataset: 20K rulings from the Swiss Federal Supreme Court, each with headnotes in German, French, and Italian. SLDS has the potential to significantly improve access to legal information and transform legal research in Switzerland. We fine-tune open models (Qwen2.5, Llama 3.2, Phi-3.5) and compare them to larger general-purpose and reasoning-tuned LLMs, including GPT-4o, Claude 3.5 Sonnet, and the open-source DeepSeek R1. Using an LLM-as-a-Judge framework, we find that fine-tuned models perform well in terms of lexical similarity, while larger models generate more legally accurate and coherent summaries. Interestingly, reasoning-focused models show no consistent benefit, suggesting that factual precision is more important than deep reasoning in this task. We release SLDS under a CC BY 4.0 license to support future research in cross-lingual legal summarization.

Downloads

Next from EMNLP 2025

FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES