China

Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce **Group-Aware Policy Optimization (GAPO)**, a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
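The abstract does not give the reward formula, so the following is only a minimal sketch of what a frequency-aware, group-level reward could look like. It assumes that each valid completion is rewarded with the inverse of its count within the sampled group and that invalid completions get zero, followed by the usual group-relative standardization. The function names, the `1 / count` rule, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def frequency_aware_rewards(completions, valid_answers):
    """Hypothetical group-level reward: a valid completion earns
    1 / (number of times it appears in the group); invalid ones earn 0."""
    counts = Counter(c for c in completions if c in valid_answers)
    return [1.0 / counts[c] if c in valid_answers else 0.0 for c in completions]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of five sampled completions for one prompt (assumed data).
group = ["red", "red", "red", "blue", "purple?"]
valid = {"red", "blue", "green"}

rewards = frequency_aware_rewards(group, valid)
advantages = group_relative_advantages(rewards)
```

Under these assumptions the rarely sampled valid answer (`"blue"`) receives the largest advantage, the over-sampled valid answer (`"red"`) a smaller one, and the invalid completion the smallest, which pushes the policy toward a more uniform distribution over valid answers.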

EMNLP 2025

Group-Aware Reinforcement Learning for Output Diversity in Large Language Models

uniform sampling

output diversity

grpo

policy optimization

llm training

mode collapse

diversity

language models

reinforcement learning

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multimodal knowledge graphs (MMKGs). However, the intrinsic noise within modalities, such as the inconsistency in visual modality and redundant attributes, has not been thoroughly investigated. Excessive noise not only weakens semantic representation but also increases the risk of overfitting in attention-based fusion methods. To address this, we propose LGEA, a novel LLM-guided MMEA framework that prioritizes noise reduction before fusion. Specifically, LGEA introduces two key strategies: (1) fine-grained visual filtering to remove irrelevant images at the semantic level, and (2) contextual summarization of attribute information to enhance entity semantics. To our knowledge, this is the first work to apply LLMs for both visual filtering and attribute-level semantic enhancement in MMEA. Experiments on multiple benchmarks, including the noisy FB YG dataset, show that LGEA sets a new state-of-the-art (SOTA) in robust multi-modal alignment, highlighting the potential of noise-aware strategies as a promising direction for future MMEA research.

Breaking the Noise Barrier: LLM-Guided Semantic Filtering and Enhancement for Multi-Modal Entity Alignment

Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-life situations. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other in turn to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and evaluation benchmarks have emerged for high-resource languages like English. However, a critical gap remains: the lack of comprehensive, high-quality benchmarks for low-resource languages such as Korean, which hinders reliable model development and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. We hope that our generation pipeline will be adaptable to other languages, accelerating multilingual VLM research. The code and dataset for KRETA are available at [anonymous.4open.science](https://anonymous.4open.science/r/KRETA-90D9/README.md).

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts

Deep learning models have been successful in many areas, but understanding their behavior remains a challenge. Most prior explainable AI (XAI) approaches have focused on interpreting how models make predictions. In contrast, we introduce a novel approach that identifies textual descriptions most beneficial for model training. By analyzing which descriptions contribute most effectively to model training, our method has the potential to provide insights into how the model prioritizes and utilizes information for decision-making. To achieve this, we propose a pipeline that generates textual descriptions using large language models, incorporates external knowledge bases, and refines them through influence estimation and CLIP score. Furthermore, leveraging the phenomenon of cross-modal transferability, we propose a novel benchmark task named *cross-modal transfer classification* to examine the effectiveness of our textual descriptions. In zero-shot experiments, we demonstrate that our textual descriptions improve classification accuracy compared to baselines, leading to consistent performance gains across nine image classification datasets. Additionally, understanding which descriptions contribute most to model performance can shed light on how the model utilizes textual information in its decision-making.

Data Descriptions from Large Language Models with Influence Estimation

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose Contrasting Personal Preference (CoPe), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L without relying on external reward models or additional training procedures.
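The abstract does not spell out the decoding rule, so the snippet below is only a hedged sketch of one common way to contrast a personalized (PEFT-tuned) model with its base model at decoding time. It assumes the implicit reward for a token is the log-probability ratio between the two models and that the next token maximizes the tuned log-probability plus `alpha` times that reward. The scoring rule, the `alpha` parameter, and the toy logits are assumptions made for illustration, not the CoPe algorithm itself.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_next_token(tuned_logits, base_logits, alpha=1.0):
    """Pick the token maximizing log p_tuned + alpha * (log p_tuned - log p_base).

    The bracketed term plays the role of an implicit per-token reward
    (assumed form); alpha = 0 recovers greedy decoding from the tuned model.
    """
    tuned_lp = log_softmax(tuned_logits)
    base_lp = log_softmax(base_logits)
    scores = [t + alpha * (t - b) for t, b in zip(tuned_lp, base_lp)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy three-token vocabulary and logits (assumed data).
vocab = ["the", "cheers", "regards"]
base_logits = [2.0, 0.0, 1.0]
tuned_logits = [2.0, 1.8, 0.5]

greedy_id, _ = contrastive_next_token(tuned_logits, base_logits, alpha=0.0)
contrast_id, _ = contrastive_next_token(tuned_logits, base_logits, alpha=1.0)
```

In this toy case greedy decoding from the tuned model still picks the generic token `"the"`, while the contrastive score favors `"cheers"`, the token whose probability rose most after user-specific fine-tuning.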

Personalized LLM Decoding via Contrasting Personal Preference

Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, much previous work has focused on identifying errors in generated text and refining them; however, these methods are slow in deployment because they verify an LLM's response only after the entire generation (from the first to the last token) is done. Further, we observe that once LLMs generate incorrect tokens early on, there is a higher likelihood that subsequent tokens will also be factually incorrect. To this end, in this work, we propose Streaming-VR (Streaming Verification and Refinement), a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs. Specifically, the proposed Streaming-VR enables on-the-fly verification and correction of tokens as they are being generated, similar to a streaming process, ensuring that each subset of tokens is checked and refined in real time by another LLM as the LLM constructs its response. Through comprehensive evaluations on multiple datasets, we demonstrate that our approach not only enhances the factual accuracy of LLMs, but also offers a more efficient solution compared to prior refinement methods.
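The control flow described above can be sketched as a simple loop, shown here with toy stand-ins for both models. The callables `generate_chunk` and `verify_and_refine`, the chunk granularity, and the assumption that later chunks are conditioned on the already corrected prefix are all hypothetical choices for illustration, not the authors' implementation.

```python
def stream_with_verification(generate_chunk, verify_and_refine, max_chunks=8):
    """Interleave generation and verification chunk by chunk.

    generate_chunk(prefix) -> next chunk of text, or None when finished.
    verify_and_refine(prefix, chunk) -> the chunk, possibly corrected.
    Both callables are hypothetical stand-ins for LLM calls.
    """
    accepted = []
    for _ in range(max_chunks):
        chunk = generate_chunk(accepted)
        if chunk is None:
            break
        # Check the chunk right away, so later chunks build on a corrected prefix.
        accepted.append(verify_and_refine(accepted, chunk))
    return accepted


# Toy generator and verifier (assumed data, no real models involved).
draft = ["Paris is the capital", "of Italy,", "on the Seine."]
fixes = {"of Italy,": "of France,"}


def toy_generator(prefix):
    return draft[len(prefix)] if len(prefix) < len(draft) else None


def toy_verifier(prefix, chunk):
    return fixes.get(chunk, chunk)


result = stream_with_verification(toy_generator, toy_verifier)
```

Because each chunk is checked before the next one is requested, an early factual error is corrected before it can influence the rest of the response, in contrast to verification that runs only after the full answer exists.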

Efficient Real-time Refinement of Language Model Text Generation

Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
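To make the idea of a rank reversal concrete, the toy example below shows how multiplying every logit by a single scalar can change which masked position a confidence-based sampler would unmask first. The mapping from reward to scale (`1 + reward`), the placeholder reward value, and the two hand-picked logit vectors are assumptions for illustration only; the abstract does not specify the actual scaling rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def most_confident_position(position_logits, scale=1.0):
    """Return the index of the position with the highest top-token probability
    after multiplying all logits by `scale`, plus the per-position confidences."""
    confidences = [
        max(softmax([scale * x for x in logits])) for logits in position_logits
    ]
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return best, confidences


# Two masked positions over an 8-token vocabulary (assumed toy logits).
position_logits = [
    [3.0, 2.0] + [-9.0] * 6,  # position 0: two strong competitors
    [2.5] + [0.0] * 7,        # position 1: one leader, many weak alternatives
]

reward = 1.0          # placeholder for a sequence-level reward score (assumed)
scale = 1.0 + reward  # assumed mapping from reward to logit scale

plain_choice, plain_conf = most_confident_position(position_logits)
scaled_choice, scaled_conf = most_confident_position(position_logits, scale)
```

Without scaling, position 0 has the higher confidence and would be unmasked first; after scaling, position 1 overtakes it, which is the kind of change in selection order the abstract refers to.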

Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs

Contrastive audio-language models are learned by semantically aligning different modalities in a shared embedding space. Existing research shows that zero-shot classification performance is sensitive to language nuances and prompt formulation. In addition, learned artifacts and spurious correlations from noisy pretraining often lead to semantic ambiguity in label interpretation. While recent work has explored few-shot prefix tuning methods, adapters, and prompt engineering strategies to mitigate these issues, the use of structured prior knowledge remains largely unexplored. In this work, we enhance CLAP predictions using structured reasoning over a knowledge graph (KG). We construct a large, audio-centric KG that encodes ontological relations comprising semantic, causal, and taxonomic connections reflective of everyday sound scenes and events. A systematic analysis of retrieval performance across major publicly available audio collections demonstrates that symbolic knowledge enables robust semantic grounding for contrastive audio-language models. This improvement is further supported by embedding visualizations of CLAP before and after incorporating the KG.

iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models

Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly for a more real-world use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of *intent-based chart generation* from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an unsupervised, two-stage framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising 1,242 <intent, document, charts> tuples from two domains, finance and scientific, in contrast to the existing datasets that are largely limited to parallel text descriptions/tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation using LLMs and query-based retrieval methods; our method outperforms the best baselines by up to 9 points in chart data accuracy and up to 17 points in chart type.

Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents

Characters are important in narratives. They move the plot forward, create emotional connections, and embody the story's themes. Visual storytelling methods focus more on the plot and events relating to it, without building the narrative around specific characters. As a result, the generated stories feel generic, with character mentions being absent, vague, or incorrect. To mitigate these issues, we introduce a new character-centric approach to visual story generation. We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. Specifically, we develop an automated pipeline to enrich VIST with visual and textual character coreference chains. We also propose new evaluation metrics to measure the richness of characters and coreference in stories. Experimental results show that our model generates stories with recurring characters which are consistent and coreferent to a larger extent compared to baselines and state-of-the-art systems.
