
Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences but faces efficiency challenges. We explore two approaches leveraging human gaze prediction to enhance RLHF: (1) gaze-aware reward models and (2) gaze-based distribution of sparse rewards at token level. Our experiments show gaze-informed RLHF achieves faster convergence while maintaining or slightly improving performance, reducing computational requirements during policy optimization. Human visual attention patterns provide valuable signals for policy training, suggesting a promising direction for improving RLHF efficiency through human-like attention mechanisms.

EMNLP 2025

Enhancing RLHF with Human Gaze Modeling

human gaze

rlhf

human-centered nlp

poster
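The second approach in the abstract above, gaze-based distribution of sparse rewards at token level, can be illustrated with a minimal sketch. It assumes that a gaze predictor has already produced one non-negative weight per token and that the sequence-level reward is split in proportion to those weights. The function name `distribute_reward`, the uniform fallback, and the example weights are hypothetical and are not taken from the paper.

```python
from typing import List


def distribute_reward(reward: float, gaze_weights: List[float]) -> List[float]:
    """Split one sequence-level reward across tokens in proportion to gaze weights.

    Assumption: gaze_weights holds one non-negative predicted-attention value
    per token (hypothetical output of a gaze prediction model).
    """
    if not gaze_weights:
        return []
    if any(w < 0 for w in gaze_weights):
        raise ValueError("gaze weights must be non-negative")
    total = sum(gaze_weights)
    if total == 0:
        # No attention signal at all: fall back to a uniform split.
        return [reward / len(gaze_weights)] * len(gaze_weights)
    return [reward * w / total for w in gaze_weights]


# Three tokens; the middle token is predicted to attract the most gaze.
token_rewards = distribute_reward(1.0, [0.1, 0.6, 0.3])
```

The per-token rewards sum to the original scalar reward, so the total training signal is unchanged and only its placement across tokens differs.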

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Editing documents and PDFs using natural language instructions is desirable for many reasons: ease of use, greater accessibility for non-technical users, and support for creativity. To do this automatically, a system first needs to understand the user’s intent and convert it into an executable plan or command, and then identify or localize the elements that the user wants to edit. While there exist methods that can accomplish these tasks, a major bottleneck in these systems is the inability to ground the spatial edit location effectively. We address this gap through our proposed system, DELOC (Document Element LOCalizer). DELOC adapts the grounding capabilities of existing Multimodal Large Language Models (MLLMs) from natural images to PDFs. This adaptation involves two novel contributions: 1) synthetically generating PDF-grounding instruction tuning data from partially annotated datasets; and 2) synthetic data cleaning via Code-NLI, an NLI-inspired process to clean data using generated Python code. The effectiveness of DELOC is apparent in the >3x zero-shot improvement it achieves over the next best Multimodal LLM, GPT-4o.

DELOC: Document Element Localizer

Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.

ComicScene154: A Scene Dataset for Comic Analysis

The Consensual Assessment Technique (CAT) evaluates creativity through holistic expert judgments. We investigate the use of two advanced Large Language Models (LLMs), Claude-3-Opus and GPT-4o, to evaluate poetry with a methodology inspired by the CAT. Using a dataset of 90 poems, we found that these LLMs can surpass the results achieved by non-expert human judges at matching a ground truth based on publication venue, particularly when assessing smaller subsets of poems. Claude-3-Opus performed slightly better than GPT-4o. We show that LLMs are viable tools for accurately assessing poetry, paving the way for their broader application in other creative domains.

Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel, test-time approach to detecting model hallucination through systematic analysis of information flow across model layers. We target cases when LLMs process inputs with ambiguous or insufficient context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I improves model reliability by immediately integrating with universal LLMs without additional training or architectural modifications.

Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions

Human moderators in online discussions face a heterogeneous range of tasks, which go beyond content moderation, or policing. They also support and improve discussion quality, which is challenging to model (and evaluate) in NLP due to its inherent subjectivity and the scarcity of annotated resources. We address this gap by introducing PerspectiveMod, a dataset of online comments annotated for the question: *"Does this comment require moderation, and why?"* Annotations were collected from both expert moderators and trained non-experts. **PerspectiveMod** is unique in its intentional variation across (a) the level of moderation experience embedded in the source data (professional vs. non-professional moderation environments), (b) the annotator profiles (experts vs. trained crowdworkers), and (c) the richness of each moderation judgment, both in terms of fine-grained comment properties (drawn from argumentation and deliberative theory) and in the representation of the individuality of the annotator (socio-demographics and attitudes towards the task). We advance understanding of the task's complexity by providing interpretation layers that account for its subjectivity. Our statistical analysis highlights the value of collecting annotator perspectives, including their experiences, attitudes, and views on AI, as a foundation for developing more context-aware and interpretively robust moderation tools.

PerspectiveMod: A Perspectivist Resource for Deliberative Moderation

The construct of morality permeates our entire lives and influences our behavior and how we perceive others. It therefore comes as no surprise that morality also plays an important role in politics, as morally framed arguments are perceived as more appealing and persuasive. Thus, being able to identify moral framing in political communication and to detect subtle differences in politicians’ moral framing can provide the basis for many interesting analyses in the political sciences. In this paper, we release MoralFramingInPolitics (MFiP), a new corpus of German parliamentary debates where the speakers’ moral framing has been coded, using the framework of Moral Foundations Theory (MFT). Our fine-grained annotations distinguish different types of moral frames and also include narrative roles, together with the moral foundations for each frame. We then present models for frame type and moral foundation classification and explore the benefits of data augmentation and contrastive learning for the two tasks. All data and code will be made available to the research community.

Moral Framing in Politics (MFiP): A new resource and models for moral framing

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.

Evaluating Large Language Models for Detecting Antisemitism

Speech learning involves controlling a complex motor system for uttering speech sounds from articulatory gestures and discovering a set of discrete and invariant units that provide entry to the linguistic system. Importantly, children seem to learn the relationships between speech sounds, the corresponding articulatory gestures, and these units in a weakly-supervised manner, with no explicit labeling of auditory inputs and no access to the articulatory gestures they should produce to reach an acoustic target. In this study, we propose a computational agent learning to drive a virtual vocal apparatus in order to repeat an auditory speech input. This model combines i) an articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, ii) two internal models respectively providing articulatory-to-acoustic forward predictions and acoustic-to-articulatory inverse computations, and iii) a (discrete) speech unit discovery module based on vector-quantized variational autoencoders (VQ-VAE). From this architecture, we provide two contributions. In a first experiment, we analyze the quantized embeddings learned by the VQ-VAE from ground truth data, and we show an interesting complementarity between acoustic and articulatory modalities which is potentially useful for the discovery of invariance. Then, we evaluate the performance of the proposed agent both at the acoustic and articulatory levels. We show that while most of the agent's productions are intelligible, the underlying articulatory trajectories of those productions are not systematically plausible. Finally, we present future perspectives for testing a developmental scenario for speech learning using end-to-end neural models.

Decode, move and speak! Self-supervised learning of speech units, gestures and sounds relationships using vocal imitation
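The discrete speech unit discovery module in the abstract above relies on vector quantization. A minimal sketch of that quantization step follows, assuming the standard VQ-VAE rule of mapping a continuous embedding to its nearest codebook entry under squared Euclidean distance. The names `quantize` and `codebook` and the toy values are hypothetical, and the sketch leaves out the encoder, decoder, and training losses.

```python
from typing import List, Sequence, Tuple


def quantize(vec: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Return (index, entry) of the codebook vector nearest to vec.

    Distance is squared Euclidean, the usual choice in VQ-VAE lookups.
    """
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))
    return idx, codebook[idx]


# Toy codebook with two discrete "units" in a 2-D embedding space.
codebook = [[0.0, 0.0], [1.0, 1.0]]
unit_index, unit_vector = quantize([0.9, 0.8], codebook)
```

The returned index is the discrete unit; in a full model the selected codebook vector would be passed on to the decoder in place of the continuous embedding.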

Large Language Models (LLMs) have demonstrated impressive performance across various domains. 
However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. 
Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), reducing memory usage but causing performance degradation. 
Additionally, converting fine-tuned models to low-precision representations further degrades performance. 
In this paper, we identify an imbalance in fine-tuning quantized LLMs with LoRA: overly complex adapter inputs and outputs versus low effective trainability of the adapter, leading to underfitting during fine-tuning.
Thus, we propose Quantized LLMs fine-tuning with Balanced Low-Rank Adaptation (Q-BLoRA), which simplifies the adapter’s inputs and outputs while increasing the adapter’s rank to alleviate underfitting during fine-tuning. 
For low-precision deployment, we propose Quantization-Aware fine-tuning with Balanced Low-Rank Adaptation (QA-BLoRA), which aligns with the block-wise quantization and facilitates quantization-aware fine-tuning of low-rank adaptation based on the parameter merging of Q-BLoRA.
Both Q-BLoRA and QA-BLoRA are easily implemented and offer the following optimizations: (i) Q-BLoRA consistently achieves state-of-the-art accuracy compared to baselines and other variants; (ii) QA-BLoRA enables the direct generation of low-precision inference models, which exhibit significant performance improvements over other low-precision models.
We validate the effectiveness of Q-BLoRA and QA-BLoRA across various models and scenarios.
Code will be made available at https://github.com/xiaocaigou/qbaraqahira.

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance in Adaptation
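For readers unfamiliar with the setting this abstract builds on, the sketch below shows plain LoRA applied on top of block-wise quantized base weights. It is not Q-BLoRA or QA-BLoRA, whose details the abstract does not give. The absmax quantizer, the `alpha / rank` scaling, and all function names are illustrative assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def quantize_blockwise(row: Vector, block: int, levels: int = 7) -> Vector:
    """Absmax-quantize each block to integers in [-levels, levels], then dequantize."""
    out: Vector = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        # One scale per block; fall back to 1.0 when the block is all zeros.
        scale = max(abs(v) for v in chunk) / levels or 1.0
        out.extend(round(v / scale) * scale for v in chunk)
    return out


def lora_forward(x: Vector, w_deq: Matrix, a: Matrix, b: Matrix,
                 alpha: float) -> Vector:
    """Compute y = W x + (alpha / rank) * B (A x).

    w_deq is the frozen, dequantized base weight (out x in);
    a (rank x in) and b (out x rank) are the trainable low-rank factors.
    """
    rank = len(a)
    base = matvec(w_deq, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / rank) * u for y, u in zip(base, update)]


# Frozen base weights, quantized row by row in blocks of two values.
w = [quantize_blockwise(r, block=2) for r in [[1.0, 0.0], [0.0, 1.0]]]
a = [[1.0, 1.0]]      # rank-1 down-projection
b = [[1.0], [0.0]]    # up-projection
y = lora_forward([1.0, 2.0], w, a, b, alpha=2.0)
```

Only `a` and `b` would be trained in this setup. Merging the low-rank update back into low-precision weights is the step the abstract identifies as a further source of degradation.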

Argument(ation) mining (AM) is the automated process of identification and extraction of argumentative structures in natural language. This field has seen rapid advancements, offering powerful tools to analyze and interpret complex and large discourse in diverse domains (political debates, medical reports, etc.). In this paper we introduce an AM-boosted version of BCause, a large-scale deliberation platform. The system enables the extraction and analysis of arguments from online discussions in the context of deliberative democracy, which aims to enhance the understanding and accessibility of structured argumentation in large-scale deliberation processes.
