China

We observe that Contrastive Language-Image Pretraining (CLIP) models struggle with real-world downstream tasks such as road traffic anomaly detection because they fail to capture spatial and action relationships between objects within images. To address this, we compile and curate a dataset of 1M image samples, using language supervision provided by the common image caption dataset, in which each image is paired with subject-relationship-object descriptions emphasizing spatial and action interactions, and train a **S**patial and **A**ction relationship-aware **CLIP** (**SA-CLIP**) model. We evaluate the proposed model on the Visual Spatial Reasoning (VSR) dataset and further verify its effectiveness on the Detection-of-Traffic-Anomaly (DoTA) dataset. Experimental results show that SA-CLIP demonstrates strong abilities in understanding spatial relationships while achieving good zero-shot performance on the traffic anomaly detection task.

EMNLP 2025

SA-CLIP: Language Guided Image Spatial and Action Feature Learning

image text matching

cross-modal application

dependency parsing

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

While powerful, large language models (LLMs) present significant fine-tuning challenges due to their size. Parameter-efficient fine-tuning (PEFT) methods like LoRA provide solutions, yet suffer from critical optimizer inefficiencies, notably basis redundancy in LoRA's B matrix when using AdamW, which fundamentally limits performance. We address this by optimizing the B matrix on the Stiefel manifold, imposing explicit orthogonality constraints that achieve near-perfect orthogonality and full effective rank. This geometric approach dramatically enhances parameter efficiency and representational capacity. Our Stiefel optimizer consistently outperforms AdamW across benchmarks with both LoRA and DoRA, demonstrating that geometric constraints are the key to unlocking LoRA's full potential for effective LLM fine-tuning.

Riemannian Optimization for LoRA on the Stiefel Manifold
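The abstract does not spell out the optimizer's update rule, so the following is only a minimal sketch of one common way to keep a matrix on the Stiefel manifold: project the Euclidean gradient onto the tangent space, take a step, and retract with a QR decomposition. The function name `stiefel_step`, the learning rate, and the matrix sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stiefel_step(B, grad, lr=0.1):
    """One Riemannian gradient step on the Stiefel manifold {B : B^T B = I}.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Project the Euclidean gradient onto the tangent space at B.
    sym = (B.T @ grad + grad.T @ B) / 2.0
    riem_grad = grad - B @ sym
    # Retract the updated point back onto the manifold via reduced QR.
    Q, R = np.linalg.qr(B - lr * riem_grad)
    # Fix column signs so the retraction is uniquely defined.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(0)
d, r = 16, 4  # assumed sizes: output dimension and LoRA rank
B, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal starting point
grad = rng.standard_normal((d, r))                # stand-in for a loss gradient

B_new = stiefel_step(B, grad)
ortho_error = np.linalg.norm(B_new.T @ B_new - np.eye(r))
effective_rank = np.linalg.matrix_rank(B_new)
```

After the step, `ortho_error` stays at the level of floating-point noise and `effective_rank` equals `r`, which is the kind of orthogonality and full rank the abstract describes.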

Self-report questionnaires have long been used to assess LLM personality traits, yet they fail to capture behavioral nuances due to biases and meta-knowledge contamination. This paper proposes a novel multi-observer framework for personality trait assessments in LLM agents that draws on informant-report methods in psychology. Instead of relying on self-assessments, we employ multiple observer agents. Each observer is configured with a specific relational context (e.g., family member, friend, or coworker) and engages the subject LLM in dialogue before evaluating its behavior across the Big Five dimensions. We show that these observer-report ratings align more closely with human judgments than traditional self-reports and reveal systematic biases in LLM self-assessments. We also find that aggregating responses from 5 to 7 observers reduces systematic biases and achieves optimal reliability. Our results highlight the role of relationship context in perceiving personality and demonstrate that a multi-observer paradigm offers a more reliable, context-sensitive approach to evaluating LLMs.

Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as graphs of arguments. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assign arguments acceptability scores based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graph-aware reasoning.

Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics
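To make the scoring idea concrete, here is a minimal sketch of QuAD-style acceptability scores as the semantics is commonly defined: attackers scale the base score down, supporters push it up, and the two results are averaged when both are present. The tiny debate graph, the uniform base scores of 0.5, and the function name `quad_score` are hypothetical, and the sketch assumes an acyclic graph; it is not the paper's evaluation code.

```python
from math import prod

def quad_score(arg, base, attackers, supporters):
    """Recursively score `arg` in an acyclic argument graph (illustrative)."""
    att = [quad_score(a, base, attackers, supporters)
           for a in attackers.get(arg, [])]
    sup = [quad_score(s, base, attackers, supporters)
           for s in supporters.get(arg, [])]
    v0 = base[arg]
    if not att and not sup:
        return v0  # leaf argument keeps its base score
    # Attack aggregation: each attacker shrinks the score multiplicatively.
    v_att = v0 * prod(1 - v for v in att)
    # Support aggregation: each supporter shrinks the gap to 1.
    v_sup = 1 - (1 - v0) * prod(1 - v for v in sup)
    if not sup:
        return v_att
    if not att:
        return v_sup
    return (v_att + v_sup) / 2

# Hypothetical debate: "con" attacks "claim", "rebuttal" attacks "con",
# and "pro" supports "claim".
base = {"claim": 0.5, "pro": 0.5, "con": 0.5, "rebuttal": 0.5}
attackers = {"claim": ["con"], "con": ["rebuttal"]}
supporters = {"claim": ["pro"]}

scores = {a: quad_score(a, base, attackers, supporters) for a in base}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the rebutted attacker `con` drops to 0.25 and `claim` rises to 0.5625, so `claim` ranks first and `con` last. A ranking like this is the kind of reference ordering a model's output could be compared against.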

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is critical for advancing video understanding and high-resolution image analysis. Achieving this requires systematic improvements in model architecture, data construction, and training strategies, particularly to address challenges such as performance degradation with increasing image counts and high computational costs. In this paper, we propose a hybrid architecture that integrates Mamba and Transformer blocks, introduce data construction methods that capture both temporal and spatial dependencies, and employ a progressive training strategy. Our released model, LongLLaVA (Long-Context Large Language and Vision Assistant), demonstrates an effective balance between efficiency and performance. LongLLaVA achieves competitive results across various benchmarks while maintaining high throughput and low memory consumption. Notably, it can process nearly one thousand images on a single A100 80GB GPU, underscoring its potential for a wide range of multi-modal applications.

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs’ accuracy on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods’ effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.

Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Recent advancements in Aspect-Based Sentiment Analysis (ABSA) have shown promising results, yet the semantics derived solely from textual data remain limited. To overcome this challenge, we propose a novel approach by venturing into the unexplored territory of generating sentimental images. Our method introduces a *synthetic image generation framework* tailored to produce images that are highly congruent with both textual and sentimental information for aspect-based sentiment analysis. Specifically, we first develop a supervised image generation model to generate synthetic images aligned with both text and sentiment information. Furthermore, we employ a visual refinement technique to substantially enhance the quality and pertinence of the generated images. We then propose a multi-modal model that integrates both the original text and the synthetic images for aspect-based sentiment analysis. Extensive evaluations on multiple benchmark datasets demonstrate that our model significantly outperforms state-of-the-art methods. These results highlight the effectiveness of our supervised image generation approach in enhancing ABSA.

Aspect-based Sentiment Analysis via Synthetic Image Generation

Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, little is still known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large-scale dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We analyze how linguistic and cognitive factors, such as concreteness, comprehensibility, readability, and uptake, influence engagement in educational dialogues. Finally, we investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B-8B parameters) carefully fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings.

IntrEx: A Dataset for Modeling Engagement in Educational Conversations

Although offensive language continually evolves over time, even recent studies using LLMs have predominantly relied on outdated datasets and rarely evaluated generalization to unseen texts. In this study, we constructed a large-scale dataset of contemporary political discourse and employed three refined judgments in the absence of ground truth. Each judgment reflects a representative offensive language detection method and is carefully designed for optimal conditions. We identified distinct patterns for each judgment and demonstrated tendencies of label agreement using a leave-one-out strategy. By establishing pseudo-labels as ground trust for quantitative performance assessment, we observed that a strategically designed single-prompt approach achieves performance comparable to more resource-intensive methods. This suggests a feasible approach applicable in real-world settings with inherent constraints.

From Ground Trust to Truth: Disparities in Offensive Language Judgments on Contemporary Korean Political Discourse

As machine translation systems approach human-level quality, traditional evaluation methodologies struggle to detect subtle translation errors. We critically examine limitations in current gold-standard approaches (MQM and ESA), including excessive categorization complexity, coarse severity granularity, significant bias towards accuracy at the expense of fluency, and concerning annotation time constraints. Through in-depth analysis of English-Russian translations from WMT24, we demonstrate that employing highly qualified professional translators without strict time limitations produces substantially different results from standard evaluations. We propose the RATE (Refined Assessment for Translation Evaluation) framework and collect high-quality annotations with streamlined error categorization, expanded severity ratings, and multidimensional scoring that balances accuracy and fluency assessments. Our analysis reveals that state-of-the-art MT systems may have surpassed human translations in accuracy while still lagging in fluency, a critical distinction obscured by existing accuracy-biased metrics. Our findings indicate that improving evaluation depth and expertise may be as critical to advancing the field as developing better translation systems.

Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems

Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both text and images has not been fully explored, especially in settings where fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that unifies the various text-image connections in a document into a multi-modal graph and retrieves texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further relevant content for answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on multi-modal documents.
