China

We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks for its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully taught design, it evaluates VLMs&#39; consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric revealing low results that highlight models&#39; fragility. Last but not least, the video collection has been carefully selected to reflect the Italian culture, and the language data are produced by native-speakers.

EMNLP 2025

All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks for its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully taught design, it evaluates VLMs' consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric revealing low results that highlight models' fragility. Last but not least, the video collection has been carefully selected to reflect the Italian culture, and the language data are produced by native-speakers.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Entity alignment (EA) involves identifying and linking equivalent entities across different knowledge graphs (KGs). While knowledge graphs provide structured information about real-world entities, only a small fraction of these entities are aligned. The entity alignment process is challenging due to heterogeneity in KGs, such as differences in structure, terminology, and attribute details. Traditional EA methods use multi-aspect entity embeddings to align entities. Although these methods perform well in certain scenarios, their effective- ness is often constrained by sparse or incomplete data in knowledge graphs and the limitations of embedding techniques. We propose ProLEA ( Profile Generation and Reasoning with LLMs for Entity Alignment) an entity alignment method that combines large language models (LLMs) with entity embed- dings. LLMs generate contextual profiles for entities based on their properties. Candidate entities identified by entity embedding techniques are then re-evaluated by the LLMs, using its background knowledge and the generated profile. A thresholding mechanism is introduced to resolve conflicts between LLMs predictions and embedding-based alignments. This method enhances alignment accuracy, robustness, and explainability, particularly for complex, het- erogeneous knowledge graphs. Furthermore, ProLEA is a generalized framework. Its profile generation and LLM-enhanced entity align- ment components can improve the performance of existing entity alignment models.

Entity Profile Generation and Reasoning with LLMs for Entity Alignment

Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving geq 89% balanced accuracy against human annotations and strongly aligned with human severity ratings (ρ geq 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.

Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions

Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, the extent to which LRMs can think in other languages is less studied. This is as important as answer accuracy for real-world applications since users may find the thinking trace useful for oversight only if expressed in their languages. In this work, we comprehensively evaluate two leading families of LRMs on our established benchmark XReasoning. Surprisingly, even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in the capability of thinking in non-English languages. Promoting models to reason in the user's language via prompt hacking enhances readability and oversight. This could gain user trust, but reduces answer accuracy, exposing an important trade-off. We further demonstrate that targeted post-training, even with just 100 instances, can mitigate this language mismatch, although accuracy is still degraded. Our results reveal the limited multilingual reasoning capabilities of current LRMs and suggest directions for future research. All code and datasets released at https://Anonymous.

When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy

Cross-lingual topic modeling (CLTM) is an essential task in the field of data mining and natural language processing, aiming to extract aligned and semantically coherent topics from bilingual corpora. Recent advances in cross-lingual neural topic models have widely leveraged bilingual dictionaries to achieve word-level topic alignment. However, two critical challenges remain in cross-lingual topic modeling, the topic mismatch issue and the degeneration of intra-lingual topic interpretability. Due to linguistic diversity, some translated word pairs may not represent semantically coherent topics despite being lexical equivalents, and the objective of cross-lingual topic alignment in CLTM can consequently degrade topic interpretability within intra languages. To address these issues, we propose a novel document-level prototype-based contrastive learning paradigm for cross-lingual topic modeling. Additionally, we design a retrieval-based positive sampling strategy for contrastive learning without data augmentation. Furthermore, we introduce ProtoXTM, a cross-lingual neural topic model based on document-level prototype-based contrastive learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on cross-lingual and mono-lingual benchmarks, demonstrating enhanced topic interpretability.

ProtoXTM: Cross-Lingual Topic Modeling with Document-Level Prototype-based Contrastive Learning

Recent computational work typically frames morphophonology as generating surface forms (SFs) from abstract underlying representations (URs) by applying phonological rules or constraints. This generative stance presupposes that every morpheme has a well-defined UR from which all allomorphs can be derived, a theory-laden assumption that is expensive to annotate, especially in low-resource settings. We adopt an alternative view. Allomorphs and their phonological variants are treated as the basic, observed lexicon, not as outputs of abstract URs. The modeling task therefore shifts from deriving SFs to selecting the correct SF, given a meaning and a phonological context. This discriminative formulation removes the need to posit or label URs and lets the model exploit the surface evidence directly.

Radical Allomorphy: Phonological Surface Forms without Phonology

Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test---evaluating on noisy source language data translated from the target language---and translate-train---training on noisy target language data translated from the source language---have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.

TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

We observed that Contrastive Language-Image Pretraining (CLIP) models struggle with real-world downstream tasks such as road traffic anomaly detection, due to their inability to effectively capture spatial and action relationships between objects within images. To address this, we compile and curate a dataset with 1M samples of images using language supervision provided by the common image caption dataset, in which each image is paired with subject-relationship-object descriptions emphasizing spatial and action interactions, and train a \textbf{S}patial and \textbf{A}ction relationship aware \textbf{CLIP} (\textbf{SA-CLIP}) model. We evaluated the proposed model on the Visual Spatial Reasoning (VSR) dataset and further verified its effectiveness on the Detection-of-Traffic-Anomaly (DoTA) dataset. Experiment results show that the proposed SA-CLIP demonstrates strong abilities in understanding spatial relationships while achieving good zero-shot performance on the traffic anomaly detection task.

SA-CLIP: Language Guided Image Spatial and Action Feature Learning

While powerful, large language models (LLMs) present significant fine-tuning challenges due to their size. Parameter-efficient fine-tuning (PEFT) methods like LoRA provide solutions, yet suffer from critical optimizer inefficiencies; notably basis redundancy in LoRA's B matrix when using AdamW, which fundamentally limits performance. We address this by optimizing the B matrix on the Stiefel manifold, imposing explicit orthogonality constraints that achieve near-perfect orthogonality and full effective rank. This geometric approach dramatically enhances parameter efficiency and representational capacity. Our Stiefel optimizer consistently outperforms AdamW across benchmarks with both LoRA and DoRA, demonstrating that geometric constraints are the key to unlocking LoRA's full potential for effective LLM fine-tuning.

Riemannian Optimization for LoRA on the Stiefel Manifold

Self‐report questionnaires have long been used to assess LLM personality traits, yet they fail to capture behavioral nuances due to biases and meta‐knowledge contamination. This paper proposes a novel multi‐observer framework for personality trait assessments in LLM agents that draws on informant‐report methods in psychology. Instead of relying on self-assessments, we employ multiple observer agents. Each observer is configured with a specific relational context (e.g., family member, friend, or coworker) and engages the subject LLM in dialogue before evaluating its behavior across the Big Five dimensions. We show that these observer‐report ratings align more closely with human judgments than traditional self‐reports and reveal systematic biases in LLM self-assessments. We also found that aggregating responses from 5 to 7 observers reduces systematic biases and achieves optimal reliability. Our results highlight the role of relationship context in perceiving personality and demonstrate that a multi-observer paradigm offers a more reliable, context-sensitive approach to evaluating LLM.

Downloads

Next from EMNLP 2025

Entity Profile Generation and Reasoning with LLMs for Entity Alignment

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES