China

A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts.
Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating that it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. We explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models (Diffusion Classifier, CLIP, and ViLT) on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at [link redacted for anonymity].

EMNLP 2025

Evaluating Compositional Generalisation in VLMs and Diffusion Models

compositional generalisation

diffusion models

vision-language models

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

This paper presents a systematic evaluation of nearest neighbors in a range of semantic spaces across textual and visual modalities.
Focusing on the abstractness-concreteness continuum, we define an overlap measure to compare concepts differing in their linguistic vs. perceptual nature, and indeed find that alignment is primarily determined by modality and concreteness: Models from the same modality show stronger alignment than cross-modal models, and spaces of concrete concepts show stronger alignment than those of abstract ones.

Evaluating Textual and Visual Semantic Neighborhoods of Abstract and Concrete Concepts
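
The abstract above compares nearest neighbors across semantic spaces through an overlap measure but does not spell out its definition. As a rough illustration only, the sketch below assumes one plausible reading: for each concept, take its top-k cosine neighbors in two spaces and report the average fraction of shared neighbors. The function names `top_k_neighbors` and `mean_neighbor_overlap`, the choice of cosine similarity, and the value of `k` are all hypothetical and are not taken from the paper.

```python
import numpy as np


def top_k_neighbors(vectors: np.ndarray, k: int) -> list[set[int]]:
    """For each row, return the indices of its k nearest rows by cosine similarity."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Exclude each concept from its own neighbor list.
    np.fill_diagonal(sims, -np.inf)
    order = np.argsort(-sims, axis=1)[:, :k]
    return [set(row.tolist()) for row in order]


def mean_neighbor_overlap(space_a: np.ndarray, space_b: np.ndarray, k: int) -> float:
    """Average fraction of shared top-k neighbors (assumed measure, not the paper's).

    Row i of both arrays must describe the same concept; the two spaces may
    have different dimensionality (e.g. a textual and a visual model).
    """
    neighbors_a = top_k_neighbors(space_a, k)
    neighbors_b = top_k_neighbors(space_b, k)
    return float(np.mean([len(a & b) / k for a, b in zip(neighbors_a, neighbors_b)]))
```

Under this assumed definition, a score near 1 means the two spaces organise a concept's neighborhood similarly, and the score can be averaged separately over concrete and abstract concepts to compare them.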

In this work, we investigate the relationship between the quality of explanations produced by different models and the amount of implicit knowledge they are able to provide beyond the input. We approximate explanation quality via accuracy on a downstream task with a standardized pipeline (GEISER) and study its correlation with three different association measures, each capturing different aspects of implicitness, defined as a combination of relevance and novelty. We conduct experiments with three SOTA LLMs on four tasks involving implicit knowledge, with explanations either confirming or contradicting the correct label. Our results demonstrate that providing quality explanations consistently improves the accuracy of LLM predictions, even when the models are not explicitly trained to take explanations as input, and underline the correlation between implicit content delivered by the explanation and its effectiveness.

Explanations explained. Influence of Free-text Explanations on LLMs and the Role of Implicit Knowledge

Frames capture aspects of an issue that are emphasized in a debate by interlocutors and can help us understand how political language conveys different perspectives and ultimately shapes people's opinions. The Media Frame Corpus (MFC) is the most commonly used framework with categories and detailed guidelines for operationalizing frames. It is, however, focused on a few salient U.S. news issues, making it unclear how well these frames can capture news issues in other cultural contexts. To explore this, we introduce `FrameNews-PT`, a dataset of Brazilian Portuguese news articles covering political and economic news and annotate it within the MFC framework.
Through several annotation rounds, we evaluate the extent to which MFC frames generalize to the Brazilian debate issues. We further evaluate how fine-tuned and zero-shot models perform on out-of-domain data.
Results show that the 15 MFC frames remain broadly applicable with minor revisions of the guidelines. However, some MFC frames are rarely used, and novel news issues are analyzed using general 'fallback' frames. We conclude that cross-cultural frame use requires careful consideration.

Generalizability of Media Frames: Corpus creation and analysis across countries

This study addresses the problem of hallucinated span detection in the outputs of large language models. This problem has received less attention than output-level hallucination detection despite its practical importance. Prior work has shown that attention often exhibits irregular patterns when hallucinations occur. Motivated by these findings, we extract features from the attention matrix that provide complementary views of the generation process, capturing (a) whether certain tokens are influential or ignored, (b) whether attention is biased toward specific subsets, and (c) whether a token is generated with reference to a narrow or broad context. These features are input to a Transformer-based classifier that performs sequence labelling to identify hallucinated spans. Experimental results indicate that the proposed method outperforms strong baselines on hallucinated span detection with longer input contexts, such as data-to-text and summarisation tasks.

Hallucinated Span Detection with Multi-View Attention Features

In this work, we explore the prediction of lexical complexity by combining supervised approaches and the use of large language models (LLMs). We first evaluate the impact of different prompting strategies (zero-shot, one-shot, and chain-of-thought) on the quality of the predictions, comparing the results with human annotations from the CompLex 2.0 corpus. Our results indicate that LLMs, and in particular gpt-4o, benefit from explicit instructions to better approximate human judgments, although some discrepancies remain. Moreover, a calibration approach that aligns LLM predictions with human judgments using a small amount of manually annotated data appears to be a promising way to improve the reliability of the annotations in a supervised scenario.

How Do Large Language Models Evaluate Lexical Complexity?

Semantic parsing of long documents remains challenging due to quadratic growth in pairwise composition and memory requirements. We introduce **Hierarchical Segment-Graph Memory (HSGM)**, a novel framework that decomposes an input of length $N$ into $M$ meaningful segments, constructs *Local Semantic Graphs* on each segment, and extracts compact *summary nodes* to form a *Global Graph Memory*. HSGM supports *incremental updates* (only newly arrived segments incur local graph construction and summary-node integration), while *Hierarchical Query Processing* locates relevant segments via top-$K$ retrieval over summary nodes and then performs fine-grained reasoning within their local graphs.

Theoretically, HSGM reduces worst-case complexity from $O(N^2)$ to $O\bigl(N\,k + (N/k)^2\bigr)$, with segment size $k \ll N$, and we derive Frobenius-norm bounds on the approximation error introduced by node summarization and sparsification thresholds. Empirically, on three benchmarks (long-document AMR parsing, segment-level semantic role labeling on OntoNotes, and legal event extraction), HSGM achieves *2–4× inference speedup*, *>60% reduction* in peak memory, and *≥95%* of baseline accuracy. Our approach unlocks scalable, accurate semantic modeling for ultra-long texts, enabling real-time and resource-constrained NLP applications.

HSGM: Hierarchical Segment-Graph Memory for Scalable Long-Text Semantics
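
To make the complexity claim in the HSGM abstract concrete, the sketch below plugs numbers into the stated bound $N\,k + (N/k)^2$ and compares it with the flat $N^2$ cost. This is plain arithmetic on the formula from the abstract with constant factors dropped, not the authors' implementation; the helper names are hypothetical. The segment size that balances the two terms, $k = (2N)^{1/3}$, comes from setting the derivative of the bound with respect to $k$ to zero and is an observation added here, not a setting reported in the abstract.

```python
def flat_cost(n: int) -> int:
    """Pairwise composition over all n tokens: n^2 (constants dropped)."""
    return n * n


def segmented_cost(n: int, k: float) -> float:
    """Stated bound: n*k local work plus (n/k)^2 work over summary nodes."""
    return n * k + (n / k) ** 2


def balancing_segment_size(n: int) -> float:
    """Continuous minimiser of n*k + (n/k)^2.

    d/dk [n*k + n^2 / k^2] = n - 2*n^2 / k^3 = 0  gives  k = (2n)^(1/3).
    """
    return (2 * n) ** (1 / 3)


if __name__ == "__main__":
    n = 10_000
    k = balancing_segment_size(n)
    print(f"flat cost:      {flat_cost(n):.3e}")
    print(f"balancing k:    {k:.1f}")
    print(f"segmented cost: {segmented_cost(n, k):.3e}")
```

At the balancing point the bound scales as $N^{4/3}$, which is why the gap to $N^2$ widens as documents get longer; in practice the segment size would also be constrained by where meaningful segment boundaries fall.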

Prior work has shown that presupposition in generated questions can introduce unverified assumptions, leading to inconsistencies in claim verification. Additionally, prompt sensitivity remains a significant challenge for large language models (LLMs), resulting in performance variance as high as **3–6%**. While recent advancements have reduced this gap, our study demonstrates that prompt sensitivity remains a persistent issue. To address this, we propose a structured and robust claim verification framework that reasons through presupposition-free, decomposed questions. Extensive experiments across multiple prompts, datasets, and LLMs reveal that even state-of-the-art models remain susceptible to prompt variance and presupposition. Our method consistently mitigates these issues, achieving up to a **2–5%** improvement.

If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition

Large Language Models (LLMs) have demonstrated remarkable generalization across diverse NLP tasks, yet they often produce outputs lacking semantic coherence due to insufficient grounding in structured linguistic knowledge. This paper proposes a novel method for injecting Frame Semantics into a pretrained LLaMA model using Low-Rank Adaptation (LoRA). Leveraging FrameNet (a rich resource of over 1,000 semantic frames), we construct a training corpus comprising structured triples of frame definitions, frame elements, and lexical units. Our method encodes these examples into the model via LoRA adapters and evaluates performance using zero-shot prompting for textual entailment and semantic role labeling (SRL) over FrameNet. Experimental results show that our adapted frame-aware LLM substantially outperforms the baseline across closed, open-ended, and multiple-choice prompts. Moreover, we observe significant improvements in SRL accuracy, demonstrating the efficacy of combining frame-semantic theory with parameter-efficient pretraining.

Injecting Frame Semantics into Large Language Models via Prompt-Based Fine-Tuning

As language models continue to scale, the demand for knowledge editing, a retraining-free knowledge update method, has increased. However, since knowledge editing directly alters token prediction probabilities acquired during pretraining, the probabilities may diverge from the empirical distribution. In this study, we analyze the impact of knowledge editing to compare the alignment between token prediction probabilities and task accuracy by calculating confidence calibration before and after knowledge editing. Our results reveal that, for tasks requiring semantic understanding, the range of increase in token prediction probabilities tends to be smaller than that of accuracy improvement, suggesting that knowledge editing methods lead to less confidence in prediction.

Knowledge Editing Induces Underconfidence in Language Models
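
The knowledge-editing abstract compares token prediction probabilities with task accuracy through confidence calibration, without naming the metric. A common choice is expected calibration error (ECE), so the sketch below assumes that metric and adds a signed variant to show the direction of the gap. The function name `calibration_gap`, the ten equal-width bins, and the signed score are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def calibration_gap(confidences, correct, n_bins: int = 10) -> tuple[float, float]:
    """Return (ECE, signed gap) over equal-width confidence bins.

    ECE is the bin-weighted mean of |accuracy - confidence|.
    The signed gap is the bin-weighted mean of (accuracy - confidence):
    positive values mean the model is underconfident, negative overconfident.
    """
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitising against the interior edges yields bin indices 0..n_bins-1.
    bins = np.digitize(conf, edges[1:-1])
    ece = 0.0
    signed = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        gap = hit[mask].mean() - conf[mask].mean()
        weight = mask.mean()  # fraction of predictions falling in this bin
        ece += weight * abs(gap)
        signed += weight * gap
    return float(ece), float(signed)
```

Computing this pair before and after an edit would show whether accuracy rose faster than confidence, which is the pattern the abstract describes as underconfidence.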

Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training.
This implies that enumerating and obtaining high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests.
Motivated by this practical need, we propose an analysis framework, built on a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions.
We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic) and discover the side effects of transfer learning.
Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential.
This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation.
