In this work, we investigate how multimodal large language models (MLLMs) reconcile memorized world knowledge with visual input. Understanding this balance is essential for building reliable models that can correctly choose between conflicting sources of information. To study this, we introduce Visual CounterFact, a dataset of realistic visual counterfactuals targeting familiar attributes such as object color and size. These examples violate learned priors while preserving visual plausibility, enabling precise comparisons between perception and memory. Using this dataset, we find that MLLMs often default to perception, even when prompted to retrieve general knowledge. In these cases, performance on knowledge-based prompts drops significantly, suggesting that models are overly influenced by visual inputs even when the question targets memorized facts. By analyzing the forward pass, we observe that model predictions initially reflect stored priors, then transition to visually grounded answers in mid-to-late layers. This transition is often unstable, with models flipping between the two sources of information across layers. To control this behavior, we introduce Pixels Versus Priors steering vectors, which shift model behavior toward either world knowledge priors or visual input. These activation-level interventions produce significant attention shifts toward or away from the image, depending on the steering direction. Our findings offer a new framework for interpreting and controlling how memory and perception interact in multimodal models.
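To make the activation-level intervention concrete, here is a minimal sketch of one common way such steering vectors are built: a difference-of-means direction between hidden states from the two behavioral conditions (prior-consistent vs. counterfactual-visual answers), added to the residual stream at inference with a signed scaling coefficient. The function names, the layer choice, and the coefficient `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def steering_vector(acts_prior, acts_visual):
    """Difference-of-means direction between two conditions.

    acts_prior, acts_visual: (n_examples, d_model) hidden states
    collected at one layer under each condition (hypothetical setup).
    """
    return acts_visual.mean(axis=0) - acts_prior.mean(axis=0)

def apply_steering(hidden, v, alpha):
    """Shift hidden states along v. With the convention above,
    alpha > 0 steers toward visual input, alpha < 0 toward priors."""
    return hidden + alpha * v

# Toy demonstration with synthetic activations (d_model = 8).
rng = np.random.default_rng(0)
prior_acts = rng.normal(0.0, 1.0, size=(32, 8))   # "memory" condition
visual_acts = rng.normal(0.5, 1.0, size=(32, 8))  # shifted "visual" cluster

v = steering_vector(prior_acts, visual_acts)
h = rng.normal(size=(1, 8))                # a hidden state to intervene on
h_visual = apply_steering(h, v, alpha=2.0)   # push toward perception
h_memory = apply_steering(h, v, alpha=-2.0)  # push toward stored priors
```

In practice the shift would be applied inside the model (e.g., via a forward hook on a transformer layer) rather than on detached arrays, but the arithmetic of the intervention is exactly this one-line addition.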