China

EMNLP 2025

Predicting Language Models’ Success at Zero-Shot Probabilistic Prediction

llms

uncertainty

calibration

Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have confidence that an LLM will provide high-quality predictions for their particular task? To address this question, we conduct a large-scale empirical study of LLMs' zero-shot predictive capabilities across a wide range of tabular prediction tasks. We find that LLMs' performance is highly variable, both on tasks within the same dataset and across different datasets. However, when the LLM performs well on the base prediction task, its predicted probabilities become a stronger signal for individual-level accuracy. Then, we construct metrics to predict LLMs' performance at the task level, aiming to distinguish between tasks where LLMs may perform well and where they are likely unsuitable. We find that some of these metrics, each of which is assessed without labeled data, yield strong signals of LLMs' predictive performance on new tasks.
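To make the idea of predicted probabilities as a signal for accuracy concrete, here is a minimal sketch of one common calibration measure, binned expected calibration error (ECE), for binary predictions. The abstract does not say which metrics the paper uses, so the choice of ECE, the function name `expected_calibration_error`, and the `n_bins` parameter are assumptions made purely for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions (illustrative, not the paper's metric).

    probs  : predicted probability of the positive class, each in [0, 1]
    labels : true outcomes, each 0 or 1
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    # Per bin: [sum of predicted probabilities, sum of labels, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1

    total = len(probs)
    ece = 0.0
    for prob_sum, label_sum, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        mean_prob = prob_sum / count
        observed_rate = label_sum / count
        # Weight each bin's calibration gap by its share of the data.
        ece += (count / total) * abs(mean_prob - observed_rate)
    return ece
```

A value near 0 means the stated probabilities match observed frequencies within each bin. Note that this computation needs labels, unlike the label-free task-level metrics the abstract describes.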

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remained moderate to strong across demographics and tasks: gender (rho >= 0.94) in co-reference resolution, and for age (rho >= 0.98), religion (rho >= 0.69), etc., in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may be reliable ways to prevent propagation of biases to downstream tasks.

Bias after Prompting: Persistent Discrimination in Large Language Models
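The abstract above reports correlations as rho values. Assuming these are Spearman rank correlations (the abstract does not name the statistic), the sketch below shows how such a value could be computed between intrinsic bias scores and bias scores after prompt adaptation. The function names are hypothetical and no data from the paper is used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks matter, rho is close to 1 whenever the items that score as most biased before adaptation are also the ones that score as most biased afterwards, regardless of the absolute magnitudes.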

Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a novel post-training Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.

CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
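As background for the residual vector quantization component named above, the following sketch shows plain two-stage RVQ on a single vector: each stage quantizes whatever the previous stages left unexplained. It is not CARVQ itself; it omits the corrective adaptor and the grouping, and the codebooks are hand-picked toy values rather than learned ones. All function names are hypothetical.

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the codeword closest to vec in squared L2."""
    best_idx = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best_idx, codebook[best_idx]


def rvq_encode(vec, codebooks):
    """Encode vec as one codeword index per stage."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx, word = nearest(codebook, residual)
        codes.append(idx)
        # The next stage only sees the error left by this stage.
        residual = [r - w for r, w in zip(residual, word)]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + w for o, w in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values for illustration, not learned).
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fine = [[-0.25, 0.25], [0.25, -0.25], [0.0, 0.0]]

codes = rvq_encode([1.25, 0.75], [coarse, fine])
reconstruction = rvq_decode(codes, [coarse, fine])
```

Only the small integer indices in `codes` need to be stored per vector, which is where the memory saving comes from; the codebooks are shared across all vectors.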

Prior work has shown that a significant driver of performance in reasoning models is their ability to reason and self-correct. A distinctive marker in these reasoning traces is the token wait, which often signals reasoning behavior such as backtracking. Despite being such a complex behavior, little is understood of exactly why models do or do not decide to reason in this particular manner, which limits our understanding of what makes a reasoning model so effective. In this work, we address the question of whether a model's latents preceding wait tokens contain relevant information for modulating the subsequent reasoning process. To this end, we train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a novel latent attribution patching technique for the crosscoder setting. Using our technique, we locate a small set of features relevant for promoting/suppressing wait tokens' probabilities. Finally, through a targeted series of experiments analyzing max-activating examples and causal interventions, we show that many of our identified features indeed are relevant for the reasoning process and give rise to different types of reasoning patterns such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.

Internal states before wait modulate reasoning patterns

LLM-as-a-judge evaluation metrics have gained popularity as an inexpensive and performant substitute for human evaluation. However, we find that the meta-evaluation setting in which the reliability of these LLM evaluators is established is substantially different from their use in model development. To address this, we propose a new meta-evaluation methodology that more closely aligns with practice by examining evaluators' ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that LLM evaluator correlations with human judgments fall from ~0.8 to ~0.3 when evaluated in realistic settings, showcasing a key limitation of current norms. Equipped with this better methodology, we next analyze the impact that the choice of reference model has on LLM-as-a-judge evaluator performance. Our meta-evaluation strategy demonstrates that single-reference evaluators only perform well at ranking test systems that fall within particular capability ranges, even if the standard meta-evaluation reports high overall correlation. Taken together, our analysis shows critical issues with current LLM (meta-)evaluation, and we recommend avenues for improvement.

The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators

The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: (1) ReviewEval, a comprehensive evaluation framework for AI‐generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and identifies the degree of constructiveness and adherence to reviewer guidelines; and (2) ReviewAgent, an LLM‐based review generation agent featuring a novel alignment mechanism to tailor feedback to target conferences and journals, along with a self‐refinement loop that iteratively optimizes its intermediate outputs and an external improvement loop using ReviewEval to improve upon the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews respectively. Further, it boosts analytical depth by 3.97% and 12.73%, and enhances adherence to guidelines by 10.11% and 47.26%, respectively. This paper establishes essential metrics for AI‐based peer review and substantially enhances the reliability and impact of AI‐generated reviews in academic research.

ReviewEval: An Evaluation Framework for AI-Generated Reviews

Retrieval-Augmented Generation (RAG) enhances language models (LMs) by retrieving and incorporating relevant information to address the user’s request. However, existing embedding-based semantic relevance measurements are ineffective and neural retrievers require expensive fine-tuning. To address these limitations, we propose You Only Use Reactive Attention slice (YOURA), an attention-based, training-free, fine-tuning-free technique to quantify the semantic relevance of two sentences. YOURA leverages a novel retrieval heuristic called reaction score, which measures how the LM's self-attention holistically "reacts" to the appended query and greedily retrieves the most reactive sentences. In addition, we propose a sentence extraction algorithm to facilitate the context preprocessing by splitting the context token sequence and mapping the sequences to sentences efficiently. Evaluation on three open-source pre-trained LMs across six tasks - single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic, and needle-in-a-haystack task - demonstrates that our framework improves the QA task accuracy by up to 15% and inference throughput by up to 30%.

You Only Use Reactive Attention Slice When Retrieving From Long Context

Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs outperform the base models by a significant margin, narrowing the gap to their LLM counterparts.

Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring

Compositional generalization benchmarks seek to assess whether learning agents can successfully combine familiar concepts in novel ways. COGS (Kim & Linzen 2020, COGS, EMNLP) provides a suite of such tasks in the area of interpretive semantics (mapping sentences to logical forms). A noteworthy finding for COGS is that model performance varies widely across tasks. In this paper, we argue that these performance differences reflect deep properties of these tasks. We focus on two COGS tasks: an easy task (models are generally successful) and a hard task (no present-day models get any traction). Using both experiments and conceptual analysis, we argue that the easy task requires only a single distributional generalization that is well-supported by the training data, whereas the hard task involves a learning target that is ambiguous or even contradicted by the training data. We additionally argue that pretraining can disambiguate the hard task without compromising the goal of testing compositional generalization. Overall, our findings offer practical guidance to designers of compositional generalization benchmarks and also yield new insights into the nature of compositionality itself.

Distinguishing fair from unfair compositional generalization tasks

Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of the autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationships across the tokens, our method mitigates the error accumulation issue within the sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model’s ability to produce semantically consistent results.

Visual Self-Refinement for Autoregressive Models

Planning in modern LLM agents relies on using the LLM as an internal world model, acquired during pretraining. However, existing agent designs lack the capacity to effectively assimilate new observations into dynamic updates of the world model. This inability means that reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.
