Recent advances in large language models (LLMs) have shown promise for feature engineering on tabular data, but concerns about their reliability persist, largely due to variability in their generated outputs. We introduce a multi-level diagnosis and evaluation framework to assess the robustness of LLMs in feature engineering, focusing on whether they identify the key variables, relationships, and decision boundaries needed to predict target classes across diverse domains. We demonstrate that LLM robustness varies significantly across datasets, and that high-quality LLM-generated features can improve prediction performance by up to 10.52%. This work opens a new direction for assessing and enhancing the reliability of LLM-driven feature engineering.
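To make the evaluation setup concrete, here is a minimal sketch (not the authors' code) of how the lift from an LLM-generated feature might be measured on a tabular classification task: train the same classifier with and without the generated feature and compare held-out accuracy. The dataset, column names, and the `income_per_age` feature are illustrative assumptions, not drawn from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy tabular dataset; in practice this would be a benchmark dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, 500),
    "income": rng.normal(50_000, 15_000, 500),
})
df["label"] = (df["income"] / df["age"] > 900).astype(int)

# Hypothetical feature an LLM might propose from the column
# descriptions, e.g. "income per year of age".
df["income_per_age"] = df["income"] / df["age"]

def evaluate(feature_cols):
    """Held-out accuracy of a fixed classifier on the given columns."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[feature_cols], df["label"], test_size=0.3, random_state=0
    )
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

base = evaluate(["age", "income"])
augmented = evaluate(["age", "income", "income_per_age"])
print(f"baseline={base:.3f}  with LLM feature={augmented:.3f}")
```

Repeating such a comparison across datasets and across multiple LLM generations for the same dataset is one way to expose the output variability the abstract highlights.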