China

Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, TONEBANK and DEBATEMIX, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.

EMNLP 2025

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

steering

safety

language model

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Emotional reasoning is essential for improving human-AI interactions, particularly in mental health support and empathetic systems. However, current approaches, which primarily map sensory inputs to fixed emotion labels, fail to understand the intricate relationships between motivations, thoughts, and emotions, thereby limiting their ability to generalize across flexible emotional reasoning tasks. To address this, we propose a novel third-person appraisal agent that simulates human-like emotional reasoning through three phases: Primary Appraisal, Secondary Appraisal, and Reappraisal. In the Primary Appraisal phase, a third-person generator powered by a large language model (LLM) infers emotions based on cognitive appraisal theory. The Secondary Appraisal phase uses an evaluator LLM to provide feedback, guiding the generator in refining its predictions. The generator then uses counterfactual reasoning to adjust its process and explore alternative emotional responses. The Reappraisal phase utilizes reinforced fine-tuning (ReFT) by employing a reflective actor-critic framework to further enhance the model’s performance and generalization. This process uses reward signals and learns from appraisal trajectories without human annotations. Our approach outperforms baseline LLMs in various emotional reasoning tasks, demonstrating superior generalization and interpretability. To the best of our knowledge, this is the first cognition-based architecture designed to enhance emotional reasoning in LLMs, advancing AI towards human-like emotional understanding.

Third-Person Appraisal Agent: Simulating Human Emotional Reasoning in Text with Large Language Models

With the growing adoption of retrieval-augmented generation (RAG) systems, various attack methods have been proposed to degrade their performance. However, most existing approaches rely on unrealistic assumptions in which external attackers have access to internal components such as the retriever. To address this issue, we introduce a realistic black-box attack based on the RAG paradox, a structural vulnerability that emerges from the system’s effort to enhance trust by revealing both the retrieved documents and their sources to users. This transparency enables attackers to observe which sources are used and how information is phrased, allowing them to craft poisoned documents that are more likely to be retrieved and upload them to the identified sources. Moreover, as RAG systems directly provide retrieved content to users, these documents must not only be retrievable but also appear natural and credible to prevent users from questioning the search results. Unlike prior work that focuses solely on improving document retrievability, our attack method explicitly considers both retrievability and user trust in the retrieved content. Through extensive offline and online experiments, we demonstrate that our method significantly degrades system performance without internal access, while generating natural-looking poisoned documents.

The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems

Model NLP models are commonly trained (or fine-tuned) on datasets from untrusted platforms like HuggingFace, posing significant risks of data poisoning attacks. A practical yet underexplored challenge arises when such backdoors are discovered after model deployment, making retraining-required defenses less desirable due to computational costs and data constraints. In this work, we propose Guided Module Substitution (GMS), an effective retraining-free method based on guided merging of the victim model with a single proxy model. Specifically, GMS selectively replaces modules in the victim model based on a trade-off signal between utility and backdoor. GMS offers four desirable properties: (1) robustness to the choice and trustworthiness of the proxy model, (2) applicability under relaxed data assumptions, (3) stability across hyperparameters, and (4) transferability across different attacks. Extensive experiments on encoder models and decoder LLMs demonstrate the strong effectiveness of GMS. GMS significantly outperforms even the strongest defense baseline, particularly against challenging attacks like LWS.

Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution

Large language model (LLM) unlearning has demonstrated effectiveness in removing the influence of undesirable data (also known as forget data). Existing approaches typically assume full access to the forget dataset, overlooking two key challenges: (1) Forget data is often privacy-sensitive, rare, or legally regulated, making it expensive or impractical to obtain (2) The distribution of available forget data may not align with how that information is represented within the model. To address these limitations, we propose a ``Reveal-and-Release'' method to unlearn with self-generated data, where we prompt the model to reveal what it knows using optimized instructions. To fully utilize the self-generated forget data, we propose an iterative unlearning framework, where we make incremental adjustments to the model’s weight space with parameter-efficient modules trained on the forget data. Experimental results demonstrate that our method balances the tradeoff between forget quality and utility preservation.

Reveal and Release: Iterative LLM Unlearning with Self-generated Data

Generative recommendation has emerged as a promising paradigm that formulates the recommendations into a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on considering the sequential order of items and neglect to handle the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 30.0% and 24.8% in Recall@5 and NDCG@5 across four benchmark datasets. The code will be available upon acceptance.

Enhancing Time Awareness in Generative Recommendation

Fine-tuning large language models (LLMs) with local data is a widely adopted approach for organizations seeking to adapt LLMs to their specific domains. Given the shared characteristics in data across different organizations, the idea of collaboratively fine-tuning an LLM using data from multiple sources presents an appealing opportunity. However, organizations are often reluctant to share local data, making centralized fine-tuning impractical. Federated learning (FL), a privacy-preserving framework, enables clients to retain local data while sharing only model parameters for collaborative training, offering a potential solution. While fine-tuning LLMs on centralized datasets risks data leakage through next-token prediction, the iterative aggregation process in FL results in a global model that encapsulates generalized knowledge, which some believe protects client privacy. In this paper, however, we present contradictory findings through extensive experiments. We show that attackers can still extract training data from the global model, even using straightforward generation methods, with leakage increasing as the model size grows. Moreover, we introduce an enhanced attack strategy tailored to FL, which tracks global model updates during training to intensify privacy leakage. To mitigate these risks, we evaluate privacy-preserving techniques in FL, including differential privacy, regularization-constrained updates and adopting LLMs with safety alignment. Our results provide valuable insights and practical guidelines for reducing privacy risks when training LLMs with FL.

Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation

Compound AI (CAI) systems, also referred to as LLM Agents, combine LLMs with retrievers and tools to enable information-seeking applications in the real-world. Thus, ensuring these systems perform reliably is critical. However, traditional evaluation using benchmark datasets and aggregate metrics often fails to capture their true operational performance. This is because understanding the operational efficacy of these information-seeking systems requires the ability to probe their behavior across a spectrum of simulated scenarios to identify potential failure modes. Thus, we present a behavior-driven evaluation framework that generates test specifications - explicit descriptions of expected system behaviors in specific scenarios - aligned with real usage contexts. These test specifications serve as formal declarations of system requirements that are then automatically transformed into concrete test cases. Specifically, our framework operates in two phases: (1) generating diverse test specifications via submodular optimization over semantic diversity and document coverage of the tests, and (2) implementing these specifications through graph-based pipelines supporting both tabular and textual sources. Evaluations on QuAC & HybriDialogue datasets, across SoTA LLMs, reveal that our framework identifies failure modes missed by traditional metrics, demonstrating failure rates twice as high as human-curated datasets.

Evaluating Compound AI Systems through Behaviors, Not Benchmarks

Fact-checking real-world claims, particularly numerical claims, is inherently complex that require multistep reasoning and numerical reasoning for verifying diverse aspects of the claim. Although large language models (LLMs) including reasoning models have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand nuance of numerical aspects, and are also susceptible to the overthinking issue, where the model is unable to contextualize diverse information resulting in misinterpretation and backtracking of reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VerifierFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the overthinking issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach is **1.8x** more efficient than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://anonymous.4open.science/r/VerifierFC-B26A.

Think Right, Not More: Test-Time Scaling for Numerical Claim Verification

Political campaigns make increasing use of targeted strategies to influence voters on social media. The analysis of coordinated behaviour allows to determine communities of users that exhibit the same patterns of behaviours. While such analysis is generally performed on static networks, recent extensions to the temporal dimension allowed to highlight users that changed community over time. This may open up new possibilities to quantitatively study influence in social networks. As a first step towards that goal, we set out to analyze the messages users are exposed to and comparing users that changed community with the rest. Our findings show 54 statistically significant linguistic differences, and analyses on the effectiveness of the use of persuasion techniques show that few of them, i.e. loaded language, exaggeration and minimisation, doubt and flag-waving seem to be the most effective for the dataset we studied, tweets on the UK 2019 elections.

Insights into using temporal coordinated behaviour to explore connections between social media posts and influence

The quality of a conversation goes beyond the individual quality of each reply, and instead emerges from how these combine into interactional patterns that give the conversation its distinctive overall "shape". However, there is no robust automated method for comparing conversations in terms of their overall interactional dynamics. Such methods could enhance the analysis of conversational data and help evaluate conversational agents more holistically. In this work, we introduce a similarity measure for comparing conversations with respect to their dynamics. We design a validation framework for testing the robustness of the metric in capturing differences in conversation dynamics and for assessing its sensitivity to the topic of the conversations. Finally, to illustrate the measure's utility, we use it to analyze conversational dynamics in a large online community, bringing new insights into the role of situational power in conversations.

Downloads

Next from EMNLP 2025

Third-Person Appraisal Agent: Simulating Human Emotional Reasoning in Text with Large Language Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES