China

Despite extensive efforts to align large language models (LLMs) with human values and safety rules, jailbreak attacks that exploit specific vulnerabilities continue to emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource-intensive and may fail to ensure consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses the desired safety property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes that prioritize either utility or safety, which addresses the challenge of differing model capacities. The output token is then sampled from a new distribution that combines the distributions of both models. Experimental results show that SSD successfully equips the large model with the desired safety property while allowing the model to remain helpful on benign queries. Furthermore, SSD reduces inference time, thanks to the speculative sampling design.

EMNLP 2025

Speculative Safety-Aware Decoding

safety and alignment

transfer

poster
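The Speculative Safety-Aware Decoding abstract above describes three ingredients: a match ratio between two models, a switch between decoding schemes, and a combined output distribution. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. The function names, the linear mixture, the threshold, and the mixing weights are all hypothetical, and the sketch assumes that a low match ratio signals elevated jailbreak risk, which the abstract does not state.

```python
def match_ratio(draft_tokens, verifier_tokens):
    """Fraction of drafted positions where both token sequences agree."""
    if not draft_tokens:
        return 0.0
    matches = sum(d == v for d, v in zip(draft_tokens, verifier_tokens))
    return matches / len(draft_tokens)


def choose_alpha(ratio, threshold=0.5, alpha_utility=0.2, alpha_safety=0.8):
    """Pick the weight given to the small (safety-aligned) model.

    ASSUMPTION: low agreement is treated as a risk signal, so the scheme
    leans on the small model when the ratio falls below the threshold.
    All numeric values here are illustrative.
    """
    return alpha_safety if ratio < threshold else alpha_utility


def combine(p_large, p_small, alpha):
    """Linear mixture of two next-token distributions, renormalized.

    ASSUMPTION: a linear mixture is used for illustration only; the
    abstract says the distributions are combined but not how.
    """
    mixed = [(1 - alpha) * pl + alpha * ps for pl, ps in zip(p_large, p_small)]
    total = sum(mixed)
    return [m / total for m in mixed]
```

For instance, `choose_alpha(match_ratio(draft, verified))` would yield the mixing weight for the next decoding window, and `combine` would then produce the distribution from which the output token is sampled.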

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Personalizing large language models (LLMs) is important for aligning outputs with diverse user preferences, yet existing methods struggle with flexibility and generalization. We propose CoPL (Collaborative Preference Learning), a graph-based collaborative filtering framework that models user-response relationships to enhance preference estimation, particularly in sparse annotation settings. By integrating a mixture of LoRA experts, CoPL efficiently fine-tunes LLMs while dynamically balancing shared and user-specific preferences. Additionally, an optimization-free adaptation strategy enables generalization to unseen users without fine-tuning. Experiments on TL;DR, UltraFeedback-P, and PersonalLLM datasets demonstrate that CoPL outperforms existing personalized reward models, effectively capturing both common and controversial preferences, making it a scalable solution for personalized LLM alignment.

CoPL: Collaborative Preference Learning for Personalizing LLMs

Semi-supervised text classification (SSTC) aims to train text classification models with few labeled data and massive unlabeled data. Existing studies develop effective pseudo-labeling methods, but they can struggle with unlabeled data that have imbalanced classes mismatched with the labeled data, making the pseudo-labeling biased towards majority classes, resulting in catastrophic error propagation. We believe it is crucial to explicitly estimate the overall class distribution, and use it to calibrate pseudo-labeling to constrain majority classes. To this end, we formulate the pseudo-labeling as an optimal transport (OT) problem, which transports the unlabeled sample distribution to the class distribution. With a memory bank, we dynamically collect both the high-confidence pseudo-labeled data and true labeled data, thus deriving reliable (pseudo-) labels for class distribution estimation. Empirical results on 3 commonly used benchmarks demonstrate that our model is effective and outperforms previous state-of-the-art methods.

Calibrating Pseudo-Labeling with Class Distribution for Semi-supervised Text Classification
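The abstract above frames pseudo-labeling as an optimal transport problem that moves the unlabeled sample distribution onto an estimated class distribution. One common way to solve such a problem is entropic regularization with Sinkhorn iterations; the sketch below uses that technique as an assumption, since the abstract does not name a solver. The function name, `eps`, and `n_iters` are hypothetical, and the class distribution is passed in directly rather than estimated from a memory bank.

```python
def sinkhorn_pseudo_labels(probs, class_dist, n_iters=50, eps=0.5):
    """Assign pseudo-labels so that class usage follows class_dist.

    probs:      list of per-sample class probability lists (n x k)
    class_dist: target class proportions (length k, sums to 1)
    Returns (labels, plan), where plan is the transport plan.
    """
    n, k = len(probs), len(class_dist)
    # Kernel of the entropic OT problem: exp(log p / eps) = p ** (1 / eps).
    plan = [[max(p, 1e-12) ** (1.0 / eps) for p in row] for row in probs]
    for _ in range(n_iters):
        # Rescale each column so its mass matches the target class share.
        for j in range(k):
            col_sum = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= class_dist[j] / col_sum
        # Rescale each row so every sample carries mass 1 / n.
        for i in range(n):
            row_sum = sum(plan[i])
            for j in range(k):
                plan[i][j] *= (1.0 / n) / row_sum
    labels = [max(range(k), key=lambda j: plan[i][j]) for i in range(n)]
    return labels, plan
```

With predictions that all favor one class, a plain argmax would label every sample with that class, whereas the transport plan is constrained to spread mass according to `class_dist`, which is the calibration effect the abstract describes.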

AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models. In contrast to prior work, which was blocked by modern moderation systems, achieved only partial removal of safeguards, or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attacks and potentially defenses in the input and weight spaces. Not only are current models vulnerable, but more recent ones also appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

Continual learning (CL) is essential for deploying large language models (LLMs) in dynamic real-world environments without the need for costly retraining. Recent model merging-based methods have attracted significant attention, but they still struggle to effectively manage the trade-off between learning new knowledge and preventing forgetting, a challenge largely stemming from a suboptimal number of merges and merging frequency. In this paper, we introduce Adaptive Iterative Model Merging (AimMerging), a novel CL framework that utilizes learning and forgetting signals from the training trajectory to dynamically monitor the model's training status. Guided by dynamic monitoring, the training trajectory-guided merge controller adaptively determines the timing and frequency of iterative fusion, while the rehearsal-based knowledge fusion module computes the merging weights and executes the fusion. Comprehensive experiments on three CL benchmarks with various model sizes (from 770M to 13B) demonstrate that AimMerging achieves significant performance improvements over existing state-of-the-art methods, with average relative improvements of 80% and 59% on FWT and BWT, respectively. The source code is provided for reproducibility.

AIMMerging: Adaptive Iterative Model Merging Using Training Trajectories for Language Model Continual Learning

Long chain-of-thought (CoT) reasoning has recently attracted significant attention, with models such as DeepSeek-R1 achieving remarkable performance across various reasoning benchmarks. However, a common challenge for these models is the "overthinking" problem, which leads to excessive intermediate steps and diminished inference efficiency. While numerous efforts have targeted reducing the number of generated tokens, these frequently encounter an inherent trade-off: gains in efficiency often come at the cost of degraded performance. To overcome such challenges, we introduce the Multi-Turn Intervention Sampling Framework (MuTIS). Our framework leverages multi-turn interventions within rollouts to produce high-quality, concise reasoning chains. It fine-tunes reasoning models through reinforcement learning, demonstrably surpassing the previously described accuracy-efficiency trade-off. Through extensive experiments on challenging mathematical reasoning benchmarks, our approach achieves a substantial 11.3% improvement in accuracy while concurrently reducing token utilization by an average of 60.1%. Code, data, and models will be fully open-sourced.

MuTIS: Enhancing Reasoning Efficiency through Multi Turn Intervention Sampling in Reinforcement Learning

Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) is critical for adapting to diverse downstream tasks with minimal computational cost. We propose **Di**rectional-**S**VD **Lo**w-**R**ank **A**daptation (DisLoRA), a novel PEFT framework that leverages singular value decomposition (SVD) to decompose pretrained weight matrices into orthogonal backbone and task-specific subspaces, enabling precise capture of task-specific directions (TSDs). By dynamically identifying TSDs and employing adaptive soft orthogonal regularization with a mean-normalization mechanism, DisLoRA balances task-specific and orthogonal losses without manual tuning, ensuring robust training stability. Extensive experiments on GLUE and Commonsense Reasoning benchmarks demonstrate that DisLoRA surpasses established PEFT methods, including LoRA, PiSSA, DoRA, LoRA-Dash, and SORSA. DisLoRA achieves superior performance on multiple individual GLUE datasets, surpassing baselines by up to 10.28% on SST-2 and 3.28% on CoLA, and consistently attains higher average accuracy than baselines across Commonsense Reasoning tasks, with a maximum gain of 3.1%. These results demonstrate DisLoRA’s effectiveness for efficient, high-performing LLM adaptation to domain-specific tasks while preserving generalization.

DisLoRA: Task-specific Low-Rank Adaptation via Orthogonal Basis from Singular Value Decomposition

Sign language translation remains a challenging task due to the scarcity of large-scale, sentence-aligned datasets. Prior work has focused on various feature extraction and architectural changes to support neural machine translation for sign languages. In this work, we propose a training scheme inspired by linguistic-template-based sentence generation. Comparing translations on two sign language datasets, How2Sign and iSign, we show that a simple transformer-based encoder-decoder architecture outperforms the prior art when template-generated sentence pairs are included in training. We achieve BLEU-4 score improvements from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, surpassing prior state-of-the-art methods. These results demonstrate the effectiveness of template-driven synthetic supervision in low-resource sign language settings.

PoseStitch-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation

Large language models (LLMs) excel at few-shot learning, but their ability to reject out-of-distribution examples remains under-explored. We study this challenge under the setting of few-shot open-set classification, where a model must not only classify examples from a small set of seen classes but also reject unseen ones at inference time. This setting is more realistic and challenging than traditional closed-set supervised learning, requiring both fine-grained classification and robust rejection. We show that, for small LLMs, neither chain-of-thought (CoT) prompting nor supervised fine-tuning (SFT) alone is sufficient to generalise reliably, particularly when class semantics are anonymised. We introduce Wasserstein GFN (W-GFN), a novel amortised Generative Flow Network approach that uses latent trajectories to approximate the Bayesian posterior. With as few as 4 examples per class, W-GFN substantially improves performance, enabling LLaMA 3.2B to achieve ≥ 80% of the performance of LLaMA 3.3 70B on complex datasets, despite being approximately 23 times smaller. This highlights the importance of reasoning-aware approaches for robust open-set few-shot learning.

Few-Shot Open-Set Classification via Reasoning-Aware Decomposition

LLMs are evolving into assistants that leverage tools, significantly expanding their capabilities but also introducing critical safety risks. Current models exhibit notable vulnerabilities, particularly in maintaining safety during multi-step tool interactions and in scenarios involving indirect harm. This paper introduces **ToolSafety**, a safety fine-tuning dataset designed to address these limitations. ToolSafety comprises 5,668 direct harm samples, 4,331 indirect harm samples, and 4,331 multi-step samples. Key features include support for multi-step safety through synthesized trajectories and realistic, context-aware sample generation. We fine-tuned LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct using ToolSafety. Experimental results demonstrate that these models effectively maintain safety in multi-step and indirect harm scenarios. Further analysis into superficial alignment across different decoding strategies, languages, and jailbreak prompts indicates that while some risks persist, the issue is less severe than in multi-step settings. Overall, our approach significantly improves safety across various scenarios with small impact on helpfulness, positioning ToolSafety as a valuable resource for building safer tool-using AI systems.

ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations

Comprehensive evaluation of large language models (LLMs) typically requires large-scale benchmarks, which are costly in terms of both data annotation and the computational resources needed for evaluation. To mitigate these challenges, we propose an efficient evaluation framework that selects a question subset based on pre-tested results, thereby reducing costs. We formulate subset selection as an optimization task, solved using optimal random sampling and simulated annealing algorithms. We compare our approach with prior clustering-based methods and assess their reliability in terms of score accuracy. Additionally, we perform semantic analysis and use Wasserstein distance to evaluate whether the selected subsets preserve the semantic information of the original benchmark. Experimental results show that our method outperforms previous approaches in terms of reliability, as measured by the L2 norm. Our study provides an optimization-based perspective on balancing evaluation efficiency and reliability in LLM assessments, while revealing the relationship between optimization methods and semantic retention.
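The subset-selection idea in the abstract above can be illustrated with a small simulated annealing loop. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes `results` is a matrix of per-model, per-question scores from pre-testing, measures reliability as the L2 norm between subset accuracy and full-benchmark accuracy, and uses a hypothetical linear cooling schedule. The names `subset_error` and `anneal_subset` are illustrative, and `k` must be smaller than the number of questions.

```python
import math
import random


def subset_error(results, subset):
    """L2 norm between per-model accuracy on the subset and on all questions."""
    err = 0.0
    for row in results:
        full_acc = sum(row) / len(row)
        sub_acc = sum(row[i] for i in subset) / len(subset)
        err += (full_acc - sub_acc) ** 2
    return math.sqrt(err)


def anneal_subset(results, k, steps=2000, t0=0.1, seed=0):
    """Search for k question indices whose scores track the full benchmark."""
    rng = random.Random(seed)
    n = len(results[0])
    current = rng.sample(range(n), k)  # random starting subset
    cur_err = subset_error(results, current)
    best, best_err = list(current), cur_err
    for step in range(steps):
        # Hypothetical linear cooling; the small offset avoids division by zero.
        temp = t0 * (1 - step / steps) + 1e-9
        outside = [i for i in range(n) if i not in current]
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)  # swap one question
        cand_err = subset_error(results, candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        accept = cand_err <= cur_err or rng.random() < math.exp((cur_err - cand_err) / temp)
        if accept:
            current, cur_err = candidate, cand_err
            if cur_err < best_err:
                best, best_err = list(current), cur_err
    return sorted(best), best_err
```

Calling `anneal_subset(results, k=3)` returns the selected question indices and their error; because the best subset seen so far is tracked separately, the result is never worse than the random starting subset.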
