EMNLP 2025

November 06, 2025

Suzhou, China

Large language models (LLMs) are trained on massive datasets. However, these datasets often contain undesirable content, e.g., harmful text, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STAs) can successfully extract unlearned information from LLMs. In this work, we show that STAs can be an inadequate tool for auditing unlearning. Using common unlearning benchmarks (Who Is Harry Potter? and TOFU), we demonstrate that, in a strong-auditor setting, such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm and (2) whether the queried content was originally present in the training corpus. We also show that an STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long. This shows that STAs must be used carefully to effectively audit unlearning.
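To make the attack concrete, below is a minimal sketch of a soft token attack against a frozen causal LM, assuming a Hugging Face model. The model name, the number of soft tokens, the learning rate, and the target string are illustrative placeholders, not the paper's exact setup; the idea is simply that a handful of trainable input embeddings, optimized by gradient descent while the model stays frozen, can drive the model to emit a chosen target string.

```python
# Sketch of a soft token attack (STA): optimize a few trainable input
# embeddings ("soft tokens") so a frozen LLM emits a chosen target string.
# Hyperparameters and the target text below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper audits unlearned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # the model is frozen; only soft tokens train

num_soft_tokens = 5                 # the paper reports 1-10 can suffice
emb = model.get_input_embeddings()  # token-embedding layer
soft = torch.nn.Parameter(
    torch.randn(1, num_soft_tokens, emb.embedding_dim) * 0.02
)
opt = torch.optim.Adam([soft], lr=1e-3)

# Target the auditor wants elicited (e.g., supposedly unlearned content).
target_ids = tok(
    "Harry Potter studied at Hogwarts.", return_tensors="pt"
).input_ids
target_emb = emb(target_ids)

for step in range(500):
    opt.zero_grad()
    # Prepend the trainable soft-token embeddings to the target embeddings.
    inputs = torch.cat([soft, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at position i predict token i+1, so the slice starting at the
    # last soft-token position predicts the target tokens in order.
    pred = logits[:, num_soft_tokens - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    loss.backward()
    opt.step()
```

Because nothing in this objective depends on the target ever having been in the training corpus, driving the loss low enough elicits the string regardless of what was unlearned, which is precisely why the paper argues a strong auditor with this tool can "extract" almost anything.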

Downloads

  • Slides
  • Paper
  • Transcript (English, automatic)

