EMNLP 2025

November 05, 2025

Suzhou, China


The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluation. However, a critical challenge arises from inconsistencies between an LLM's internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the 'refusal gap' to define these discrepancies and presents a novel, refusal-aware red teaming framework designed to automatically generate test cases that expose them. Our framework employs refusal probes, potentially leveraging the target model's hidden states, to detect internal model refusals and contrasts these with external safety evaluator judgments. This discrepancy signal then guides a red-teaming model to craft prompts that maximize the refusal gap. To further enhance test case diversity and overcome sparse reward challenges, we introduce a hierarchical curiosity-driven mechanism that rewards both refusal gap maximization and topic exploration. Empirical results demonstrate that our method significantly outperforms existing RL-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
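The abstract does not give formal definitions, so the following is only a minimal sketch of how a refusal-gap reward with a topic-exploration bonus might be wired together. Everything named here is a hypothetical illustration rather than the authors' method: the linear probe over a hidden-state vector, the use of an absolute difference as the gap, the count-based novelty bonus standing in for the hierarchical curiosity mechanism, and the weight `beta`.

```python
import math
from collections import Counter


def probe_refusal_prob(hidden_state, weights, bias):
    """Hypothetical linear refusal probe: sigmoid(w . h + b) over one hidden-state vector."""
    z = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def refusal_gap(p_refuse, p_unsafe):
    """Assumed gap definition: disagreement between the internal refusal
    probability and the external evaluator's 'unsafe' probability."""
    return abs(p_refuse - p_unsafe)


def novelty_bonus(topic, topic_counts):
    """Assumed count-based curiosity term: rarely visited topics earn more."""
    return 1.0 / math.sqrt(1 + topic_counts[topic])


def red_team_reward(p_refuse, p_unsafe, topic, topic_counts, beta=0.1):
    """Combine the gap signal with the exploration bonus, then record the visit."""
    reward = refusal_gap(p_refuse, p_unsafe) + beta * novelty_bonus(topic, topic_counts)
    topic_counts[topic] += 1
    return reward


if __name__ == "__main__":
    counts = Counter()
    # The probe says the model is refusing, but the evaluator sees nothing unsafe.
    p_refuse = probe_refusal_prob([2.0, 0.5], [1.0, 0.4], 0.0)
    first = red_team_reward(p_refuse, 0.1, "chemistry", counts)
    second = red_team_reward(p_refuse, 0.1, "chemistry", counts)
    print(first > second)  # the repeated topic earns a smaller bonus
```

In this sketch a red-teaming policy would receive `red_team_reward` as its scalar training signal. The decaying bonus is one simple way to keep the reward from being sparse when large gaps are rare, since unexplored topics still pay out a little.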


