The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluation. However, a critical challenge arises from inconsistencies between an LLM's internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the 'refusal gap' to define these discrepancies and presents a novel, refusal-aware red-teaming framework designed to automatically generate test cases that expose them. Our framework employs refusal probes, potentially leveraging the target model's hidden states, to detect internal model refusals and contrasts these with external safety evaluator judgments. This discrepancy signal then guides a red-teaming model to craft prompts that maximize the refusal gap. To further enhance test-case diversity and overcome sparse-reward challenges, we introduce a hierarchical curiosity-driven mechanism that rewards both refusal-gap maximization and topic exploration. Empirical results demonstrate that our method significantly outperforms existing RL-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
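The reward structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear probe, the count-based curiosity bonus, and all function names (`refusal_probe`, `refusal_gap`, `curiosity_bonus`, `red_team_reward`) and the weighting `beta` are hypothetical stand-ins for the components the abstract names.

```python
import numpy as np

def refusal_probe(hidden_state, w, b=0.0):
    """Hypothetical linear probe on a hidden state: P(model internally refuses)."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ w + b)))

def refusal_gap(p_internal_refusal, judge_says_refuse):
    """The gap is large when the probe disagrees with the external safety judge."""
    return abs(p_internal_refusal - (1.0 if judge_says_refuse else 0.0))

def curiosity_bonus(topic, topic_counts):
    """Count-based exploration bonus: rarely visited topics earn more reward."""
    topic_counts[topic] = topic_counts.get(topic, 0) + 1
    return 1.0 / np.sqrt(topic_counts[topic])

def red_team_reward(hidden_state, w, judge_says_refuse, topic, topic_counts, beta=0.1):
    """Combined signal: refusal-gap term plus a weighted topic-exploration term."""
    gap = refusal_gap(refusal_probe(hidden_state, w), judge_says_refuse)
    return gap + beta * curiosity_bonus(topic, topic_counts)

# Toy usage with a fixed hidden state and random probe weights.
rng = np.random.default_rng(0)
h, w = rng.normal(size=16), rng.normal(size=16)
counts = {}
r1 = red_team_reward(h, w, judge_says_refuse=False, topic="chemistry", topic_counts=counts)
r2 = red_team_reward(h, w, judge_says_refuse=False, topic="chemistry", topic_counts=counts)
# Revisiting the same topic shrinks the curiosity bonus, so r2 < r1,
# pushing the red-teaming model toward unexplored topics.
```

The intuition: the gap term is maximized by prompts where the model's internal state signals refusal but the external evaluator sees no violation (or vice versa), while the curiosity term keeps the generated test cases from collapsing onto a few high-reward topics.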