Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.
Oversensitivity—where language models defensively reject benign prompts—not only disrupts user interactions but also obscures the boundaries between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model’s unique behavior. Building on this approach, we construct OverBench, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 26 models. OverBench provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.