EMNLP 2025

November 07, 2025

Suzhou, China

Large Language Models (LLMs) deployed as autonomous agents with tool access present unique safety challenges that extend beyond standalone model vulnerabilities. Existing red-teaming frameworks like AgentHarm use static prompts and hardcoded toolsets, limiting their applicability to custom production systems.

We introduce a dual-component automated red-teaming framework: AgentHarm-Gen generates adversarial tasks and evaluation functions tailored to arbitrary toolsets, while Red-Agent-Reflect employs iterative prompt refinement with self-reflection to develop progressively more effective attacks.
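As a rough illustration of iterative refinement with self-reflection, the control flow can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `refine_loop`, `Attempt`, `run_agent`, `score`, and `reflect` are hypothetical, and the agent, scorer, and reflection step are toy stand-ins so that the loop runs without any model or tool access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One round of the loop: the prompt tried, the agent output, and its score."""
    prompt: str
    response: str
    score: float


def refine_loop(
    task: str,
    run_agent: Callable[[str], str],
    score: Callable[[str], float],
    reflect: Callable[[str, List[Attempt]], str],
    max_iters: int = 5,
    threshold: float = 1.0,
) -> List[Attempt]:
    """Try a prompt, score the result, and revise the prompt from the history.

    All callables are hypothetical placeholders. In a real harness,
    `run_agent` would run the agent under test, `score` would be a
    task-specific evaluation function, and `reflect` would ask a
    red-teaming model to critique earlier attempts and propose a new prompt.
    """
    history: List[Attempt] = []
    prompt = task
    for _ in range(max_iters):
        response = run_agent(prompt)
        result = score(response)
        history.append(Attempt(prompt, response, result))
        if result >= threshold:
            break  # stop once the evaluation function reports success
        prompt = reflect(task, history)
    return history


# Toy stand-ins (assumptions for illustration only).
def toy_agent(prompt: str) -> str:
    # Pretend the agent only complies after the prompt has been revised twice.
    return "ok" if "[rev 2]" in prompt else "refused"


def toy_score(response: str) -> float:
    return 1.0 if response == "ok" else 0.0


def toy_reflect(task: str, history: List[Attempt]) -> str:
    # A real reflection step would reason over `history`; here we only tag a revision number.
    return f"{task} [rev {len(history)}]"


history = refine_loop("placeholder task", toy_agent, toy_score, toy_reflect)
for attempt in history:
    print(attempt.prompt, "->", attempt.response, attempt.score)
```

The loop keeps the full attempt history so that the reflection step can condition on every earlier failure rather than only the most recent one; whether the framework described here does the same is not stated in the abstract.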

Evaluating across 115 harmful tasks (71 generated, 44 from AgentHarm) spanning 8 risk categories, our method achieves substantial improvements: up to 162% increase in attack success rate on o4-mini and 86% success on Gemini 2.5 Pro. Successful attacks systematically decompose adversarial objectives into benign-appearing sub-tasks that circumvent safety alignment, highlighting the need for agent-specific guardrails.
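For readers unfamiliar with the metrics, the short Python sketch below shows one common way to compute an attack success rate and a percentage increase over a baseline. It assumes that attack success rate is the fraction of tasks judged successful and that the reported increase is relative to a baseline rate; the abstract does not state either definition, the function names are hypothetical, and the numbers are illustrative placeholders rather than results from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks whose attack was judged successful (assumed definition)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def relative_increase_pct(baseline: float, improved: float) -> float:
    """Percentage increase of `improved` over `baseline` (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100


# Illustrative placeholder outcomes, not data from the paper.
baseline_outcomes = [True, False, False, False]
improved_outcomes = [True, False, False, True]

baseline_asr = attack_success_rate(baseline_outcomes)
improved_asr = attack_success_rate(improved_outcomes)
increase = relative_increase_pct(baseline_asr, improved_asr)
print(baseline_asr, improved_asr, increase)
```

Under these definitions, a relative increase above 100% means the success rate more than doubled, which is a different statement from an absolute success rate such as the 86% figure.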

We contribute our implementation to the AgentHarm repository, enabling systematic identification of safety vulnerabilities in custom agentic workflows before deployment.
