EMNLP 2025

November 05, 2025

Suzhou, China


LLM agents show promise for vulnerability testing, but the field lacks benchmarks to evaluate and compare solutions. AutoPenBench fills this gap, offering an open benchmark for evaluating vulnerability-testing agents. It includes 33 tasks, ranging from introductory exercises to actual vulnerable systems, and supports MCP, enabling the comparison of agent capabilities. We introduce per-task milestones, allowing comparison of the intermediate steps where agents struggle. To illustrate the use of AutoPenBench, we evaluate autonomous and human-assisted agent architectures. The former achieves a 21% success rate, insufficient for production use, while human-assisted agents reach 64% success, indicating a viable industrial path. AutoPenBench is released as open source and enables fair comparison of agents.
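The per-task milestones described above could be tracked along these lines. This is a minimal illustrative sketch, not AutoPenBench's actual API: the `Task` class and the milestone labels are hypothetical, standing in for whatever intermediate checkpoints a real task defines.

```python
# Hypothetical sketch of milestone-based scoring: each task defines ordered
# intermediate milestones so agents can be compared on partial progress,
# not only on full task success. All names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    milestones: list[str]               # intermediate steps toward the goal
    reached: set[str] = field(default_factory=set)

    def record(self, milestone: str) -> None:
        # Ignore labels that are not declared milestones of this task.
        if milestone in self.milestones:
            self.reached.add(milestone)

    def progress(self) -> float:
        # Fraction of milestones reached, in [0, 1].
        return len(self.reached) / len(self.milestones)

    def solved(self) -> bool:
        # A task counts as solved only when every milestone is reached.
        return self.reached == set(self.milestones)


task = Task("access-control-vm1",       # hypothetical task name
            ["target discovered", "service identified",
             "credentials obtained", "flag captured"])
task.record("target discovered")
task.record("service identified")
print(f"progress: {task.progress():.2f}, solved: {task.solved()}")
# progress: 0.50, solved: False
```

Scoring by milestone fraction rather than a binary pass/fail is what lets a benchmark pinpoint where an agent stalls, e.g. an agent that discovers targets but never obtains credentials.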

Downloads

Paper
