EMNLP 2025

November 05, 2025

Suzhou, China

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination in SWE-bench (Jimenez et al., 2023): 32.67% of successful patches involve direct solution leakage, and 31.08% pass only because of inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through automated collection of real-world GitHub issues and rigorous quality validation. Our pipeline ensures task quality while minimizing contamination risk, yielding approximately 10,000 potential tasks, with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power across state-of-the-art models. We report performance for over a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.
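
The contamination defense at the heart of such a pipeline, admitting only issues opened after a fixed cutoff date so that tasks postdate model training data, can be illustrated with a minimal sketch. This is not the authors' implementation: the repository list, cutoff date, and search query below are illustrative assumptions, and only the public GitHub search API is used.

```python
# Minimal sketch (assumed setup, not the SWE-MERA pipeline): collect closed
# GitHub issues created after a cutoff date as candidate benchmark tasks.
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"
CUTOFF = "2024-09-01"       # hypothetical cutoff: only issues opened later qualify
REPOS = ["psf/requests"]    # hypothetical target repositories

def fetch_recent_issues(repo: str, cutoff: str) -> list[dict]:
    """Return closed issues of `repo` opened after `cutoff`, oldest first."""
    query = f"repo:{repo} is:issue is:closed created:>{cutoff}"
    resp = requests.get(
        GITHUB_SEARCH,
        params={"q": query, "sort": "created", "order": "asc", "per_page": 100},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()  # unauthenticated calls are rate-limited by GitHub
    return resp.json()["items"]

if __name__ == "__main__":
    for repo in REPOS:
        for issue in fetch_recent_issues(repo, CUTOFF):
            print(issue["created_at"], issue["html_url"])
```

A real pipeline would follow this with the quality checks the abstract describes, such as verifying that a linked fix exists and that the repository's tests actually discriminate between patched and unpatched code.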
