EMNLP 2025

November 08, 2025

Suzhou, China

Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.

Large Language Models (LLMs) powered with argentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of this tool is Deep research with the capability to browse the web, extract information and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need to have carefully crafted evaluation standards. The evaluation done on OpenAI‘s Deep Search and Google’s Deep Search in generating an academic survey showed the huge gap between search engines and standalone Deep Research tools, as well as the shortcomings in representing the targeted area.

Downloads

Paper

Next from EMNLP 2025

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
workshop paper

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

EMNLP 2025

Hassan Sajjad
Sher Badshah and 1 other author

08 November 2025

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Presentations
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2026 Underline - All rights reserved