EMNLP 2025

November 05, 2025

Suzhou, China


Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, in long-form question answering (LFQA), they often struggle with factual accuracy, frequently generating hallucinated responses. In this work, we introduce FinLFQA, a benchmark designed to evaluate LLMs' ability to generate long-form answers to financial questions with reliable attributions. FinLFQA evaluates three key aspects: (1) evidence-supported content that enhances factual grounding and verifiability, (2) step-by-step calculations expressed as executable code for numerical reliability, and (3) reasoning informed by domain-specific financial knowledge. Using our automated evaluation protocol, we conduct an extensive evaluation of eight LLMs. Our findings show that GPT-4o outperforms the other models, while open-source models are closing the gap with proprietary ones, demonstrating that they are becoming competitive alternatives for real-world applications. We also find that post-hoc and end-to-end attribution generation perform similarly, and that iterative self-feedback yields no significant improvement unless an external signal is provided.
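The numerical-reliability aspect above can be illustrated with a minimal sketch: execute a model-generated calculation snippet and compare its result to a gold value within a relative tolerance. All names here (`run_generated_code`, the sample snippet, the tolerance) are illustrative assumptions, not FinLFQA's actual evaluation API.

```python
def run_generated_code(code: str) -> float:
    """Execute a generated calculation snippet and return its `result` variable.

    In a real evaluation harness this should run in a sandbox; plain exec()
    is used here only to keep the sketch self-contained.
    """
    namespace: dict = {}
    exec(code, namespace)
    return float(namespace["result"])


def is_numerically_correct(predicted: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """Compare with a relative tolerance to absorb rounding differences."""
    return abs(predicted - gold) <= rel_tol * max(1.0, abs(gold))


# Hypothetical example: net profit margin = net income / revenue
generated = (
    "revenue = 394328.0\n"
    "net_income = 99803.0\n"
    "result = net_income / revenue\n"
)
gold = 99803.0 / 394328.0

print(is_numerically_correct(run_generated_code(generated), gold))  # True
```

Executing the generated code, rather than string-matching the final number, lets an evaluator verify each intermediate step of the calculation deterministically.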

Downloads: slides, paper, and an automatic English transcript are available.

