EMNLP 2025

November 05, 2025

Suzhou, China


This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are (1) data structuring, (2) model engagement, and (3) output refinement. Our study investigates the sub-problems within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through a new layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. Our results on two datasets show that our one-factor-at-a-time (OFAT) method achieves near-optimal results: it scores only 0.8--1.8 points below the best full-factorial exploration while requiring only a fraction (~2.8%) of the computation, and it gains 13.3--37.5 points over a baseline configuration. We demonstrate that, when well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, label-free alternative.
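The OFAT strategy described above can be illustrated with a small sketch: tune one design factor at a time while holding the others at their current best values, instead of evaluating every combination. The factor names and scoring function below are illustrative placeholders, not the paper's actual design space or metric.

```python
# Minimal sketch of one-factor-at-a-time (OFAT) configuration search.
# Factors, levels, and the score function are hypothetical examples.
from itertools import product


def ofat_search(factors, score, baseline):
    """Greedily tune each factor in turn, keeping the rest fixed."""
    best = dict(baseline)
    evals = 0
    for name, levels in factors.items():
        scored = []
        for level in levels:
            candidate = {**best, name: level}
            scored.append((score(candidate), level))
            evals += 1
        # Keep the best-scoring level for this factor.
        best[name] = max(scored)[1]
    return best, evals


# Toy design space: 3 factors with 3 levels each (27 combinations).
factors = {
    "input_repr": ["plain", "markdown", "xml"],
    "chunking": ["page", "section", "sliding"],
    "prompt": ["zero-shot", "few-shot", "cot"],
}

# Hypothetical score: counts how many choices match a target config.
target = {"input_repr": "xml", "chunking": "section", "prompt": "few-shot"}
score = lambda cfg: sum(cfg[k] == v for k, v in target.items())

baseline = {k: levels[0] for k, levels in factors.items()}
best, evals = ofat_search(factors, score, baseline)
full_factorial = len(list(product(*factors.values())))

print(best)                    # finds the target configuration here
print(evals, full_factorial)   # 9 evaluations vs. 27 full-factorial runs
```

OFAT costs only the sum of level counts (3+3+3 = 9 runs) rather than their product (3×3×3 = 27), which is why it scales to large design spaces; the trade-off is that it can miss interactions between factors, which the paper quantifies as a 0.8--1.8 point gap to the full-factorial optimum.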

Downloads

  • Slides
  • Paper
  • Transcript English (automatic)
