EMNLP 2025

November 07, 2025

Suzhou, China


Automatic understanding of figures in scientific papers is challenging because they often contain sub-figures and sub-captions in complex layouts. In this paper, we propose a vision-language model that extracts aligned pairs of sub-figures and sub-captions from scientific papers. We further create a carefully curated dataset of 7,174 compound figures with annotated sub-figure bounding boxes and aligned sub-captions. Our experiments show that the proposed method outperforms strong prior vision models by 2.3% in figure-detection average precision, and improves caption extraction by an absolute 46.5% BLEU over Llama-2-13B.
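The abstract reports caption-extraction quality in BLEU. As background, here is a minimal sentence-level BLEU sketch in pure Python, using uniform weights over 1- to 4-grams and the standard brevity penalty. The paper's exact evaluation setup (tokenization, smoothing, corpus- vs. sentence-level scoring) is not specified on this page, so this is an illustration of the metric, not the authors' evaluation code:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0; with no smoothing, any candidate missing all 4-grams of the reference scores 0, which is why real evaluations typically use a smoothed or corpus-level variant such as sacreBLEU.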



Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2025 Underline - All rights reserved