Automatic understanding of figures in scientific papers is challenging, since they often contain sub-figures and sub-captions in complex layouts. In this paper, we propose a vision-language model that extracts aligned pairs of sub-figures and sub-captions from scientific papers. We further create a carefully curated dataset of 7,174 compound figures with annotated sub-figure bounding boxes and aligned sub-captions. Our experiments show that the proposed method outperforms strong prior vision models on sub-figure detection average precision by 2.3% and improves caption extraction by an absolute 46.5% in BLEU over Llama-2-13B.
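The sketch below illustrates how the two reported metrics could be computed for a single compound figure; it is not the authors' evaluation code. Field names and the IoU threshold are assumptions, and for brevity it computes precision at a fixed IoU threshold rather than full average precision (which would additionally rank detections by confidence).

```python
# Hedged sketch: detection precision at an assumed IoU threshold plus
# per-sub-caption BLEU, for one compound figure. Not the paper's code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def evaluate(pred_boxes, gt_boxes, pred_caps, gt_caps, iou_thresh=0.5):
    # Detection: a predicted sub-figure box counts as correct if it
    # overlaps a still-unmatched ground-truth box above the threshold.
    matched, hits = set(), 0
    for pb in pred_boxes:
        for j, gb in enumerate(gt_boxes):
            if j not in matched and iou(pb, gb) >= iou_thresh:
                matched.add(j)
                hits += 1
                break
    precision = hits / len(pred_boxes) if pred_boxes else 0.0

    # Captions: average BLEU of each extracted sub-caption against
    # its aligned reference sub-caption.
    smooth = SmoothingFunction().method1
    bleu = sum(
        sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        for ref, hyp in zip(gt_caps, pred_caps)
    ) / max(len(gt_caps), 1)
    return precision, bleu
```

Aggregating these per-figure scores over the 7,174 annotated compound figures would yield corpus-level numbers comparable in spirit to the detection and caption-extraction results reported above.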