General-purpose Vision-Language Models (VLMs) are increasingly integral to modern AI systems for document understanding, yet their ability to perform fine-grained layout analysis remains severely underdeveloped. Overcoming this limitation requires a large-scale, high-fidelity training dataset, but current annotation methods, which rely on parsing rendered PDFs, are costly, error-prone, and fail to scale. This work introduces a paradigm shift in data acquisition that resolves the bottleneck. We present LaTeX2Layout, a novel and generalizable procedural pipeline that obtains ground-truth layout information not from the final PDF but directly from the LaTeX compilation process itself. By instrumenting the compiler, our method produces pixel-perfect bounding boxes and reading order, entirely bypassing the ambiguities of post-rendering parsers. This efficient and accurate pipeline enables us to generate a dataset of 140K pages, including 120K programmatically generated variants that more than double the layout diversity of real-world datasets. On this data we fine-tune a highly efficient 3B-parameter VLM, employing a curriculum learning strategy that orders training examples from simple to complex layouts to accelerate convergence. Our model establishes a new state of the art, achieving a Kendall's Tau of 0.95 for reading order and a mAP@0.5 of 0.91 for element grounding, a nearly 200% relative improvement over strong zero-shot baselines such as GPT-4o and Claude-3.7. To foster reproducible research and future innovation, we make our data generation pipeline, dataset, and all models openly available.
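The abstract only gestures at how compile-time extraction works, so a concrete illustration may help. The sketch below is not the LaTeX2Layout instrumentation itself; it is a minimal demonstration of the pdfTeX mechanism such a pipeline can build on, using the standard \pdfsavepos, \pdflastxpos, and \pdflastypos primitives (compile with pdflatex). The macro name \markpos and the .layout log file are hypothetical.

% Minimal compile-time position logging with pdfTeX primitives.
% Illustrative only; macro and file names are not from the paper.
\documentclass{article}

\newwrite\layoutlog
\immediate\openout\layoutlog=\jobname.layout

% \markpos{label}: drop a \pdfsavepos whatsit, then queue a delayed \write.
% The \write fires at shipout, when \pdflastxpos/\pdflastypos hold the saved
% position in sp (65536 sp = 1 pt), measured from the page's lower-left corner.
\newcommand{\markpos}[1]{%
  \pdfsavepos
  \write\layoutlog{#1,\the\pdflastxpos,\the\pdflastypos}%
}

\begin{document}
\markpos{para1/start}%
Every compilation regenerates these coordinates exactly, so no
post-rendering PDF parsing is needed.\markpos{para1/end}
\end{document}

Converting the logged sp values to pixels is pure arithmetic: divide by 65536 to get TeX points, divide by 72.27 to get inches, then multiply by the rendering DPI. Emitting such markers around every element, together with the order in which they are written, is one plausible route to the bounding boxes and reading order the paper describes.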
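For readers unfamiliar with the reading-order metric: Kendall's Tau scores a predicted ordering of the n elements on a page against the ground truth by comparing all element pairs. The definition below is the standard tie-free form, not notation from the paper:

\[
\tau \;=\; \frac{C - D}{\binom{n}{2}} \;=\; \frac{C - D}{C + D},
\]

where C and D count the pairs that the prediction orders the same as, respectively opposite to, the ground truth. Since C + D = \(\binom{n}{2}\) when there are no ties, the reported \(\tau = 0.95\) means \(C/(C+D) = (1 + 0.95)/2 = 0.975\), i.e. about 97.5% of element pairs are placed in the correct relative order.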