EMNLP 2025

November 05, 2025

Suzhou, China


Training data plays a crucial role in scaling Large Language Models (LLMs), yet high-quality data is in limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs, >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we find that pre-training on rephrased synthetic data *alone* is not faster than pre-training on natural web text, while pre-training on a mixture of 1/3 rephrased synthetic data and 2/3 natural web text yields a 5-10x speedup (in data needed to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data *alone* results in notably higher loss on many downstream domains, especially at small data budgets. "Good" ratios of synthetic data in training mixtures depend on the model size and data budget, empirically converging to around 30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than 8B-parameter models. Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
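To make the "5-10x speedup to reach the same validation loss" notion concrete, here is a minimal illustrative sketch (not the paper's actual fitting code; the loss curves, the saturating power-law form `L(D) = E + B / D**beta`, and all constants below are hypothetical) of fitting loss-vs-data scaling laws for two data sources and converting them into a data-budget speedup:

```python
import numpy as np

def fit_power_law(D, L, E=1.5):
    # Fit L(D) ~ E + B / D**beta by linearizing:
    # log(L - E) = log(B) - beta * log(D), then a least-squares line fit.
    slope, logB = np.polyfit(np.log(D), np.log(L - E), 1)
    return -slope, np.exp(logB)  # (beta, B)

def tokens_to_reach(loss, beta, B, E=1.5):
    # Invert L = E + B / D**beta to get the data budget D for a target loss.
    return (B / (loss - E)) ** (1.0 / beta)

# Hypothetical validation-loss curves (illustrative numbers, not from the paper):
D = np.array([1e9, 1e10, 1e11])            # training tokens
L_natural = 1.5 + 2.0 / D ** 0.08          # natural web text
L_mixed   = 1.5 + 2.0 / (5 * D) ** 0.08    # mixture, behaving like 5x more data

beta_n, B_n = fit_power_law(D, L_natural)
beta_m, B_m = fit_power_law(D, L_mixed)

# Speedup = ratio of data budgets needed to hit the same target loss.
target = L_natural[-1]                     # loss natural data reaches at 1e11 tokens
speedup = tokens_to_reach(target, beta_n, B_n) / tokens_to_reach(target, beta_m, B_m)
print(f"data-budget speedup: {speedup:.1f}x")  # 5.0x by construction here
```

The same fit-then-invert procedure is how speedups are typically read off scaling-law plots: fit each data source's loss curve, then compare the token budgets at which the curves cross a common loss threshold.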


