EMNLP 2025

November 06, 2025

Suzhou, China


English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same cannot be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that machine-translated texts from a high-quality English source can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset that we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across non-English reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding less than 5% of TransWebEdu as domain-specific pretraining data sets new state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh for understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.


