English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same cannot be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that machine-translated texts from a high-quality English source can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on it. Across non-English reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding less than 5% of TransWebEdu as domain-specific pretraining data establishes new state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh on understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
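To make the dataset-construction step concrete, below is a minimal sketch of machine-translating a slice of FineWeb-Edu into one target language. The abstract does not name the MT system used for TransWebEdu, so the NLLB-200 model, the language codes, and the sample size here are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative sketch: stream FineWeb-Edu and machine-translate documents
# into a target language. The MT model (NLLB-200), the target language
# (Swahili), and the sample size are assumptions made for demonstration;
# the paper's actual translation pipeline may differ.
from datasets import load_dataset
from transformers import pipeline

# Stream the corpus so the full English dataset is never held in memory.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu", split="train", streaming=True
)

# Hypothetical MT choice: a distilled NLLB-200 checkpoint, English -> Swahili.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="swh_Latn",
    max_length=512,
)

translated_docs = []
for i, doc in enumerate(fineweb_edu):
    if i >= 100:  # translate only a tiny sample for demonstration
        break
    # Truncate long documents; a production pipeline would chunk by sentence.
    result = translator(doc["text"][:2000])
    translated_docs.append(result[0]["translation_text"])

print(translated_docs[0][:200])
```

Repeating this over nine target languages, at web scale and with document-level chunking rather than naive truncation, would yield a multilingual corpus of the kind described above.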