LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivalling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help, but the improvements they bring are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most common being backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this too relies on the existence of good-quality, relevant target-side texts, which are not readily available for many LRLs. In this paper, we present a new approach, TopXGen, which uses an LLM to automatically generate topic-specific target-side data in the LRL, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can then be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Our code and outputs will be made freely available.
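To make the two-stage idea concrete, the sketch below illustrates a TopXGen-style pipeline: topic-conditioned generation of target-side text in the LRL, followed by backtranslation into the HRL to form synthetic parallel pairs. This is a minimal sketch under stated assumptions; the prompts, topic list, language choices and the llm_generate interface are illustrative placeholders, not the paper's actual prompts or models.

```python
# Hypothetical sketch of a TopXGen-style pipeline. The llm_generate() interface,
# prompts, topics, and language names are illustrative assumptions.
from typing import Callable


def generate_lrl_texts(llm_generate: Callable[[str], str],
                       topics: list[str],
                       lrl: str,
                       n_per_topic: int = 3) -> list[str]:
    """Step 1: prompt an LLM to write short, topic-specific passages directly in the LRL."""
    passages = []
    for topic in topics:
        for _ in range(n_per_topic):
            prompt = (f"Write a short, natural paragraph in {lrl} "
                      f"about the following topic: {topic}.")
            passages.append(llm_generate(prompt))
    return passages


def backtranslate(llm_generate: Callable[[str], str],
                  passages: list[str],
                  lrl: str,
                  hrl: str = "English") -> list[tuple[str, str]]:
    """Step 2: translate each generated LRL passage into the HRL,
    yielding synthetic (source, target) = (HRL, LRL) pairs."""
    pairs = []
    for target_text in passages:
        prompt = f"Translate the following {lrl} text into {hrl}:\n\n{target_text}"
        source_text = llm_generate(prompt)
        pairs.append((source_text, target_text))
    return pairs


if __name__ == "__main__":
    # Stub LLM call so the sketch runs end to end; replace with a real LLM client.
    def llm_generate(prompt: str) -> str:
        return f"<model output for: {prompt[:40]}...>"

    topics = ["local agriculture", "public health", "traditional music"]
    lrl_passages = generate_lrl_texts(llm_generate, topics, lrl="Hausa")
    parallel_pairs = backtranslate(llm_generate, lrl_passages, lrl="Hausa")
    # The resulting pairs can serve as ICL demonstrations or fine-tuning data.
    print(len(parallel_pairs), "synthetic parallel pairs")
```

The key design point reflected here is that generation happens on the target (LRL) side, where naturalness matters most, while translation only needs to go into the high-resource source language, a direction in which LLMs are already strong.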