Parallel corpora, as the foundation of machine translation, remain crucial even in the era of large language models (LLMs) for pre-training and fine-tuning. However, annotating parallel corpora is extremely costly, as it requires annotators proficient in multiple languages. To reduce this cost, prior work has explored image-pivoted corpus synthesis, generating multilingual captions for the same image as pseudo-parallel data. Unfortunately, these pseudo corpora suffer from a serious issue of multilingual focus divergence, i.e., the model attending to distinct aspects of the image when generating captions in different languages. To address this problem, we propose a method called PRISMS (Parallel Refracting ImageS into Multilingual descriptions with Structured visual guidance), which leverages semantic graphs as structured visual guidance to unify the focus of multilingual captions. To ensure adherence to this guidance, we introduce two key techniques: supervised fine-tuning on self-generated instructional data, and reinforcement learning with a reward signal based on semantic graph consistency. Experimental results on five languages show that PRISMS significantly improves image-pivoted parallel corpus synthesis, enabling LLMs to achieve translation performance comparable to that of models trained on manually annotated corpora.
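The abstract does not specify how semantic graph consistency is scored. A minimal sketch of one plausible formulation, assuming captions in two languages are parsed into (subject, relation, object) triples and the reward is the F1 overlap between the two triple sets (the paper's actual reward may differ):

```python
def graph_triples(caption_graph):
    """Normalize a parsed semantic graph into a set of (subject, relation, object) triples."""
    return {tuple(t) for t in caption_graph}

def consistency_reward(graph_a, graph_b):
    """Hypothetical reward: F1 overlap between the triple sets of two captions' graphs."""
    a, b = graph_triples(graph_a), graph_triples(graph_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)

# Illustrative example: graphs for two captions of the same image,
# with relations already mapped to a shared (e.g., English) vocabulary.
g_en = [("dog", "on", "grass"), ("dog", "holding", "ball")]
g_de = [("dog", "on", "grass"), ("sky", "is", "blue")]
print(consistency_reward(g_en, g_de))  # → 0.5
```

A reward of this shape is maximized when both captions describe exactly the same entities and relations, which is the behavior the reinforcement-learning stage is said to encourage.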