Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows; both challenges are amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in a language that was paired only with machine translations during training. In this setting, the encoder must learn to generate generalizable, latent task-aware vision representations that instruct the decoder via inserted cross-attention layers. We study scaling laws by training models based on Florence-2 and Gemma-2, ranging from 0.4B to 11.2B parameters, on a synthetic dataset under varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them includes image captions. Notably, we show that captioning can emerge in a language after training on only translation data. We find that this indirect learning of unseen task-language pairs adheres to scaling laws governed by the model's multilinguality, its size, and the number of training samples seen. Finally, we demonstrate that these scaling laws extend to a variety of downstream tasks, achieving competitive performance after finetuning on multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
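To make the architectural idea concrete, below is a minimal sketch of a decoder block with an inserted cross-attention layer, where text queries attend to vision-encoder states. All names (DecoderBlockWithXAttn, d_model, vision_states) are illustrative assumptions, not the paper's actual Florence-2/Gemma-2 wiring, which this sketch simplifies considerably.

```python
import torch
import torch.nn as nn

class DecoderBlockWithXAttn(nn.Module):
    """Simplified decoder block: self-attention, inserted cross-attention, MLP."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inserted cross-attention layer: this is where the vision encoder's
        # latent, task-aware representations condition text generation.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(
        self,
        x: torch.Tensor,                    # (batch, text_tokens, d_model)
        vision_states: torch.Tensor,        # (batch, vision_tokens, d_model)
        causal_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # Causal self-attention over the text sequence.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        # Cross-attention: queries come from text, keys/values from the encoder.
        h = self.norm2(x)
        x = x + self.cross_attn(h, vision_states, vision_states, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

# Usage: condition 16 text tokens on 64 vision tokens.
block = DecoderBlockWithXAttn()
text = torch.randn(2, 16, 512)
vision = torch.randn(2, 64, 512)
out = block(text, vision)  # -> shape (2, 16, 512)
```

The design choice of routing vision information exclusively through these inserted layers means the decoder never sees images directly; any zero-shot transfer to an unseen task-language pair must flow through the encoder's shared latent representations.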
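The abstract does not state the functional form of the fitted scaling laws. Purely as an illustration, a common choice in the scaling-law literature is a Chinchilla-style power law, here with the model's multilinguality entering through the fitted constants; the symbols below are assumptions for exposition, not the paper's notation:

$$\hat{L}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},$$

where $N$ is the parameter count, $D$ is the number of training samples seen, and $E, A, B, \alpha, \beta$ would be fitted separately per multilinguality setting (e.g., per number of caption-supervised languages).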