Multimodal Large Language Models (MLLMs) excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that grounds cultural knowledge directly into MLLMs. Leveraging Wikidata's large-scale knowledge graph, we collect images depicting culturally significant entities and generate multilingual Visual Question Answering (VQA) data. The resulting dataset, CulturalGround, comprises 2.3 million high-quality, culturally rich VQA pairs in Hindi, Tamil, Japanese, Indonesian, Vietnamese, and English. We train CulturalPangea, an open-source MLLM, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on culture-focused benchmarks, outperforming prior systems by an average of +4.9%, without degrading results on mainstream vision–language tasks. Our findings show that a targeted, culturally grounded approach can substantially narrow the cultural gap in MLLMs and offers a practical path toward globally inclusive multimodal systems.
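To make the data-generation idea concrete, the sketch below shows one plausible way to pull culturally grounded entities from Wikidata and turn them into simple recognition-style VQA pairs. It is a minimal illustration only: the SPARQL filter (country property P17, image property P18), the question template, the example QID, and all function names are assumptions for exposition, not the authors' actual CulturalGround pipeline.

```python
# Hypothetical sketch of Wikidata-driven VQA generation; not the authors' pipeline.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def fetch_entities_with_images(country_qid: str, limit: int = 50):
    """Fetch Wikidata entities associated with a country that have an image (P18)."""
    query = f"""
    SELECT ?entity ?entityLabel ?image WHERE {{
      ?entity wdt:P17 wd:{country_qid} ;   # country of the entity
              wdt:P18 ?image .             # image hosted on Wikimedia Commons
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "cultural-vqa-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return [
        {
            "qid": b["entity"]["value"].rsplit("/", 1)[-1],
            "label": b["entityLabel"]["value"],
            "image_url": b["image"]["value"],
        }
        for b in resp.json()["results"]["bindings"]
    ]


def make_vqa_pair(entity: dict, language: str = "en") -> dict:
    """Turn one entity record into a simple recognition-style VQA example."""
    # A real multilingual pipeline would use per-language templates or an LLM to
    # phrase questions; this fixed English template is only a placeholder.
    return {
        "image_url": entity["image_url"],
        "question": "What culturally significant entity is shown in this image?",
        "answer": entity["label"],
        "language": language,
    }


if __name__ == "__main__":
    # Q668 is the Wikidata QID for India, used here purely as an example.
    for ent in fetch_entities_with_images("Q668", limit=5):
        print(make_vqa_pair(ent))
```

In practice, such raw entity-image pairs would still need filtering for image quality and cultural relevance, plus translation or native-language question generation for each target language, before they resemble the curated multilingual VQA data described above.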