keywords:
multimodality and language grounding to vision, robotics and beyond
ethics, bias, and fairness
multilingualism and cross-lingual nlp
Multilingual vision–language models (VLMs) promise universal image–text retrieval, yet their social biases remain under-explored. We perform the first systematic audit of four public multilingual CLIP variants—M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2—covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits—and in caption-sparse contexts (e.g., Xhosa) amplifies—the English anchor's crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific "hot spots," underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.
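To make the evaluation setup concrete, here is a minimal sketch of how a zero-shot stereotype probe of this kind can be scored with a CLIP-style model through Hugging Face transformers. The checkpoint name, the English prompts, and the simple mean-difference skew metric below are illustrative placeholders, not the audit's exact protocol, which uses translated PATA-style captions across ten languages and balanced FairFace subsets.

```python
# Illustrative sketch of a zero-shot stereotype probe for a CLIP-style
# model via Hugging Face transformers. The checkpoint, prompts, and skew
# metric are placeholder assumptions, not the paper's exact setup.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL = "openai/clip-vit-base-patch32"  # stand-in; the audit uses multilingual variants
model = CLIPModel.from_pretrained(MODEL)
processor = CLIPProcessor.from_pretrained(MODEL)

# Hypothetical stereotype prompts; a real audit would use PATA-style
# captions translated into each target language.
prompts = ["a photo of a criminal", "a photo of a trustworthy person"]

def association_scores(images, prompts):
    """Softmax over prompts for each image: how strongly the model
    associates each image with each caption."""
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (n_images, n_prompts)
    return logits.softmax(dim=-1)

def group_skew(group_a_images, group_b_images, prompt_idx=0):
    """Mean-difference skew: gap in average association with one prompt
    between two demographic groups (e.g. FairFace gender subsets)."""
    a = association_scores(group_a_images, prompts)
    b = association_scores(group_b_images, prompts)
    return (a[:, prompt_idx].mean() - b[:, prompt_idx].mean()).item()
```

Computed per language and per model, comparing this kind of skew between a multilingual checkpoint and its English-only baseline yields an amplification signal of the sort the abstract reports.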
