EMNLP 2025

November 08, 2025

Suzhou, China


Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse benchmarks for information retrieval in low-resource languages, but their performance on low-resource languages still lags behind that on high-resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; the model was selected for its superior performance over other multilingual embedding models across the MIRACL, MTEB, and SEB benchmarks. To fine-tune the model, we curated a comprehensive dataset comprising Yorùbá (32.9k rows), Igbo (18k rows), and Hausa (85k rows), drawn mainly from news sources. We further augmented the multilingual dataset with English queries mapped to the Yorùbá, Igbo, and Hausa documents, enabling cross-lingual semantic training. The fine-tuned model raised the mean reciprocal rank (MRR) to 0.9201 for Yorùbá, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English-to-local retrieval, surpassing the baseline BGE-M3 scores of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. The resulting model supports multilingual search, question answering, and other local semantic applications. We release the final dataset, scraping and processing scripts, and fine-tuned model weights.
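The evaluation metric reported above, mean reciprocal rank (MRR), averages the reciprocal of the rank at which each query's first relevant document appears. A minimal sketch (function and document names here are illustrative, not from the paper's released code):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Compute MRR over a set of queries.

    ranked_results: one ranked list of doc ids per query.
    relevant: one set of relevant doc ids per query.
    """
    total = 0.0
    for docs, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank  # reciprocal rank of first hit
                break  # queries with no hit contribute 0
    return total / len(ranked_results)

# First query finds its relevant doc at rank 1, the second at rank 2,
# so MRR = (1 + 0.5) / 2 = 0.75.
print(mean_reciprocal_rank([["d1", "d2"], ["d3", "d1"]], [{"d1"}, {"d1"}]))
```

An MRR of 0.92, as reported for Yorùbá, roughly means the first relevant document typically appears at or very near rank 1.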


