
Colin Leong
University of Dayton
low-resource
multilingual
ner
multimodal
data quality
audit
phonemes
multilingual data
web-mined data
dataset
language id
4 presentations
11 views
SHORT BIO
Christian ML Engineer and PhD interested in low-resource NLP for helping people. Secretly a hyperintelligent octopus, mimicking people without true understanding.
Presentations

Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
Joshua Nemecek and 5 other authors

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
David Ifeoluwa Adelani and 35 other authors

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer and 51 other authors

Phone-ing it in: Towards Flexible Multi-Modal Language Model Training by Phonetic Representations of Data
Colin Leong and 1 other author