Lecture image placeholder

Premium content

Access to this content requires a subscription. You must be a premium user to view this content.

Monthly subscription - $9.99Pay per view - $4.99Access through your institutionLogin with Underline account
Need help?
Contact us
Lecture placeholder background
VIDEO DOI: https://doi.org/10.48448/r7r2-zd13

poster

ACL 2024

August 12, 2024

Bangkok, Thailand

MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

keywords:

cultural awareness

multilingual corpus

low-resource languages

Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., Kazakh Arabic script and traditional Mongolian script, respectively, which have been long neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity. Furthermore, we underscore the importance of attending to the multiplicity of writing systems, which is closely related to the cultural awareness of the resulting models. The MC$^2$ corpus and related models are made public to the community.

Downloads

Transcript English (automatic)

Next from ACL 2024

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
poster

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

ACL 2024

+3Pei KeMinlie Huang
Zhexin Zhang and 5 other authors

12 August 2024

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Lectures
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2023 Underline - All rights reserved