

workshop paper
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
keywords:
cultural NLP
cross-culture
benchmark
multilingual
Large language models (LLMs) show limitations in comprehending everyday cultural commonsense knowledge across diverse regions, especially in non-English languages, as such knowledge is often not written explicitly online. Existing works, however, mainly focus on a single language or rely heavily on online data sources such as Wikipedia, which are often included in the training data. To address this, we present BLEnD, a multilingual socio-cultural commonsense benchmark with 13K question samples built from carefully hand-crafted templates. The benchmark covers cultural aspects of 9 countries and regions (Azerbaijan, China, Greece, Indonesia, Mexico, Spain, UK, US, and West Java) in 7 languages (Azerbaijani, Chinese, English, Greek, Indonesian, Spanish, and Sundanese). We show that LLMs perform better on high-resource cultures, with a gap of up to 27.9 percentage points for GPT-4 in the short-answer question format. Furthermore, for mid-to-high-resource cultures, models perform better when queried in the local language, whereas for low-resource cultures they perform better in English.
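
The abstract does not specify BLEnD's data format or scoring procedure, so the sketch below is purely illustrative: the field names, the template-filling helper, and the exact-match scoring rule are assumptions made to show how a template-based, per-country short-answer benchmark of this kind could be represented, not the paper's actual pipeline.

```python
# Illustrative sketch only: every name below (fields, helpers, scoring rule)
# is an assumption for explanation purposes, not BLEnD's real format or API.
from dataclasses import dataclass


@dataclass
class QuestionTemplate:
    # A hand-crafted template with a {country} placeholder, written per language.
    template_id: str
    language: str
    text: str  # e.g. "What is a typical breakfast dish in {country}?"


COUNTRIES = ["Azerbaijan", "China", "Greece", "Indonesia",
             "Mexico", "Spain", "UK", "US", "West Java"]


def instantiate(template: QuestionTemplate, country: str) -> dict:
    """Fill one template for one country/region, yielding a short-answer item."""
    return {
        "template_id": template.template_id,
        "language": template.language,
        "country": country,
        "question": template.text.format(country=country),
    }


def short_answer_accuracy(predictions: dict, gold_answers: dict) -> float:
    """Toy scoring rule: a prediction counts if it matches any accepted answer,
    case-insensitively. The paper's actual matching criteria may differ."""
    hits = sum(
        predictions[qid].strip().lower() in {a.lower() for a in answers}
        for qid, answers in gold_answers.items()
    )
    return hits / len(gold_answers)


# Example: one English template instantiated for every country in the benchmark.
template = QuestionTemplate("food-01", "English",
                            "What is a typical breakfast dish in {country}?")
questions = [instantiate(template, c) for c in COUNTRIES]
```

Under this kind of setup, comparing accuracy for the same templates asked in English versus the local language is what would surface the resource-level gaps the abstract reports.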