Dialects exhibit a substantial degree of lexical variation due to the lack of a standard orthography. At the same time, Large Language Models' (LLMs) ability to process dialects remains largely understudied. To address this gap, we conduct a fine-grained analysis of dialect variation across different parts of speech. Using Bavarian as a case study, we investigate the lexical dialect understanding capability of LLMs by examining how they recognize and translate dialectal terms. To this end, we introduce DiaLemma, a novel annotation framework for obtaining dialect variation dictionaries from monolingual data only, and use it to create a ground truth dataset of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can recognize Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our evaluation reveals that LLMs are better at translating and recognizing nouns. Surprisingly, when LLMs are used as dialect word translation models, we find that providing additional context in the form of example usages can boost their performance. Our results highlight the limitations of LLMs in dealing with orthographic dialect variation and emphasize the need for future work on adapting LLMs to dialects.