Morocco

This paper presents an unsupervised approach to detecting lexical variation in under-researched languages without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure requires no manual annotation and produces transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in ``noisy&#39;&#39; or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.

EACL 2026 Main Conference

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

This paper presents an unsupervised approach to detecting lexical variation in under-researched languages without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure requires no manual annotation and produces transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in ``noisy'' or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.

workshop paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, in Morocco, in March 24-29, 2026. EACL is the flagship European conference of the Association and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time EACL is being hosted in the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*<br><br>

<html><button style="display: inline-flex; align-items: center; justify-content: center; white-space: nowrap; border-radius: 9999px; font-weight: bold; background: #7c3aed; color: white; font-family: 'Space Grotesk', sans-serif; height: 40px; font-size: 16px; padding: 0 20px; border: none; cursor: pointer" onclick="window.open('https://underline.io/events/522/reception','_blank')">Go to Workshops and Tutorials Program</button></html>
<br><br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to EACL 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://2026.eacl.org/registration/) first.

**Online Registration Form**: https://acl.swoogo.com/eacl2026

Registration Required

Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

Mutual intelligibility within language families presents a significant challenge for multilingual NLP, particularly due to the prevalence of dialectal variation and asymmetric comprehension. In this paper, we present a corpus-based computational analysis to quantify linguistic proximity across Romance language variants, with a focus on major Spanish (Argentine, Chilean and European) and Portuguese (Brazilian and European) varieties and the other main Romance languages (Italian, French, Romanian). We introduce a novel computational metric of lexical intelligibility based on surface and semantic similarity of related words, and use it to measure mutual intelligibility for the five main Romance languages in relation to the Spanish and Portuguese varieties studied.

On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America

We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally associated with phylogenetic distance across languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.

Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

This paper addresses key gaps in assessing Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation quality, where standard automatic metrics and general-purpose human evaluation methods fail to capture dialect-specific errors. We introduce Ara-HOPE, a human-centric post-editing evaluation framework comprising a five-category error taxonomy and a decision-tree annotation protocol designed for consistent use by minimally trained native dialect speakers. Using Ara-HOPE, we comparatively evaluate three mainstream systems (Jais, GPT-3.5, and NLLB-200) to quantify error patterns, pinpoint systematic weaknesses, and provide actionable guidance for improving dialect-aware MT. Our analysis shows clear performance differences across systems and highlights dialect-specific terminology handling and semantic preservation as the most persistent challenges in DA-MSA translation.

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English centric representation spaces, making cross lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/AnonymousAccountACL/IndicTunedLens. Our code is available at https://github.com/AnonymousAccountACL/IndicTunedLens.

Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

Dialectal speech remains largely underexplored in Automatic Speech Recognition (ASR) research, particularly for Slavic languages. While Ukrainian ASR systems have rapidly improved in recent years with the adoption of Whisper, XLS-R, and Wav2Vec-based models, performance on dialectal variants remains unknown and often significantly degraded. In this work, we present the first dedicated effort to build ASR resources for the Hutsul dialect of Ukrainian. We develop a data preparation and segmentation pipeline, evaluate multiple forced alignment strategies, and benchmark state-of-the-art ASR models under zero-shot and fine-tuned conditions. We evaluate results using WER and CER demonstrating that large multilingual ASR models struggle with dialectal speech, while lightweight fine-tuning produces substantial improvements. All scripts, alignment tools, and training recipes are made publicly available to support future research on Ukrainian dialect speech.

Building ASR Resources for the Hutsul Dialect of Ukrainian

Phonetic transcription is vital for speech processing and linguistic documentation, particularly in languages like Tamil with complex phonology and dialectal variation. Challenges such as consonant gemination, retroflexion, vowel length, and one-to-many grapheme-phoneme mappings are compounded by limited data on Sri Lankan Tamil dialects. We present a dialect-aware, rule-based transcription tool for Tamil that supports Indian and Jaffna Tamil, with extensions underway for other dialects. Using a two-stage pipeline: Tamil script to Latin, then to IPA with context-sensitive rules, the tool handles dialect shifts. A real-time interface enables dialect selection. Evaluated on a 7,830-word corpus, it achieves 94.54\% accuracy for Jaffna Tamil and higher than other tools like eSpeak NG, advancing linguistic preservation and accessible speech technology for Tamil communities.

Bridging Dialectal Variation: A Phonetic Transcription Tool for Tamil

Regional variation was a limiting factor for automatic speech recognition (ASR) before large language models. With the new technology, speech processing becomes more general, which opens the question of how to use data in similar languages such as Croatian and Serbian. In this paper, we analyse model performance in various train-test scenarios with the goal of better understanding the mutual interference of these two languages. Our findings suggest that better performing models are not very sensitive to the regional variation. Training from scratch in one of the languages can give good results on both of them, while fine-tuning large pre-trained multilingual models on smaller data sets does not give the expected results.

Regional Variation in the Performance of ASR Models on Croatian and Serbian

Being modeled as a single-label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LahjatBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system.

Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. In the evaluation we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. Finally, we find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages.

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

Despite the success of large language models (LLMs) in a wide range of applications, it has been shown that their performance varies across English dialects. Differences among English dialects are reflected in vocabulary, syntax, and writing style, and can adversely affect model performance. Several studies evaluate the dialect robustness of LLMs, yet research on enhancing their robustness to dialectal variation remains limited.

In this paper, we propose two parameter-efficient frameworks for improving dialectal robustness in LLMs: DialectFusion where we train separate LoRA layers for each dialect and apply different LoRA merging methods, and DialectMoE which is built on top of Mixture of Experts LoRA and introduces multiple LoRA-based experts to the feed-forward layer to internally model the dialectal dependencies. Our comprehensive analysis on five open-source LLMs for sentiment and sarcasm tasks in zero- and few-shot settings shows that our proposed approaches enhance the dialect robustness of LLMs and outperforms instruct and LoRA fine-tuning based approaches.

Premium content

Downloads

Next from EACL 2026 Main Conference

On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES