Morocco

Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over two million speakers but minimal digital presence. Rather than fine-tuning, we examine whether structured prompt engineering alone can elicit basic conversational ability under extreme data scarcity. Our framework combines explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, and Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis shows that negative constraints provide consistent improvements (12–18 percentage points), while the effectiveness of grammar documentation varies by model architecture (8–22 points). These results demonstrate that structured in-context learning can meaningfully extend LLM capabilities to extremely low-resource languages without parameter updates.

EACL 2026 Main Conference

Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language

workshop paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, in Morocco, in March 24-29, 2026. EACL is the flagship European conference of the Association and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time EACL is being hosted in the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*<br><br>

<html><button style="display: inline-flex; align-items: center; justify-content: center; white-space: nowrap; border-radius: 9999px; font-weight: bold; background: #7c3aed; color: white; font-family: 'Space Grotesk', sans-serif; height: 40px; font-size: 16px; padding: 0 20px; border: none; cursor: pointer" onclick="window.open('https://underline.io/events/522/reception','_blank')">Go to Workshops and Tutorials Program</button></html>
<br><br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to EACL 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://2026.eacl.org/registration/) first.

**Online Registration Form**: https://acl.swoogo.com/eacl2026

Registration Required

Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

Grammatical error correction (GEC) aims to improve text quality and readability. Previous work on the task focused primarily on high-resource languages, while low-resource languages lack robust tools. To address this shortcoming, we present a study on GEC for Zarma, a language spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluated GEC models using a dataset of more than 250,000 examples, including synthetic and human-annotated data. Our results showed that the MT-based approach using M2M100 outperforms others, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations (AE) and an average score of 3.0 out of 5.0 in manual evaluation (ME) from native speakers for grammar and logical corrections. The rule-based method was effective for spelling errors but failed on complex context-level errors. LLMs---Gemma 2b and MT5-small---showed moderate performance. Our work supports use of MT models to enhance GEC in low-resource settings, and we validated these results with Bambara, another West African language.

Grammatical Error Correction for Low-Resource Languages: The Case of Zarma

Fine-tuning multilingual models for low-resource dialect translation frequently encounters a “plausibility over faithfulness” dilemma, resulting in severe semantic drift on dialect-specific tokens. We term this phenomenon the “Probability Trap,” where models prioritize statistical fluency over semantic fidelity. To address this, we propose MVS-Rank (Multi-View Scoring Reranking), a generate-then-rerank framework that decouples evaluation from generation. Our method assesses translation candidates through three complementary perspectives: (1) Source-Side Faithfulness via a Reverse Translation Model to anchor semantic fidelity; (2) Local Fluency using Masked Language Models to ensure syntactic precision; and (3) Global Fluency leveraging Large Language Models to capture discourse coherence. Extensive experiments on Cantonese-Mandarin benchmarks demonstrate that MVS-Rank achieves state-of-the-art performance, significantly outperforming strong fine-tuning baselines by effectively rectifying hallucinations while maintaining high fluency.

Escaping the Probability Trap: Mitigating Semantic Drift in Cantonese-Mandarin Translation

We use Finnish and Northern Sámi as a case study to investigate how suitable multilingual LLMs are for low-resource machine translation and how much performance can be improved using supervised finetuning with varying amounts of parallel data. Our experiments on zero-shot translation reveal that mainstream multilingual LLMs from a variety of model families are unsuitable for translation between our chosen languages as-is, regardless of the generation hyperparameters. On the other hand, our experiments on supervised finetuning reveal that even relatively small amounts of parallel data can be very useful for improving performance in both translation directions.

How multilingual are multilingual LLMs? A case study in Northern Sámi-Finnish Translation

Emoji reactions are a frequently used feature of messaging platforms, yet their communicative role remains understudied. Prior work on emojis has focused predominantly on in-text usage, showing that emojis embedded in messages tend to amplify and mirror the author's affective tone. This evidence has often been extended to emoji reactions, treating them as indicators of emotional resonance or user sentiment. However, they may reflect broader social dynamics. Here, we investigate the communicative function of emoji reactions on Telegram. We analyze over 650k crypto-related messages that received at least one reaction, annotating each with sentiment, emotion, persuasion strategy, and speech act labels, and inferring the sentiment and emotion of emoji reactions using both lexicons and LLMs. We uncover a systematic mismatch between message and reaction sentiment, with positive reactions dominating even for neutral or negative content. This pattern persists across rhetorical strategies and emotional tones, indicating that emojis used as reactions do not reliably function as indicators of emotional mirroring or resonance of the content, in contrast to findings reported for in-text emojis. Finally, we identify the features that most predict emoji engagement. Overall, our findings caution against treating emoji reactions as sentiment labels, highlighting the need for more nuanced approaches in sentiment and engagement analysis.

Emoji Reactions on Telegram: Unreliable Indicators of Emotional Resonance

Large language models (LLMs) are now widely used in applications that depend on closed-ended decisions, including automated surveys, policy screening, and decision-support tools. In such contexts, these models are typically expected to produce consistent binary or ternary responses (for example, Yes, No, or Neither) when presented with questions that are semantically equivalent. However recent studies shows that LLM outputs can be influenced by relatively minor changes in prompt wording, raising concerns about the reliability of their decisions under paraphrasing. In this paper, we conduct a systematic analysis of paraphrase robustness across five widely used LLMs. To support this evaluation, we develop a controlled dataset consisting of 200 opinion-based questions drawn from multiple domains, each accompanied by five human-validated paraphrases. All models are evaluated under deterministic inference settings and constrained to a fixed Yes/No/Neither response format. We assess model behavior using a set of complementary metrics that capture the stability of each evaluated model. DeepSeek Reasoner and Gemini 2.0 Flash show the highest stability when responding to paraphrased inputs, whereas Claude 3.7 Sonnet exhibits strong internal consistency but produces judgments that differ more frequently from those of other models. By contrast, GPT-3.5 Turbo and LLaMA 3 70B display greater sensitivity to surface-level variations in prompt phrasing. Overall, these findings suggest that robustness to paraphrasing is driven more by alignment strategies and reasoning design choices than by model size alone.

Measuring LLMs’ Sensitivity to Paraphrased Opinion Prompts

Emotional tone plays a central role in persuasion, yet its impact on computational assessments of political argument quality in real world election campaign speeches remains understudied. In this work, we investigate whether positive emotional framing correlates with higher perceived convincingness in political arguments. We fine-tune language models on argument quality datasets and test their ability to transfer convincingness predictions to real-world campaign speeches. Using a corpus of U.S. presidential campaign speeches, we analyze emotional polarity in relation to predicted persuasive strength to test whether positively framed arguments are judged more convincing than neutral or negative ones. Our empirical analysis shows that political parties rely heavily on argumentation during their election campaigns. Also, we found the evidence that politicians strategically employ emotional cues within their arguments during these campaign speeches, with positive emotions being more strongly associated with persuasive strength, for example in topics such as USMCA’s Effect on American Jobs and Agriculture, Border Control Policies, Progressive Tax Reforms. At the same time, we find that negative emotions have a weaker yet still non-negligible influence on voter persuasion in topics such as City Crime and Civil Unrest and White Supremacist Violence (Charlottesville Incident).

Predicting Convincingness in Political Speech: How Emotional Tone Shapes Persuasive Strength

This study examines the capability of LLMs to predict emotional ratings of Russian words by comparing their assessments with both native speakers' ratings and expert evaluations. The research utilises two datasets: the ENRuN database containing associative emotional ratings of Russian nouns by native speakers, and RusEmoLex, an expert-compiled lexicon. Various open-source LLMs were evaluated, including international models (Llama-3, Qwen 2.5), Russian-developed models, and Russian-adapted variants, representing three parameter scales. The findings reveal distinct patterns in model performance: Russian-adapted models demonstrated superior alignment with native speakers' ratings, whilst model size was not a decisive factor. Conversely, larger models showed better performance in matching expert assessments, with language adaptation having minimal impact. Emotional or sensitive lexis with strong connotations produce a more substantial human-model gap.

Emotional Lexicons: How Large Language Models Predict Emotional Ratings of Russian Words

Data annotation is essential for supervised natural language processing tasks but remains labor-intensive and expensive. Large language models (LLMs) have emerged as promising alternatives, capable of generating high-quality annotations either autonomously or in collaboration with human annotators. However their use in autonomous annotations is often questioned for their ethical take on subjective matters. This study investigates the effectiveness of LLMs in a autonomous, and hybrid annotation setups in propaganda detection. We evaluate GPT and open-source models on two datasets from different domains, namely, Propaganda Techniques Corpus (PTC) for news articles and the Journalist Media Bias on X (JMBX) for social media. Our results show that LLMs, in general, exhibit high recall but lower precision in detecting propaganda, often over-predicting persuasive content. Multi-annotator setups did not outperform the best models in single-annotator setting although it helped reasoning models boost their performance. Hybrid annotation, combining LLMs and human input, achieved the highest overall accuracy than LLM-only settings. We further analyze misclassifications and found that LLM have higher sensitivity towards certain propaganda techniques like loaded language, name calling, and doubt. Finally, using error typology analysis, we explore the reasoning provided on misclassifications by the LLM. Our result shows that although some studies report LLM outperforming manual annotations and it could prove useful in hybrid annotation, its incorporation in the human annotation pipeline must be implemented with caution.

Council of LLMs: Evaluating Capability of Large Language Models to Annotate Propaganda

This paper presents a domain-specific transformer pipeline for quantifying social atmosphere in hostel reviews, an experiential dimension that travelers consistently prioritize but that existing NLP methods and booking platforms fail to capture. We train a cross-encoder on 4,994 manually annotated reviews and use it to pseudo-label 162,840 additional reviews; these labels are then distilled into a sentence-transformer bi-encoder, producing embeddings where proximity reflects social interaction level rather than generic sentiment. On held-out human-labeled data, the domain-adapted embeddings achieve F1 = 0.826, outperforming generic sentence embeddings (0.671) and zero-shot GPT-4o (0.774), with a 40-fold improvement in intra-class versus inter-class similarity. Aggregating predictions to the property level reveals that hostel socialness follows an approximate exponential distribution, confirming that highly social hostels are rare. This work formalizes socialness as a measurable semantic construct and provides a general template for extracting implicit experiential attributes from text at scale.

Quantifying Social Sentiment in Hostels Using A Domain-Specific Transformer Pipeline

Understanding emotion responses relies on reconstructing how individuals appraise events. While prior work has studied emotion trajectories and inherent correlations with appraisals, it has considered appraisals only in a snapshot analysis. However, because appraisal is a complex, sequential process, we argue that it should be analyzed based on how it unfolds throughout a narrative. In this study, we investigate whether trajectories of appraisals are distinctive for different emotions in five-event stories -- narratives where each of five sentences describes an event. We employ zero-shot prompting with a large language model to predict appraisals on sub-sequences of a narrative. We find that this approach is effective in identifying relevant appraisals in narratives, without prior knowledge of the evoked emotion, enabling a comprehensive analysis of appraisal trajectories. Furthermore, we are the first to quantitatively identify typical patterns of appraisal trajectories that distinguish emotions. For example, a rising trajectory for self-responsibility indicates trust, while a falling trajectory suggests anger.

Premium content

Downloads

Next from EACL 2026 Main Conference

Grammatical Error Correction for Low-Resource Languages: The Case of Zarma

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES