Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLMs' understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than on building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs have an incomplete grasp of narrative coherence.

EACL 2026 Main Conference

Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?

technical paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, Morocco, on March 24-29, 2026. EACL is the flagship European conference of the Association, and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time, EACL is being hosted on the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*


Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

In today’s rapidly evolving large language model (LLM) landscape, technology companies such as Cisco face the difficult challenge of selecting the most suitable model for downstream tasks that demand deep, domain-specific product knowledge. Specialized benchmarks not only inform this decision making but can also be leveraged to rapidly create quizzes that effectively train engineering and marketing personnel on novel product offerings in a continually growing Cisco product space. We present Pro-QuEST, our Prompt-chain based Quiz Engine using state-of-the-art LLMs for generating multiple-choice questions on Specialized Technical products. In Pro-QuEST, we first identify key terms and topics from a given professional certification textbook or product guide, and then generate a series of multiple-choice questions using domain-knowledge guided prompts. We show LLM benchmarking results on the question benchmarks generated by Pro-QuEST using a range of the latest open-source and proprietary LLMs, and compare them with expert-created exams and review questions to derive insights on their composition and difficulty. Our experiments indicate that although there is room for improvement in Pro-QuEST to generate questions of the complexity levels seen in expert-designed certification exams, question-type based prompts provide a promising direction to address this limitation. In sample user studies with Cisco personnel, Pro-QuEST was received with high optimism for its practical usefulness in quickly compiling quizzes for self-assessment on knowledge of novel products in the rapidly changing tech sector.

Pro-QuEST: Prompt-chaining Quiz Engine for testing Specialized Technical Product Knowledge

A detailed understanding of the basic properties of text collections produced by humans or generated synthetically is vital for all steps of the natural language processing system life cycle, from training to evaluating model performance and synthetic texts.
To facilitate the analysis of these properties, we introduce elfen, a Python library for efficient linguistic feature extraction for text datasets. It includes the largest set of item-level linguistic features in eleven feature areas: surface-level, POS, lexical richness, readability, named entity, semantic, information-theoretic, emotion, psycholinguistic, dependency, and morphological features. Building on top of popular NLP and modern dataframe libraries, elfen enables feature extraction in various languages (80 at the moment) on thousands of items, even given limited computing resources. We show how using elfen enables linguistically informed data selection, outlier detection, and text collection comparison.
We release elfen as an open-source PyPI package, accompanied by extensive documentation, including tutorials. We host the code at https://github.com/mmmaurer/elfen/, make it available through the GESIS Methods Hub at https://methodshub.gesis.org/library/methods/elfen/, and provide documentation and tutorials at https://elfen.readthedocs.io/en/latest/. A screencast showcasing elfen is available at https://youtu.be/b4pqHWn6UPU.

elfen: A Python Package for Efficient Linguistic Feature Extraction for Natural Language Datasets
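To make a few of the feature areas named in the abstract concrete (surface-level, lexical richness, information-theoretic), here is a minimal standard-library sketch. It does not use elfen and makes no claim about elfen's API: the function names `tokenize` and `surface_and_richness_features` and the regex tokenizer are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; a crude stand-in for a real NLP pipeline."""
    return re.findall(r"[a-z']+", text.lower())


def surface_and_richness_features(text: str) -> dict[str, float]:
    """Compute a handful of item-level features for a single text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    if n_tokens == 0:
        return {"n_tokens": 0.0, "n_types": 0.0, "type_token_ratio": 0.0,
                "avg_token_length": 0.0, "unigram_entropy": 0.0}
    # Shannon entropy of the unigram distribution, in bits.
    entropy = -sum((c / n_tokens) * math.log2(c / n_tokens)
                   for c in counts.values())
    return {
        "n_tokens": float(n_tokens),                        # surface-level
        "n_types": float(n_types),                          # surface-level
        "type_token_ratio": n_types / n_tokens,             # lexical richness
        "avg_token_length": sum(map(len, tokens)) / n_tokens,
        "unigram_entropy": entropy,                         # information-theoretic
    }


features = surface_and_richness_features("the cat sat on the mat")
```

Computing one such dictionary per item and collecting the results in a dataframe is the kind of table that data selection or outlier detection could then operate on.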

Annual reports communicate corporate performance to stakeholders through dense tables and explanatory text, with rich grounding signals making automated reasoning challenging. Existing QA benchmarks focus on retrieval or single-modality reasoning and rarely require justification for answers with both textual and tabular evidence. We introduce ARQA (Annual Report QA), a benchmark of ~2.5K QA pairs spanning ten fiscal years of automotive enterprise annual reports and three reasoning families — Lookup, Arithmetic, and Insight. Data are produced via a planner–generator pipeline, deterministically verified and recomputed, and fully reviewed by domain experts. We evaluate state-of-the-art instruction-tuned language models on ARQA, showing strong factual retrieval but persistent weaknesses in grounded arithmetic and causal reasoning. We release ARQA and its evaluation toolkit to facilitate research on auditable, evidence-first reasoning over enterprise documents. (https://github.com/RuilongWang/ARQA-Benchmark/)

ARQA: A Benchmark for Grounded Table–Text QA in Enterprise Annual Reports

While large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation (resolve the reply-to relationship between utterances, or attribute roles like speakers and addressees) remains underexplored, especially in multimodal settings. To address this, we introduce a suite of tasks for multimodal conversation understanding and release TV-MMPC, a new human-annotated dataset of conversational roles and threading in television dialogue. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.

Multimodal Conversation Structure Understanding

Turn-taking is a fundamental component of human communication and is signalled through complex cues distributed across lexical, temporal, and prosodic information. Full-duplex models of spoken dialogue integrate these information sources to produce impressive turn-taking behaviour; however, existing evaluations of their turn-taking capabilities do not address which information sources drive their predictions. We present a systematic analysis of the role of lexical-temporal features on the predictability of turn structure by examining PairwiseTurnGPT, a full-duplex model of spoken dialogue transcripts. Through PCA, mixed-effects modelling, and temporal surprisal analysis, we reveal context-dependent patterns: linguistic fluency paradoxically creates overconfidence at intermediate completion points, while turn-shift overlap dominates boundary detection. Our findings uncover where lexical-temporal information suffices and where additional cues become necessary, establishing a deeper understanding of how turn-taking cues are distributed and how to evaluate dialogue systems.

Analysing the role of lexical and temporal information in turn-taking through predictability

AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there are limited insights about the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.

Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis
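The analysis step described in the abstract, correlating generalization accuracy with the shift of a linguistic feature between training and test conditions, can be sketched as follows. This is an illustration under assumptions, not the authors' code: the abstract does not define "feature shift", so `feature_shift` here is assumed to be the absolute difference of means, Pearson correlation is assumed as the correlation measure, and the numbers are synthetic toy values rather than results from the paper.

```python
import math


def feature_shift(train_values: list[float], test_values: list[float]) -> float:
    """Assumed definition: absolute difference in mean feature value."""
    mean_train = sum(train_values) / len(train_values)
    mean_test = sum(test_values) / len(test_values)
    return abs(mean_train - mean_test)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic toy values: one entry per (train condition, test condition) pair.
shifts = [0.0, 0.1, 0.2, 0.3]        # e.g. shift in past-tense rate
accuracies = [0.9, 0.8, 0.7, 0.6]    # detector accuracy on the test condition
r = pearson(shifts, accuracies)
```

In this toy setup accuracy falls linearly as the shift grows, so `r` is strongly negative; with real data the strength and sign would vary by feature, detector, and evaluation condition.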

Most evaluation methods like LLM-as-a-judge treat each test example independently, overlooking the potential to learn from previous evaluations. We introduce **Learning While Evaluating** (LWE), a framework that enables evaluators to improve sequentially during testing without parameter updates. LWE maintains an evolving *meta-prompt* that (i) produces sample-specific evaluation instructions and (ii) updates itself using self-generated feedback after each batch. While sequential updating improves performance, processing every sample incurs substantial computational overhead. We therefore propose ***Selective* LWE**, which updates the meta-prompt only for cases where the evaluator is uncertain, focusing computation on the most informative samples. On multimodal pairwise evaluation benchmarks, *Selective* LWE outperforms baselines and achieves comparable accuracy to full sequential updates while significantly reducing token costs.

Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
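The control flow that the abstract describes for *Selective* LWE (judge each sample with the current meta-prompt, then revise the meta-prompt after a batch using only the uncertain cases) can be sketched roughly as below. Everything here is hypothetical: the `Judgment` type, the `selective_evaluate` function, the confidence threshold, and the toy stubs stand in for LLM calls and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgment:
    verdict: str        # e.g. "A" or "B" in a pairwise comparison
    confidence: float   # assumed to lie in [0, 1]


# Hypothetical interfaces standing in for LLM calls.
Evaluator = Callable[[str, str], Judgment]   # (meta_prompt, sample) -> Judgment
Reviser = Callable[[str, list[str]], str]    # (meta_prompt, uncertain) -> new prompt


def selective_evaluate(batches: list[list[str]], meta_prompt: str,
                       evaluate: Evaluator, revise: Reviser,
                       threshold: float = 0.7) -> tuple[list[str], str]:
    """Judge every sample; update the meta-prompt only from uncertain ones."""
    verdicts: list[str] = []
    for batch in batches:
        uncertain: list[str] = []
        for sample in batch:
            judgment = evaluate(meta_prompt, sample)
            verdicts.append(judgment.verdict)
            if judgment.confidence < threshold:
                uncertain.append(sample)
        if uncertain:  # skip the costly update when the batch was confident
            meta_prompt = revise(meta_prompt, uncertain)
    return verdicts, meta_prompt


def toy_evaluate(meta_prompt: str, sample: str) -> Judgment:
    """Toy stub: samples containing 'hard' get low confidence."""
    return Judgment(verdict="A", confidence=0.5 if "hard" in sample else 0.9)


def toy_revise(meta_prompt: str, uncertain: list[str]) -> str:
    """Toy stub: record how many uncertain cases were reviewed."""
    return f"{meta_prompt} | reviewed {len(uncertain)} uncertain case(s)"


verdicts, final_prompt = selective_evaluate(
    [["easy a", "hard b"], ["easy c"]], "Judge the pair.",
    toy_evaluate, toy_revise)
```

In the toy run only the first batch contains an uncertain sample, so the meta-prompt is revised once and the second batch triggers no update, which is where the token savings would come from.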

Large Language Models (LLMs) frequently confabulate scientific facts, but the mechanisms underlying these failures remain poorly understood. We introduce **Reddit False And Correct Texts** (ReFACT), a benchmark of 1,001 expert-annotated question-answer pairs with **span-level error annotations**, enabling fine-grained analysis of confabulation detection, localization, and correction. Evaluating 9 state-of-the-art LLMs reveals two fundamental limitations. First, models exhibit a dominant **salient distractor** failure mode: 61% of incorrect span predictions are semantically unrelated to actual errors, indicating models fixate on contextually prominent terms rather than true error locations. This pattern persists across model scales (1B to 70B), suggesting scaling alone is insufficient to address this limitation. Second, **comparative judgment** (selecting which of two answers contains factual confabulations) proves fundamentally harder than **independent judgment** (classifying a single answer as confabulated or not). For example, even GPT-4o's F₁ score drops from 0.67 to 0.53 when comparing factual versus confabulated answers side-by-side. This dramatic performance degradation challenges the reliability of LLM-as-Judge paradigms widely adopted in current benchmarks. Code and data are released on [anonymized](https://storage.googleapis.com/dataset-refact/refact-dataset.jsonl).

ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

Conversational information seeking (CIS) systems aim to model the user's information need within the conversational context and retrieve the relevant information. One major approach to modeling the conversational context is to rewrite the user utterance so that it represents the information need independently of the conversation. In this work, we hypothesize that breaking down the information of an utterance into multiple queries covering different aspects of the information need can lead to more effective retrieval performance. This is more evident in complex utterances that require gathering evidence from various information sources, where a single query rewrite or query representation cannot capture the complexity of the utterance. We propose MQ4CS, a multi-aspect query generation and retrieval framework, which uses Large Language Models (LLMs) to break the user utterance into multiple queries. This approach improves retrieval performance, as most utterances benefit from more than one rewritten query. We evaluate MQ4CS on six widely used CIS datasets, showing it outperforms state-of-the-art query rewriting methods. Using MQ4CS, we also construct MASQ, which includes multiple-aspect queries for the six datasets. Fine-tuning the Llama model on MASQ yields significant improvements. We make our code and dataset publicly available.

Generating Multi-Aspect Queries for Conversational Search
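When an utterance is split into several aspect queries, each query yields its own ranked list, and those lists have to be merged into one result. The abstract does not say how MQ4CS combines them, so the sketch below swaps in reciprocal rank fusion (RRF), a standard fusion technique, purely as an assumption. The document ids and the three rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One hypothetical ranked list per aspect query of the same user utterance.
rankings = [
    ["d1", "d2", "d3"],   # aspect query 1
    ["d2", "d4"],         # aspect query 2
    ["d2", "d1"],         # aspect query 3
]
fused = reciprocal_rank_fusion(rankings)
```

Documents retrieved by several aspect queries (here `d2`, then `d1`) rise to the top, which matches the intuition that evidence supported by multiple aspects of the information need is more likely to be relevant.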

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
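For reference, the two assumptions the abstract criticizes are visible directly in the standard nDCG formula: gains are discounted by the logarithm of the rank, and a document with zero relevance contributes nothing rather than a penalty. The sketch below implements standard DCG and nDCG with the common linear-gain, log2-discount form. It is not UDCG, whose formula is not given here.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


score = ndcg([1, 0, 1])            # a relevant document pushed down to rank 3
with_distractor = dcg([1, 0])      # an irrelevant passage appended at rank 2
without_distractor = dcg([1])      # same list without it
```

Here `with_distractor` equals `without_distractor`: the metric cannot express that a distracting passage may actively hurt the generated answer, which is the gap a utility- and distraction-aware gain is meant to close.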
