
Automatic evaluation of open-ended question answering in specialized domains remains challenging mainly because it relies on manual annotations from domain experts. In this work, we assess the ability of several large language models (LLMs), including closed-access (GPT-5.1, Gemini-2.5-Pro), open-source general-purpose (Qwen-80B), and biomedical domain-adapted models (MedGemma-27B, Phi-3.5-mini variants), to act as automatic evaluators of semantic equivalence in French medical open-ended QA. Our analysis reveals that LLM-based judgments are sensitive to the source of answer generation: judgement correlation varies substantially across different generator models. Among the judges, MedGemma-27B and Qwen-80B achieve the highest agreement with expert annotations in terms of F1 score and Pearson correlation. We further explore lightweight adaptation strategies on Phi-3.5-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Even with 184 training instances, these adaptations significantly improve Phi-3.5’s results and reduce variability across answer generators, achieving performance comparable to larger domain-adapted models. Our results highlight the importance of generator-aware evaluation, the limitations of general-purpose LLMs in domain-specific settings, and the effectiveness of lightweight adaptation for compact models in low-resource scenarios.
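The agreement metrics named in this abstract (F1 score and Pearson correlation against expert annotations, broken down by answer generator) can be illustrated with a minimal sketch. It assumes binary equivalence labels; the generator names `gen_a` and `gen_b` and all label values are made-up placeholders, not data from the paper.

```python
from collections import defaultdict
from math import sqrt


def f1_score(gold, pred):
    """F1 of the positive class (1 = judged semantically equivalent)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson(x, y):
    """Pearson correlation; NaN when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


# Hypothetical records: (answer generator, expert label, judge label).
records = [
    ("gen_a", 1, 1), ("gen_a", 0, 0), ("gen_a", 1, 1),
    ("gen_b", 1, 0), ("gen_b", 0, 1), ("gen_b", 1, 1),
]

expert = [r[1] for r in records]
judge = [r[2] for r in records]
overall_f1 = f1_score(expert, judge)
overall_r = pearson(expert, judge)

# Generator-aware view: the same judge scored separately per generator.
by_generator = defaultdict(lambda: ([], []))
for generator, gold, pred in records:
    by_generator[generator][0].append(gold)
    by_generator[generator][1].append(pred)
per_generator = {g: f1_score(gold, pred) for g, (gold, pred) in by_generator.items()}

print(overall_f1, overall_r, per_generator)
```

In this toy data the judge agrees perfectly with the experts on one generator and only partially on the other, which is the kind of gap a single pooled score would hide.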

EACL 2026 Main Conference

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

workshop paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, Morocco, on March 24-29, 2026. EACL is the flagship European conference of the Association, and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time, EACL is being hosted on the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*



Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers’ loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.

Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers

Accurate normalization of health-related expressions to standardized biomedical concepts is crucial for both healthcare and biomedical research. However, traditional string-based matching methods are limited by lexical variations. In this study, we propose a neural embedding-based normalization framework that utilizes an embedding model trained on biomedical terminology, generating over 3.59 million embeddings corresponding to UMLS terms and Concept Unique Identifiers (CUIs). For clinical data, CUIs were retrieved via semantic matching, while Twitter phrases were first processed using a large language model (LLM) to generate preferred terms prior to embedding-based CUI retrieval. Our approach substantially outperforms exact string matching and MetaMap Lite. For clinical data (3,144 phrases), normalization accuracy improved from 0.679 (string match) and 0.574 (MetaMap Lite) to 0.858. For Twitter data (102 phrases), accuracy increased from 0.235 (string match) and 0.118 (MetaMap Lite) to a range of 0.882 (Gemini 2.5 Flash) to 0.980 (GPT-4o mini). These findings highlight both the effectiveness of embedding-based semantic retrieval and the ability of LLMs to generate preferred terms, enhancing robustness in health concept normalization across diverse text sources.
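The retrieval step described in this abstract (matching a phrase embedding against precomputed term embeddings and returning the associated CUI) can be sketched as a nearest-neighbour lookup under cosine similarity. This is a toy illustration only: the three-dimensional vectors, the `CUI_*` identifiers, and the function names are placeholders invented for the example, not real UMLS entries or the authors' implementation.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder index: (CUI placeholder, term, toy embedding).
# A real index would hold millions of model-generated vectors.
INDEX = [
    ("CUI_HEADACHE", "headache", [0.9, 0.1, 0.0]),
    ("CUI_FEVER", "fever", [0.1, 0.9, 0.0]),
    ("CUI_NAUSEA", "nausea", [0.0, 0.2, 0.9]),
]


def normalize_phrase(query_vec, index=INDEX):
    """Return the CUI placeholder whose term embedding is closest to the query."""
    best = max(index, key=lambda entry: cosine(query_vec, entry[2]))
    return best[0]


# Pretend embedding of an informal phrase such as "my head is pounding".
query = [0.8, 0.2, 0.1]
print(normalize_phrase(query))
```

At the scale the abstract mentions, an exhaustive scan like this would typically be replaced by an approximate nearest-neighbour index; that choice is an assumption here, not something the abstract states.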

Normalizing Health Concepts with Biomedical Embedding and LLMs

Large language models (LLMs) often default to single-label classification in zero-shot multi-label tasks—a tendency we term "conservative default". While few-shot prompting mitigates this, it introduces "example bias". We evaluate zero-shot strategies to modulate this tendency using 1,441 healthcare feedback records and two LLMs. We compare instruction-based methods with structural constraints that modify the token generation sequence, specifically an Enum-First format requiring domain enumeration before selection. Results show that structural constraints substantially reduce single-label rates (Magistral: 96% → 19%; Qwen3: 54% → 0.0%), though the latter suggests potential over-correction compared to human baselines (16.7–41.3%). These findings indicate that while output structure is a potent modulator of classification behavior by shifting the decision point upstream, its effect magnitude is model-dependent, necessitating empirical calibration to prevent spurious associations.
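The single-label rate reported in this abstract can be read as the share of records that receive exactly one label. A minimal sketch of that computation follows; the label names and both prediction lists are invented for illustration and do not come from the study.

```python
def single_label_rate(predictions):
    """Fraction of records assigned exactly one distinct label."""
    if not predictions:
        return 0.0
    single = sum(1 for labels in predictions if len(set(labels)) == 1)
    return single / len(predictions)


# Hypothetical model outputs for four feedback records under two prompt formats.
instruction_only = [
    ["usability"], ["workflow"], ["usability"], ["usability", "workflow"],
]
structured_output = [
    ["usability", "workflow"], ["workflow"],
    ["usability", "alerts"], ["usability", "workflow"],
]

rate_instruction = single_label_rate(instruction_only)
rate_structured = single_label_rate(structured_output)
print(rate_instruction, rate_structured)
```

Comparing such rates against the range observed for human coders is what lets the abstract flag a rate of 0.0% as possible over-correction.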

Modulating Multi-Label Tendency in Zero-Shot LLM Coding: The Effect of Output Structure on CDSS Feedback Analysis

Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://anonymous.4open.science/r/MosAIG.

Multi-Agent Multimodal Models for Multicultural Text to Image Generation

Long-form educational videos contain valuable mentorship insights, but extracting structured knowledge from hours of unscripted content remains challenging. We introduce MentorQA, a publicly available dataset and framework for mentorship-focused question answering from multilingual long-form videos. The dataset includes nearly 9,000 question–answer pairs collected from 120 mentorship videos across four languages (English, Hindi, Chinese, and Romanian) and six topics. We compare four QA-generation models and introduce mentorship-oriented evaluation metrics that go beyond factual correctness to assess learning value. Through comprehensive evaluation with nine LLM judges and twelve human annotators, we find that Multi-Agent pipelines consistently produce higher-quality mentorship-focused QA, particularly for complex topics and low-resource languages. We release the dataset and evaluation framework to support future research in multilingual educational and mentorship-focused AI at https://anonymous.4open.science/r/MentorQA/.

MentorQA: Multi-Agent Multilingual Question Answering for Long-Form Mentorship Content

In this paper, we study the capabilities of large language models (LLMs) to adapt concert moderation to listeners' diverse expertise levels. Our proof-of-concept concert moderator is based on retrieval-augmented generation (RAG) and uses few-shot audience modelling to infer the listener's expertise. We study the system's ability to adapt to three different listener expertise levels. Two open-domain LLMs are compared: gpt-oss:20b and llama3. The observed differences between the models suggest that they vary in how directly they reproduce versus paraphrase retrieved information while maintaining semantic alignment.

From Novice to Expert: Generating Audience-Dependent Concert Moderations with RAG-LLMs

The advancement of machine learning (ML), Large Audio Language Models (LALMs), and autonomous AI agents in Music Information Retrieval (MIR) necessitates a shift from static tagging to rich, human-aligned representation learning. However, the scarcity of open-source infrastructure capable of capturing the subjective nuances of audio annotation remains a critical bottleneck. This paper introduces LabelBuddy, an open-source collaborative auto-tagging audio annotation tool designed to bridge the gap between human intent and machine understanding. Unlike static tools, it decouples the interface from inference via containerized backends, allowing users to plug in custom models for AI-assisted pre-annotation. We describe the system architecture, which supports multi-user consensus, containerized model isolation, and a roadmap for extending agents and LALMs. Code is available at https://github.com/GiannisProkopiou/gsoc2022-Label-buddy.

LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance

Audio-video question answering (AVQA) systems for music show signs of multimodal "understanding", but it is unclear which inputs they rely on or whether their behavior reflects genuine audio-video reasoning. Existing evaluations focus on overall accuracy and rarely examine modality dependence. We address this gap by suggesting a method of using counterfactual evaluations to analyse the audio-video understanding of the models, illustrated with a case study on the audio-video spatial-temporal (AVST) architecture. This includes interventions that zero out or swap audio, video, or both, where results are benchmarked against a baseline based on linguistic patterns alone. Results show stronger reliance on audio than video, yet performance persists when either modality is removed, indicating learned cross-modal representations. The AVQA system studied thus exhibits non-trivial multimodal integration, though its "understanding" remains uneven.
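The counterfactual interventions mentioned in this abstract (zeroing out or swapping a modality and re-measuring accuracy) can be sketched as follows. Everything here is a stand-in: `toy_model` is a hand-written rule that weights audio more heavily than video, not the AVST architecture, and the four samples are invented so that the effect of each intervention is easy to follow.

```python
def toy_model(audio, video):
    """Stand-in for an AVQA model: a fixed rule that favours audio over video."""
    score = 2.0 * sum(audio) + 1.0 * sum(video)
    return int(score > 1.5)


# Hypothetical samples with one-dimensional "features" per modality.
samples = [
    {"audio": [1.0], "video": [1.0], "label": 1},
    {"audio": [0.0], "video": [0.0], "label": 0},
    {"audio": [1.0], "video": [0.0], "label": 1},
    {"audio": [0.0], "video": [1.0], "label": 0},
]


def zero_audio(data, i):
    return [0.0] * len(data[i]["audio"]), data[i]["video"]


def zero_video(data, i):
    return data[i]["audio"], [0.0] * len(data[i]["video"])


def swap_audio(data, i):
    # Deterministic swap: take the audio of the next sample (wrapping around).
    donor = data[(i + 1) % len(data)]
    return donor["audio"], data[i]["video"]


def accuracy(data, intervene=None):
    """Accuracy of toy_model, optionally after a counterfactual intervention."""
    correct = 0
    for i, sample in enumerate(data):
        audio, video = sample["audio"], sample["video"]
        if intervene is not None:
            audio, video = intervene(data, i)
        correct += int(toy_model(audio, video) == sample["label"])
    return correct / len(data)


results = {
    "original": accuracy(samples),
    "zero_audio": accuracy(samples, zero_audio),
    "zero_video": accuracy(samples, zero_video),
    "swap_audio": accuracy(samples, swap_audio),
}
print(results)
```

Because the toy rule leans on audio, removing or swapping audio hurts accuracy while removing video does not; a real study would compare such drops against a language-only baseline, as the abstract describes.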

Stochastic Parrots or True Virtuosos? Digging Deeper Into the Audio-Video Understanding of AVQA Models

Although annotated music descriptor datasets for user queries are increasingly common, few consider the user’s intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.

Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries

Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. Despite growing interest, the practical effectiveness of adapting instruction-tuned LLMs to symbolic music remains insufficiently characterized. We present a controlled comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants and a music-specialized LLM baseline. Across multiple symbolic music corpora and evaluation signals, we provide some insights into adaptation choices for symbolic music applications. We highlight the tradeoff between domain adaptation and preserving prior information, as well as the distinct behaviour of the metrics used to measure domain adaptation for symbolic music.
