Morocco

When language models are trained on synthetic data, they (student model) can covertly acquire behavioral traits from the data-generating model (teacher model). &quot;Subliminal learning&quot; refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces including transmission of misaligned behaviors. 
We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher&#39;s preference can block it. 
We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student&#39;s preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike.
The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity.
This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.

EACL 2026 Main Conference

You Didn&#39;t Have to Say It Like That: Subliminal Learning from Faithful Paraphrases

When language models are trained on synthetic data, they (student model) can covertly acquire behavioral traits from the data-generating model (teacher model). "Subliminal learning" refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces including transmission of misaligned behaviors. 
We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher's preference can block it. 
We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student's preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike.
The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity.
This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.

You Didn't Have to Say It Like That: Subliminal Learning from Faithful Paraphrases

technical paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, in Morocco, in March 24-29, 2026. EACL is the flagship European conference of the Association and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time EACL is being hosted in the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*<br><br>

<html><button style="display: inline-flex; align-items: center; justify-content: center; white-space: nowrap; border-radius: 9999px; font-weight: bold; background: #7c3aed; color: white; font-family: 'Space Grotesk', sans-serif; height: 40px; font-size: 16px; padding: 0 20px; border: none; cursor: pointer" onclick="window.open('https://underline.io/events/522/reception','_blank')">Go to Workshops and Tutorials Program</button></html>
<br><br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to EACL 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://2026.eacl.org/registration/) first.

**Online Registration Form**: https://acl.swoogo.com/eacl2026

Registration Required

Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques -- Attribution Analysis, Function Vector Analysis, and Circuit Tracing -- and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations successfully bridged the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, indicating that the system effectively leveled the playing field. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.

Simplifying Outcomes of Language Model Component Analyses with ELIA

The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets designed around Human-AI teaming principles centred on workflows for verifying and curating data. SCILIRE implements an iterative workflow in which researchers can review, correct AI outputs, where this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design principles using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity, facilitates efficient dataset creation, and empowers researchers to maintain control over their curated data.

Using a Human-AI Teaming Approach to Create and Curate Scientific Dataset with the SCILIRE System (presented virtually)

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization. This tool is aimed at education stakeholders as well as *ACL community at large, as it supports learning and can also be used to collect user feedback and annotation.

AITutor-EvalKit: Exploring the Capabilities of AI Tutors (presented virtually)

Tracking financial investments in climate adaptation is complex and expertise-intensive, particularly for Early Warning Systems (EWS), where multilateral development bank (MDB) and fund reports lack standardized financial reporting and appear as heterogeneous PDFs with complex tables and inconsistent layouts.


We introduce an agent-based Retrieval-Augmented Generation (RAG) system that uses hybrid retrieval and internal chain-of-thought (CoT) reasoning to extract relevant financial data, classify EWS investments, and allocate budgets with grounding evidence spans. While these components are individually established, our contribution is their integration into a domain-specific workflow tailored to heterogeneous MDB reports and numerically grounded EWS budget allocation. On a manually annotated CREWS Fund corpus, our system outperforms four alternatives (zero-shot classifier, few-shot ``zero rule'' classifier, fine-tuned transformer-based classifier, and few-shot CoT+ICL classifier) on multi-label classification and budget allocation, achieving 87% accuracy, 89% precision, and 83% recall. We further benchmark against the Gemini 2.5 Flash AI Assistant on an expert-annotated MDB evidence set co-curated with the World Meteorological Organization (WMO), enabling a comparative analysis of glass-box agents versus black-box assistants in transparency and performance. The system is publicly deployed and accessible at https://ews-front.vercel.app/ (see Appendix A for demonstration details and Appendix B for dataset statistics and splits). We will open-source all code, LLM generations, and human annotations to support further work on AI-assisted climate finance.

AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

Translation Memory (TM) systems are core components of commercial computer-aided translation (CAT) tools. However, traditional fuzzy matching methods often fail to retrieve semantically relevant content when surface similarity is low. We introduce SmartMatch, an open-source semantic retrieval system for TM that connects modern sentence encoders with a vector database and exposes the full retrieval pipeline through an interactive web-based User Interface (UI). The demo allows users to (i) enter a query segment, (ii) switch between embedding backends, and (iii) inspect the top retrieved matches with similarity scores and qualitative cues, and also observe end-to-end latency in real time. We provide a reproducible benchmark on multilingual TM data, reporting retrieval quality using reference-based MT metrics (COMET, BERTScore, METEOR, chrF), together with coverage and latency/throughput trade-offs relevant to production CAT workflows. On DGT-TM, encoder-based retrieval achieves full coverage (100%) with millisecond-level latency (p50/p90 $\leq$ 6--20\,ms) and attains the strongest semantic-quality scores on the shared query set (e.g., BERTScore up to 0.91 at $k{=}10$), while BM25 remains a strong lightweight lexical baseline with very low latency. Overall, SmartMatch bridges recent advances in sentence encoders with the real-time constraints of translation memory retrieval

SmartMatch: Real-Time Semantic Retrieval for Translation Memory Systems

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact–abstract similarity (F 1), and low-similarity fact counts (F 2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

RAGPPI: Retrieval-Augmented Generation Benchmark for ProteinвЂ“Protein Interactions in Drug Discovery

Policy-gradient reinforcement learning (PGRL) form the backbone of current methods used to improve language model reasoning. However, these methods are incompatible with diffusion based language models (dLLMs). Most attempts to apply PGRL to dLLMs, either are extremely unscalable or use unprincipled approximations. Our proposed framework uses a novel pseudo-likelihood based objective for alignment of dLLMs. Our objective has the same optima as PGRL based optimization, but does not need to evaluate exact likelihood from dLLMs, thus providing a principled and practical alternative for fine-tuning of dLLMs. Additionally other than the pseudo-likelihood based objective, we present two computationally efficient approximations to allow for scalable training. Experiments on multiple reasoning and coding benchmarks show that our proposals match or surpasses the performance of GRPO and related baselines.

Pseudo-Likelihood Training for Reasoning Diffusion Language Models

Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Paradoxically, MLLMs sometimes produce correct answers while misinterpreting crucial visual elements, masking these underlying perception failures. Our preliminary analysis on a joint perception-reasoning dataset revealed that 29% of correct reasoning answers from a leading MLLM contained perception errors. To systematically study visual perception abilities of MLLMs, we introduce Do You See Me- a scalable, programmatically generated benchmark with 1758 images and 2612 questions across seven core subtasks spanning 2D and 3D variants (twelve total tasks) providing parametric control over difficulty levels. The benchmark tasks are inspired by human psychology. Our evaluation of eleven leading MLLMs reveals a stark deficit: humans achieve 95.83% accuracy, while top MLLMs average below 50%. This performance gap widens drastically as task complexity increases. Further diagnostics show: (1) supervised finetuning offers only modest gains (11%), (2) models tend to exploit task “shortcuts” like MCQ formats over detailed visual analysis, and (3) Chain-of-Thought prompting can degrade complex visual tasks by verbalizing images into lossy text. These findings expose the foundational perception limits in current MLLMs and highlight the need for robust visual perception improvements in MLLMs. The benchmark dataset, source code and evaluation scripts are available at[<https://github.com/microsoft/Do-You-See-Me>].

Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Contract compliance verification requires reasoning about cross-clause dependencies where obligations, exceptions, and conditions interact across multiple provisions, yet existing legal NLP benchmarks like ContractNLI and CUAD focus exclusively on isolated single-clause tasks. We introduce COMPACT (COMpliance PAralegals via Clause graph reasoning over conTracts), a framework that models cross-clause dependencies through structured clause graphs. Our approach extracts deontic-temporal entities from clauses and constructs typed relationship graphs capturing definitional dependencies, exception hierarchies, and temporal sequences. From these graphs, we introduce ACE (Assessing Compliance in Enterprise)- a benchmark containing 4,700 carefully constructed compliance scenarios derived from 633 real-world contracts covering 26 types of agreements. Each scenario requires multi-hop reasoning across multiple clauses, and undergoes independent LLM-based validation to ensure quality. Evaluation reveals that multi-clause reasoning poses a fundamental challenge for state-of-the-art models (34-57% base accuracy), while training on ACE yields substantial improvements on compliance tasks (+22–43 % points) and also enhances general legal reasoning performance on other benchmarks (PrivaCI-Bench, ContractNLI).

Next from EACL 2026 Main Conference

Simplifying Outcomes of Language Model Component Analyses with ELIA

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES