Morocco

Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature–based and concrete exemplar–based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item–specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs the models of human language acquisition that LMs may serve as an existence proof for.

EACL 2026 Main Conference

Humans and transformer LMs: Abstraction drives language learning

technical paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, in Morocco, in March 24-29, 2026. EACL is the flagship European conference of the Association and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time EACL is being hosted in the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*<br><br>

<html><button style="display: inline-flex; align-items: center; justify-content: center; white-space: nowrap; border-radius: 9999px; font-weight: bold; background: #7c3aed; color: white; font-family: 'Space Grotesk', sans-serif; height: 40px; font-size: 16px; padding: 0 20px; border: none; cursor: pointer" onclick="window.open('https://underline.io/events/522/reception','_blank')">Go to Workshops and Tutorials Program</button></html>
<br><br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to EACL 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://2026.eacl.org/registration/) first.

**Online Registration Form**: https://acl.swoogo.com/eacl2026

Registration Required

Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

The legal systems have a hierarchical structure, and a higher-level law often authorizes a lower-level law to implement detailed provisions, which is called delegation. When interpreting legal texts with delegation, readers must repeatedly consult the lower-level laws that stipulate the detailed provisions, imposing a substantial workload. Therefore, it is necessary to develop a system that enables readers to instantly refer to relevant laws in delegation. However, manually annotating delegation is difficult because it requires extensive legal expertise, careful reading of numerous legal texts, and continuous adaptation to newly enacted laws. In this study, we focus on Japanese law and develop a two-stage pipeline system for automatic delegation annotation. First, we extract keywords that indicate delegation using a named entity recognition approach. Second, we identify the delegated provision corresponding to each keyword as an entity disambiguation task. In our experiments, the proposed system demonstrates sufficient performance to assist manual annotation in practice.

Toward Automatic Delegation Extraction in Japanese Law

In a rapidly evolving labor market, detecting and addressing emerging skill needs is essential for shaping responsive education and workforce policies. Online job advertisements (OJAs) provide a real-time view of changing demands, but require first retrieving skill mentions from unstructured text and then solving the entity linking problem of connecting them to standardized skill taxonomies. To harness this potential, we present a multilingual human-in-the-loop (HITL) pipeline that operates in two steps: candidate skills are extracted from national OJA corpora using country-specific word embeddings, capturing terms that reflect each country’s labor market. These candidates are linked to ESCO using an encoder-based system and refined through a decoder large language models (LLMs) for accurate contextual alignment. Our approach is validated through both quantitative and qualitative evaluations, demonstrating that our method enables timely, multilingual monitoring of emerging skills, supporting agile policy-making and targeted training initiatives.

SkiLLens: Recognising and Mapping Novel Skills from Millions of Job Ads Across Europe Using Language Models

There has been considerable interest in using surprisal from Transformer-based language models (LMs) as predictors of human sentence processing difficulty. Recent work has observed an inverse scaling relationship between Transformers' per-word estimated probability and the predictive power of their surprisal estimates on reading times, showing that LMs with more parameters and trained on more data are less predictive of human reading times. However, these studies focused on predicting latency-based measures. Tests on brain imaging data have not shown a trend in any direction when using a relatively small set of LMs, leaving open the possibility that the inverse scaling phenomenon is constrained to latency data. This study therefore conducted a more comprehensive evaluation using surprisal estimates from 17 pre-trained LMs across three different LM families on two functional magnetic resonance imaging (fMRI) datasets. Results show that the inverse scaling relationship between models' per-word estimated probability and model fit on both datasets still obtains, resolving the inconclusive results of previous work and indicating that this trend is not specific to latency-based measures.

Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly

Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.

Form and Meaning in Intrinsic Multilingual Evaluations

Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.

What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge

Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose NeuronMoE, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge. Our approach generalizes across architectures, as validated on Qwen, and shows that allocation strategy matters more than total expert count.

NeuronMoE: Efficient Cross-Lingual Extension via Neuron-Guided Mixture-of-Experts

Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases—crucial in real-world applications where users express the same intent in varied ways—remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA—a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing—even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.

SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

In this work, we propose an entity type-specific and knowledge-augmented token classification framework designed to improve encoder models' performance on recipe texts. Our empirical analysis shows that this approach achieves state-of-the-art (SOTA) results on 5 out of 7 benchmark recipe datasets, significantly outperforming traditional token classification methods. We introduce a novel methodology leveraging curated domain-specific knowledge contexts to guide encoder models such as BERT and RoBERTa, which we refer to as RecipeBERT-KA and RecipeRoBERTa-KA. Additionally, we release a newly reprocessed entity type-specific and knowledge-enriched dataset that merges seven widely used food datasets, making it the largest annotated food-related dataset to date. Comparative analysis with SOTA large language models (GPT-4o, Mistral-7B, LLaMA 3-13B and LLaMA 3-70B) highlights the practical advantages of our smaller and specialised models. Finally, we analyse the impact of the different knowledge contexts, our models' potential for transfer learning, the effect of combining the datasets and scenarios where traditional token classification may still perform competitively, offering nuanced insight into method selection.

Knowledge Augmentation Enhances Token Classification for Recipe Understanding

Counterspeech generation has emerged as a promising approach to combat online hate speech, with recent work focusing on controlling attributes used in counterspeech, such as strategies or intents. While these attributes are often evaluated automatically using classifiers, a key goal of this evaluation is to compare the performance of different generation models. However, the validity of such evaluation results is questionable when the classifiers themselves have only modest performance. This paper examines the automatic evaluation of counterspeech attributes using a multi-attribute counterspeech dataset containing 2,728 samples. We investigate when automatic evaluation can be trusted for model comparison and address the limitations of current evaluation methodologies. We make concrete recommendations for how to perform classifier validation before model evaluation. Our classifier validation results demonstrate that even limited classifiers can produce trustworthy model rankings. Therefore, we argue that when comparing counterspeech generation models, a classifier's ability to rank generation models is a more direct measure of its practical utility than traditional classification metrics, e.g., accuracy and F1.

Validating Automatic Evaluation of Controllable Counterspeech Generation: Rankings Matter More Than Scores

How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain, research dedicated to taxation remains scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect the real-world complexities, or not released as open-source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT comprises 300 examples: (1) 100 binary-choice questions, (2) 100 multiple-choice questions, and (3) 100 essay-type questions, all derived from 100 Korean court precedents. PLAT is constructed to evaluate not only LLMs' understanding of tax law but also their performance in legal cases that require complex reasoning beyond straightforward application of statutes. Our systematic experiments with multiple LLMs reveal that (1) their baseline capabilities are limited, especially in cases involving conflicting issues that require a comprehensive understanding (not only of the statutes but also of the taxpayer's circumstances), and (2) LLMs struggle particularly with the "AC" stages of "IRAC" even for advanced reasoning models like o3, which actively employ inference-time scaling.

Premium content

Downloads

Next from EACL 2026 Main Conference

Toward Automatic Delegation Extraction in Japanese Law

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES