
We propose a unified framework not only for attributing synthetic speech to its source but also for detecting speech generated by synthesizers that were not encountered during training. This requires methods that move beyond simple detection to support both detailed forensic analysis and open-set generalization. To address this, we introduce SIGNAL, a hybrid framework that combines speech foundation models (SFMs) with graph-based modeling and open-set-aware inference. Our framework integrates Graph Neural Networks (GNNs) and a k-Nearest Neighbor (KNN) classifier, allowing it to capture meaningful relationships between utterances and recognize speech that doesn't belong to any known generator. It builds a dynamic graph where utterances are connected based on how closely their learned representations align, enabling the GNN to capture subtle relationships across samples. Meanwhile, the KNN branch handles open-set detection through confidence-based thresholding. We evaluate SIGNAL using the DiffSSD dataset, which offers a diverse mix of real speech and synthetic audio from both open-source and commercial diffusion-based TTS systems. To further assess generalization, we also test on the SingFake benchmark. Our results show that SIGNAL consistently improves performance across both tasks, with Mamba-based embeddings delivering especially strong results. To the best of our knowledge, this is the first study to unify graph-based learning and open-set detection for tracing synthetic speech back to its origin.
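The two ideas named in this abstract, a similarity graph over utterance embeddings and KNN with confidence-based thresholding, can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine metric, the function names `knn_edges` and `knn_open_set`, the confidence definition, and the threshold value are all assumptions made for illustration, and the GNN that would consume the graph is omitted.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_edges(embeddings, k):
    """Connect each utterance to its k most similar utterances (directed edges)."""
    edges = []
    for i, u in enumerate(embeddings):
        sims = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        sims.sort(reverse=True)
        edges.extend((i, j) for _, j in sims[:k])
    return edges

def knn_open_set(query, train_embeddings, train_labels, k=3, threshold=0.8):
    """Vote among the k nearest training utterances.

    Assumed confidence: summed similarity of the neighbours that voted for
    the winning label, divided by k. Below the threshold, the query is
    treated as coming from an unseen generator.
    """
    neighbours = sorted(
        ((cosine(query, v), y) for v, y in zip(train_embeddings, train_labels)),
        reverse=True,
    )[:k]
    label, _ = Counter(y for _, y in neighbours).most_common(1)[0]
    confidence = sum(s for s, y in neighbours if y == label) / k
    return label if confidence >= threshold else "unknown"

# Toy 2-D embeddings for two hypothetical generators.
train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["tts_a", "tts_a", "tts_b", "tts_b"]
graph = knn_edges(train, k=1)
print(graph)
print(knn_open_set([1.0, 0.05], train, labels, k=2))
print(knn_open_set([-1.0, -0.2], train, labels, k=2))
```

In this toy setup the first query sits close to the `tts_a` cluster and is attributed to it, while the second query is dissimilar to every training utterance and falls below the threshold.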

EACL 2026 Main Conference

Bridging Attribution and Open-Set Detection using Graph-Augmented Instance Learning in Synthetic Speech


technical paper

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, Morocco, on March 24-29, 2026. EACL is the flagship European conference of the Association, and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time, EACL is being hosted on the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*



We recommend reading [**the registration information**](https://2026.eacl.org/registration/) first.

**Online Registration Form**: https://acl.swoogo.com/eacl2026


Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, or medical knowledge), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, or spatial reasoning) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing it with the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single-agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single-agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the distribution of expert tool calls to ensure that every tool is being reliably used to help resolve disagreement.

DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning

Reliable mathematical and scientific reasoning remains an open challenge for large vision–language models (VLMs). Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE (Transparent Reasoning And Consistency Evaluation), a framework for analyzing, diagnosing, and improving reasoning in VLMs. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question–answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS is linked to final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement.

TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models

Understanding how large language models (LLMs) process compositional linguistic structures is integral to enhancing their reliability and interpretability. We present Constituent-Aware Pooling (CAP), a methodology grounded in compositionality, mechanistic interpretability, and information theory that intervenes in model activations by pooling token representations into linguistic constituents at various layers. Experiments across eight models (124M-8B parameters) on inverse definition modelling, hypernym and synonym prediction reveal that semantic composition is not localised to specific layers but distributed across network depth. Performance degrades substantially under constituent-based pooling, particularly in early and middle layers, with larger models showing greater sensitivity. We propose an information-theoretic interpretation: transformers' training objectives incentivise deferred integration to maximise token-level throughput, resulting in fragmented rather than localised composition. These findings highlight fundamental architectural and training constraints requiring specialised approaches to encourage robust compositional processing.
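The core intervention, pooling token representations into constituents, can be sketched as follows. The abstract does not say here which pooling function is used or how constituent spans are obtained, so mean pooling, the `pool_constituents` name, and the hand-written spans are assumptions for illustration only.

```python
def pool_constituents(token_vectors, spans):
    """Replace the token vectors in each [start, end) span with their mean.

    token_vectors: list of equal-length float lists, one per token.
    spans: list of (start, end) index pairs, one per constituent.
    """
    pooled = []
    for start, end in spans:
        group = token_vectors[start:end]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled

# Three toy token activations; the first two tokens form one constituent.
activations = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
constituents = pool_constituents(activations, [(0, 2), (2, 3)])
print(constituents)
```

In an actual experiment the pooled vectors would be substituted back into the model at a chosen layer and downstream task performance compared with the unpooled run.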

Where Do LLMs Compose Meaning? A Layerwise Analysis of Compositional Robustness

We investigate how the choice of output format influences the fine-tuning performance of large language models on sequence labelling-based information extraction tasks. Across more than 280 experiments spanning multiple benchmarks, model families, and formats, we find that the output format is a critical yet often overlooked hyperparameter. In some cases, the choice of format alone can shift F1 scores by over 40% on the same dataset and model. Further, we observe that no single format consistently dominates across settings: the optimal choice depends on factors like model family, size and dataset characteristics. Overall, these results demonstrate that even informationally equivalent output formats can produce substantial performance variation, highlighting the need to treat output formatting as a key factor in building robust and reliable information extraction systems.
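To make "informationally equivalent output formats" concrete, the sketch below renders one annotation two ways: as per-token BIO tags and as a JSON list of labelled spans. These two formats and the `bio_to_spans` helper are illustrative assumptions; the abstract does not list the formats the study actually compares.

```python
import json

def bio_to_spans(tokens, tags):
    """Convert per-token BIO tags into a list of labelled text spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            spans.append({"text": " ".join(tokens[start:i]), "label": label})
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

tokens = ["Ada", "Lovelace", "visited", "Rabat"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]

# Format 1: one tag per token, as a model might emit it.
print(" ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags)))
# Format 2: the same annotation as a JSON list of spans.
spans = bio_to_spans(tokens, tags)
print(json.dumps(spans))
```

Both outputs encode the same entities, yet a fine-tuned model has to produce very different token sequences for each, which is the kind of difference the abstract reports as affecting F1.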

Lost in Formatting: How Output Formats Skew LLM Performance on Information Extraction

Recent advances in mathematical reasoning typically rely on massive scale, yet the question remains: can strong reasoning capabilities be induced in small language models (≤1.5B) under extreme constraints? We investigate this by training models on a single A40 GPU (48GB) for under 24 hours using Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA). We find that the success of this "micro-budget" regime depends critically on the interplay between adapter capacity and model initialization. While low-rank adapters (r=8) consistently fail to capture the complex optimization dynamics of reasoning, high-rank adapters (r=256) unlock significant plasticity in standard instruction-tuned models. Our best result achieved an impressive 40.0% Pass@1 on AIME 24 (an 11.1% absolute improvement over baseline) and pushed Pass@16 to 70.0%, demonstrating robust exploration capabilities. However, this plasticity is not universal: while instruction-tuned models utilized the budget to elongate their chain-of-thought and maximize reward, heavily math-aligned models suffered performance collapse, suggesting that noisy, low-budget RL updates can act as destructive interference for models already residing near a task-specific optimum.
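The capacity gap between r=8 and r=256 can be checked with simple arithmetic. LoRA adds two factors of shapes d_out × r and r × d_in to a frozen d_out × d_in weight, so the trainable parameter count per adapted matrix is r · (d_in + d_out). The 1536 × 1536 matrix below is a hypothetical size chosen for illustration, not a figure from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

D = 1536  # hypothetical hidden size, for illustration only
full = D * D
low = lora_params(D, D, rank=8)
high = lora_params(D, D, rank=256)

print(f"full matrix: {full} parameters")
print(f"r=8:   {low} trainable ({low / full:.2%} of the matrix)")
print(f"r=256: {high} trainable ({high / full:.2%} of the matrix)")
```

Under this assumption the r=256 adapter trains 32 times as many parameters as the r=8 adapter, about a third of the size of the matrix it adapts.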

Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget

The rapid adoption of Small Language Models (SLMs) for resource-constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages (biases, utility, ambiguity handling, and positional bias) over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5–5B parameter range, an overlooked “middle tier” between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings. The code is available at: https://github.com/smanduru10/Vacuous-Neutrality-Framework.git.

Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models

Retrieval-augmented generation has become the dominant paradigm for deploying large language models in knowledge-intensive applications, yet practitioners lack guidance on model selection when both quality and costs matter. We evaluate language models from 4B to 70B parameters, including the PLLuM and Bielik families of Polish LLMs, within a Polish Wikipedia-based RAG pipeline. Quality assessment uses GPT-4o pairwise comparison across 1,000 PolQA questions with bias mitigation and Bradley-Terry ranking, while energy measurements capture inference costs on NVIDIA H100 hardware. Our findings challenge conventional scaling assumptions: parameter scaling beyond 12B offers minimal quality gains, with mid-size PLLuM-12 matching 70B performance while reducing energy consumption by 83%.
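Bradley-Terry ranking turns pairwise judge verdicts into one strength score per model, where the probability that model i beats model j is modelled as p_i / (p_i + p_j). The sketch below fits those strengths with the standard iterative (minorization-maximization) update. The win counts are made up, the `bradley_terry` function is illustrative, and the sketch assumes every model has at least one win and that ties have been removed; how the paper handles ties or bias mitigation is not covered.

```python
def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p

# Made-up verdicts for three hypothetical models.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
print(strengths, ranking)
```

Sorting by the fitted strengths gives the ranking; with only two models the fitted strength reduces to the observed win rate.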

Evaluating Cost-Efficiency of LLMs in a RAG Setup on Polish Wikipedia: Quality vs. Energy Consumption

The ability of AI systems to not only answer complex natural language questions, but also transparently justify their reasoning, is crucial for building trust and enabling effective human-AI collaboration. In domains requiring multi-hop reasoning, answers must often be constructed by combining multiple relevant sentences from a knowledge base to build an inferential path from the question toward the answer. We tackle this challenge by exploring a neuro-symbolic approach to reasoning through the generation of entailment trees (structured, step-by-step proof trees) using Large Language Models (LLMs). These trees provide interpretable justifications for the inference process. Using the EntailmentBank (Dalvi et al., 2021) dataset, we evaluated a diverse set of prompting strategies across multiple models and propose an inference-guided prompting approach that performs well. We also fine-tuned LLMs specifically for proof generation, applying several data augmentation, curriculum learning, and reinforcement-guided optimization strategies. Our results show that the fine-tuned model outperforms all prompting strategies, achieving superior performance across multiple structural and semantic metrics. We also provide a detailed evaluation of which training strategies help proof generation. Our findings highlight the importance of proof tree generation as a benchmark for evaluating structured reasoning in LLMs.

From Sentences to Proof Trees: Leveraging Language Models for Structured Reasoning

We revisit MWE-aware linguistic tokenization as a character-level and token-level sequence labeling problem and present a systematic evaluation on English, German, Italian, and Dutch data. We compare a standard tokenizer trained without MWE-awareness as a baseline (UDPipe), a character-level SRN+CRF model (Elephant), and transformer-based MaChAmp models trained either directly on gold character labels or as token-level postprocessors on top of UDPipe. Our results show that the two-stage pipeline -- UDPipe pretokenization followed by MaChAmp postprocessing -- consistently yields the best accuracy. Our analysis of error patterns highlights how different architectures trade off over- and undersegmentation. These findings provide practical guidance for building MWE-aware tokenizers and suggest that postprocessing pipelines with transformers are a strong and general strategy for non-standard tokenization.

Trainable, Multiword-aware Linguistic Tokenization Using Modern Neural Networks

Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
