LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, developer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20–30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full open-source implementation.

EACL 2026 Main Conference

Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents

technical paper
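
The feedback loop described in the abstract (judge scores steering later tests toward the weakest capabilities, with difficulty adapting along the way) can be pictured with a minimal sketch. This is not the released ATA implementation: the function names, the 1 to 5 difficulty scale, the 0 to 1 score range, and the thresholds are all assumptions made for illustration.

```python
from statistics import mean

def pick_weakest(scores: dict[str, list[float]]) -> str:
    """Return the capability with the lowest mean judge score so far."""
    return min(scores, key=lambda cap: mean(scores[cap]))

def next_difficulty(current: int, judge_score: float, lo: int = 1, hi: int = 5) -> int:
    """Raise difficulty after a strong score, lower it after a clear failure.

    The 0.8 / 0.4 thresholds are hypothetical, not taken from the paper.
    """
    if judge_score >= 0.8:
        return min(hi, current + 1)
    if judge_score <= 0.4:
        return max(lo, current - 1)
    return current

# Hypothetical judge scores per capability, in [0, 1].
scores = {"planning": [0.9, 0.8], "retrieval": [0.5, 0.3], "writing": [0.7]}
target = pick_weakest(scores)
level = next_difficulty(3, scores[target][-1])
print(target, level)  # retrieval 2
```

In this toy run, the next test would target retrieval at a lower difficulty, since the latest judge score for that capability was a clear failure.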

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, Morocco, on March 24-29, 2026. EACL is the flagship European conference of the Association, and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time, EACL is being hosted on the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*

Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

Our system is built upon a multi-modal information extraction pipeline designed to process and interpret corporate sustainability reports. This integrated framework systematically handles diverse data formats (text, tables, figures, and infographics) to extract, structure, and evaluate ESG-related content. The extracted multi-modal data is subsequently formalized into a structured knowledge graph (KG), which serves both as a semantic framework for representing entities, relationships, and metrics relevant to ESG domains and as the foundational infrastructure for the automated compliance system. This KG enables high-precision retrieval of information across multiple source formats and reporting modalities. The trustworthy, context-rich representations provided by the knowledge graph establish a verifiable evidence base, creating a critical foundation for reliable retrieval-augmented generation (RAG) and for the subsequent LLM-based scoring and analysis performed by the automated ESG compliance system.

ESG-KG: A Multi-modal Knowledge Graph System for Automated Compliance Assessment (presented virtually)

Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that remains largely underexplored in current agentic search evaluation. As a result, many agentic search systems remain impractical despite their impressive performance. In this work, we introduce HotelQuEST, a benchmark comprising 214 hotel search queries that range from simple factual requests to complex queries, enabling evaluation across the full spectrum of query difficulty. We further address the challenge of evaluating underspecified user preferences by collecting clarifications that make annotators' implicit preferences explicit for evaluation. We find that LLM-based agents achieve higher accuracy than traditional retrievers, but at substantially higher costs due to redundant tool calls and suboptimal routing that fails to match query complexity to model capability. Our analysis exposes inefficiencies in current agentic search systems and demonstrates substantial potential for cost-aware optimization.

HotelQuEST: Balancing Quality and Efficiency in Agentic Search

Predicting hospital readmissions is a critical clinical task with substantial implications for patient outcomes and healthcare cost management. We propose DisGraph-RP, a graph-augmented temporal modeling framework that integrates structured discourse-aware text representation with cross-admission relational reasoning. Our approach introduces a Section-Aware Contrastive Encoder that leverages section segmentation and aspect-based supervision to produce fine-grained representations of discharge summaries. These representations are then composed over time using a graph-based temporal module that encodes inter-visit dependencies through learned edge relations, enabling the model to capture disease progression, treatment history, and recurrent risk signals. Experiments on multiple real-world datasets demonstrate that DisGraph-RP achieves significant improvements over strong baselines, including transformer-based clinical models and prompting-based LLM approaches. Our findings highlight the importance of combining discourse-informed text encoding with temporal graph reasoning for robust clinical outcome prediction.

industry

DisGraph-RP: Graph-Augmented Temporal Modeling with Aspect-Based Contrastive Encoding of Discharge Summary for Readmission Prediction

Automatic Speech Recognition (ASR) systems for low-resource languages like Hindi often produce erroneous transcripts due to limited annotated data and linguistic complexity. **Post-ASR correction** using language models (LMs) and large language models (LLMs) offers a promising approach to improve transcription quality. In this work, we compare fine-tuned LMs (mT5, ByT5), fine-tuned LLMs (Nanda 10B), and instruction-tuned LLMs (GPT-4o-mini, LLaMA variants) for post-ASR correction in Hindi. Our findings reveal that **smaller, fine-tuned models** consistently **outperform larger LLMs** in both fine-tuning and in-context learning (ICL) settings. We observe a **U-shaped inverse scaling** trend under zero-shot ICL, where mid-sized LLMs degrade performance before marginal recovery at extreme scales, yet still fall short of fine-tuned models. **ByT5 is more effective for character-level corrections** such as transliteration and word segmentation, while **mT5 handles broader semantic inconsistencies**. We also identify performance drops in out-of-domain settings and propose **mitigation strategies** to preserve domain fidelity. Notably, we observe similar trends in **Marathi and Telugu**, indicating the broader applicability of our findings across low-resource Indian languages.

Post-ASR Correction in Hindi: Comparing Language Models and Large Language Models in Low-Resource Scenarios

Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users' interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks -- including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and Big Five OCEAN traits -- GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.

GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences

Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.

Verification-Aware Planning for Multi-Agent Systems
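
The idea of attaching planner-defined passing criteria to subtasks as Python verification functions can be sketched as follows. This is an illustrative toy, not VeriMAP's actual API: the `Subtask` class, the `execute_plan` function, and the retry policy are hypothetical, and the executors are plain lambdas standing in for LLM agents.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable

@dataclass
class Subtask:
    name: str
    run: Callable[[dict], str]      # executor: receives upstream outputs, returns an output
    verify: Callable[[str], bool]   # planner-defined passing criterion (the "VF")
    deps: list[str] = field(default_factory=list)

def execute_plan(subtasks: list[Subtask], max_retries: int = 2) -> dict[str, str]:
    """Run subtasks in dependency order, retrying any whose output fails its VF."""
    by_name = {t.name: t for t in subtasks}
    order = TopologicalSorter({t.name: set(t.deps) for t in subtasks}).static_order()
    outputs: dict[str, str] = {}
    for name in order:
        task = by_name[name]
        for _ in range(max_retries + 1):
            result = task.run({d: outputs[d] for d in task.deps})
            if task.verify(result):
                outputs[name] = result
                break
        else:
            # No attempt passed verification.
            raise RuntimeError(f"subtask {name!r} failed verification")
    return outputs

plan = [
    Subtask("extract", run=lambda up: "3,4",
            verify=lambda out: all(p.isdigit() for p in out.split(","))),
    Subtask("sum", run=lambda up: str(sum(int(x) for x in up["extract"].split(","))),
            verify=lambda out: out.isdigit(), deps=["extract"]),
]
print(execute_plan(plan))  # {'extract': '3,4', 'sum': '7'}
```

Checking each output before it is handed downstream is one way to catch the format and handoff mismatches the abstract mentions without needing external labels.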

Long-context conversational agents require robust memory, but existing frameworks struggle to organize information effectively across dimensions like time and topic, leading to poor retrieval. To address this, we introduce H-Mem, a novel Hybrid Multi-Dimensional Memory architecture. H-Mem stores conversational facts in two parallel, hierarchical data structures: a temporal tree that organizes information chronologically and a semantic tree that organizes it conceptually. This dual-tree design enables a hybrid retrieval mechanism managed by an intelligent Mode Controller. Based on the query, the controller dynamically chooses between a sequential search using semantic anchors and an intersective search combining both hierarchies. Our experiments on long-context QA datasets demonstrate that H-Mem provides a more flexible approach to memory management, leading to significant improvements of over 8.4% compared to other state-of-the-art systems.

H-Mem: Hybrid Multi-Dimensional Memory Management for Long-Context Conversational Agents

Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Using humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. We further supplement this by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that even for fixed neutral setups, harmful outputs receive higher humor scores, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting intrinsic structural embedding in learned humor distributions. Experiments and human evaluation on an additional satire-generation task with human-perceived funniness judgments show that LLM funniness relies on increased stereotypicality and toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10% to 21% in mean humor score, stereotypical jokes appear 11% to 28% more often among the jokes marked funny by an LLM-based metric, and up to 10% more often in generations perceived as funny by humans.

Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models

Prompt-based adversarial attacks are a key tool for assessing the robustness of large language models (LLMs). Yet existing studies typically treat prompts as flat text, overlooking their internal structure: different components within a prompt contribute unequally to robustness. This work introduces PromptAnatomy, a framework that decomposes prompts into functional components and perturbs them selectively to expose component-wise vulnerabilities. We further propose ComPerturb, a controlled perturbation method that ensures linguistic plausibility through perplexity-based filtering. Using this framework, four instruction-tuning datasets are structurally annotated and validated by human reviewers. Experiments across five advanced LLMs show that ComPerturb achieves state-of-the-art attack success rates, while ablation analyses confirm the complementary effects of prompt dissection and perplexity filtering. These results highlight the importance of structural awareness in evaluating and improving the adversarial robustness of LLMs.

Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in LLMs

Leveraging multimodal large language models (MLLMs) to construct embodied agents represents a promising approach for addressing real-world tasks. However, current benchmarks predominantly focus on language-centric tasks or heavily rely on simulated environments, thereby restricting their capacity to evaluate performance in realistic settings. To bridge this gap, we introduce CityNav, a comprehensive benchmark encompassing four diverse global cities, explicitly designed to evaluate the decision-making capabilities of raw MLLM-driven agents in real-world navigation tasks. Specifically, agents must rely solely on visual observations and internal multimodal reasoning to sequentially execute a significant number of decisions (50+) without additional environmental annotations or specialized architectural enhancements. Extensive evaluation reveals that popular linguistic reasoning techniques (e.g., Chain-of-Thought, Self-Consistency, Reflection) fail to deliver substantial improvements in performance. To address this, we propose Verbalization of Path (VoP), a novel method that explicitly grounds the agent's internal multimodal reasoning through verbalized navigational paths, substantially enhancing navigation success. Nonetheless, overall performance in particularly challenging cities remains insufficient, underscoring the critical necessity for advanced reasoning frameworks and robust methods capable of handling complex, long-range sequential decision-making tasks. The code and dataset will be released on acceptance.
