
Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers: this tests technical question answering, is directly relevant to real scientists, and builds capabilities that will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training; they are available on Hugging Face. Finally, our data creation methods are scalable and easily extendable to other scientific domains.

EACL 2026 Main Conference

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

poster
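The PaperSearchQA abstract above says RLVR supervises only final-answer accuracy. As a minimal sketch of what such a verifiable reward could look like, assuming exact match after light normalization (the abstract does not specify the matching rule, and the names `normalize_answer` and `exact_match_reward` are hypothetical):

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Assumed reward rule: 1.0 if the prediction matches any gold answer, else 0.0."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0


if __name__ == "__main__":
    # Illustrative strings only; not taken from the dataset.
    print(exact_match_reward("The BRCA1 gene.", ["BRCA1 gene"]))
    print(exact_match_reward("TP53", ["BRCA1 gene"]))
```

A binary reward of this kind needs no learned judge, which is what makes it "verifiable"; factoid answers with short canonical forms suit it well.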

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, Morocco, from March 24 to 29, 2026. EACL is the flagship European conference of the Association, and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time, EACL is being hosted on the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*



Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: (i) content extraction from submissions, (ii) retrieval and synthesis of related work, and (iii) structured comparison for evidence-based assessment. Our method is informed by analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. It produces detailed, literature-aware analysis and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. The data and the code are available at https://ukplab.github.io/eacl2026-assessing-paper-novelty/

Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

Large Vision-Language Models (LVLMs), trained by aligning visual encoders to LLMs on extensive vision-language data, demonstrate impressive performance across a broad variety of tasks that require understanding of both visual and textual inputs. Acknowledging this, recent work proposed to post-hoc convert generative LVLMs into vision-language encoders (VLEs) via supervised contrastive learning objectives. This type of training enables LVLMs to produce better representations, i.e., embeddings for image and text input, used in retrieval and (semantic) similarity tasks. Having observed that this type of VLEs (i.e., LVLMs turned into encoders) commonly employ last-token pooling in downstream tasks, without using special sequence-end tokens, in this focused contribution, we study the effect of pooling strategies on VLEs' downstream performance. We empirically show that, in contrast to mean pooling, last-token pooling (without special sequence-end tokens) makes VLEs highly sensitive to end-of-input artifacts in fine-tuning and inference data, e.g., whether input sequences end with punctuation or newline characters. Finally, we show that introducing the special end-of-sequence token removes this sensitivity and makes VLEs robust to formatting artifacts of input text.

Mind Your Special Tokens! On the Importance of Dedicated Sequence-End Tokens in Vision-Language Embedding Models
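To make the pooling contrast in the abstract above concrete, here is a toy sketch of mean pooling versus last-token pooling over per-token hidden states. The vectors are made up for illustration, and real hidden states are contextual, so this only shows why the last position matters so much more under last-token pooling; it does not reproduce the paper's experiments.

```python
def mean_pool(token_states):
    """Average the per-token vectors into a single embedding."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(vec[i] for vec in token_states) / n for i in range(dim)]


def last_token_pool(token_states):
    """Use the final token's vector as the embedding."""
    return list(token_states[-1])


if __name__ == "__main__":
    # Hypothetical hidden states for a three-token input.
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Same input with a trailing artifact token (e.g. punctuation or a newline).
    states_with_artifact = states + [[0.0, 0.0]]

    print(mean_pool(states), mean_pool(states_with_artifact))
    print(last_token_pool(states), last_token_pool(states_with_artifact))
```

In this toy setup the extra trailing token shifts the mean-pooled embedding only partially, while the last-token embedding is replaced outright. A dedicated end-of-sequence token would make the final position the same kind of token regardless of how the text ends.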

Large language models (LLMs) can be benchmark-contaminated, which produces inflated scores that mask memorization as generalization. In a multilingual setting, memorization has been shown to be able to transfer to "uncontaminated" languages. Using the FLORES-200 machine translation benchmark as a diagnostic, we study three 7-8B instruction-tuned multilingual LLMs. Using Llama and Qwen's BLEU and COMET scores as a control, we confirm Bloomz's FLORES contamination. We then demonstrate that machine translation contamination happens cross-lingually and is driven by target-side memorization, artificially boosting performance in translating unseen input. Further analysis shows that despite our source paraphrasing or perturbation efforts, recall of memorized references often persists. We discover that this is strongly anchored to source-side named entities; randomizing these sharply reduces recall of memorized texts. This provides insights into potential ways to cleanly benchmark contaminated LLMs.

When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform *off-the-shelf* LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing.

Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue

Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I models to particular input languages: when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel evaluation measure and analyze how the tendencies we identify manifest visually. We show that all tested models exhibit strong surface tendencies in at least three languages, and that this effect intensifies throughout the layers of T2Is' text encoders. Furthermore, strong surface tendencies often directly relate to stereotypical depictions and are reflected in distinct color profiles.

SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation

Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.

Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
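The abstract above describes PRMs as grading candidate solutions step by step and selecting answers via aggregated step scores. A minimal sketch of that selection step follows; the aggregation rules (`min`, `prod`, `mean`), the dictionary layout, and the scores are illustrative assumptions, not the paper's setup.

```python
import math


def aggregate(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def select_answer(candidates, how="min"):
    """Return the answer of the candidate with the highest aggregated score."""
    best = max(candidates, key=lambda c: aggregate(c["step_scores"], how))
    return best["answer"]


if __name__ == "__main__":
    # Hypothetical step scores for two candidate solutions.
    candidates = [
        {"answer": "42", "step_scores": [0.99, 0.5, 0.99]},
        {"answer": "17", "step_scores": [0.7, 0.8, 0.75]},
    ]
    print(select_answer(candidates, how="min"))
    print(select_answer(candidates, how="mean"))
```

With these made-up scores the two rules disagree: `min` penalizes the single weak step in the first candidate, while `mean` averages it away. That sensitivity is one reason step-level verification quality and final answer accuracy need not move together.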

This paper investigates gender biases exhibited by LLM-based virtual assistants when providing educational recommendations, focusing on minimal gender indicators. Experimenting on Italian, a language with grammatical gender, we demonstrate that simply changing noun and adjective endings (e.g., from masculine "-o" to feminine "-a") significantly shifts recommendations. More specifically, we find that LLMs i) recommend STEM disciplines less for prompts with feminine grammatical gender and ii) narrow down the set of disciplines recommended to prompts with masculine grammatical gender; these effects persist across multiple commercial LLMs (from OpenAI, Anthropic, and Google). We show that grammatical gender cues alone trigger substantial distributional shifts in educational recommendations, and up to 76% of the bias exhibited when using prompts with proper names is already present with grammatical gender markers alone. Our findings highlight the need for robust bias evaluation and mitigation strategies before deploying LLM-based virtual assistants in student-facing contexts and the risks of using general purpose LLMs for educational applications, especially in languages with grammatical gender.

Beyond Names: How Grammatical Gender Markers Bias LLM-based Educational Recommendations

In-context learning (ICL) in large language models (LLMs) has been shown to operate through task vectors, i.e., representations that summarize the mapping induced by in-context demonstrations and can be composed by simple arithmetic operations. While this phenomenon is well studied in LLMs, its extension to vision-language models (VLMs) remains underexplored. In this work, we systematically examine the additive compositionality of in-context task vectors in VLMs, extracted from text-side hidden representations. Specifically, we construct compositional visual reasoning tasks with clearly defined subtasks and extract task vectors from few-shot demonstrations. Empirical experiments show that the vector for a complex task can be approximated by adding the vectors of its constituent subtasks. Beyond this, we analyze token-level contextual embeddings and show that additive composition arises because complex-task representations emerge as the superposition of atomic subtask components, preserving semantic structure within the model's activation space.

On the Additive Compositionality of Task Vectors in Vision-Language Models
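The additive claim in the abstract above (the vector for a complex task is approximated by the sum of its subtask vectors) can be checked with a simple similarity test. The sketch below assumes cosine similarity as the comparison measure and uses made-up vectors; how the paper extracts task vectors and measures the approximation is not specified here.

```python
import math


def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical task vectors for two subtasks and their composition.
    v_subtask_a = [1.0, 0.0, 0.2]
    v_subtask_b = [0.0, 1.0, 0.1]
    v_complex = [0.9, 1.1, 0.3]

    composed = add_vectors(v_subtask_a, v_subtask_b)
    print(cosine_similarity(composed, v_complex))
```

A similarity close to 1 would indicate that the composed vector points in nearly the same direction as the complex-task vector.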

Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity—qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.

Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
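The abstract above combines three reward signals (textual accuracy, code validity, visualization quality) and trains with GRPO. A minimal sketch of the two pieces follows, assuming a weighted sum for the combined reward (the actual combination rule and weights are not given in the abstract) and the standard group-relative normalization used in GRPO, where each sampled output's reward is standardized against the other samples for the same prompt.

```python
from statistics import mean, pstdev


def combined_reward(text_acc, code_valid, vis_quality, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: weighted sum of the three objective scores."""
    w_text, w_code, w_vis = weights
    return w_text * text_acc + w_code * code_valid + w_vis * vis_quality


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical (text_acc, code_valid, vis_quality) scores for four samples.
    group = [(1.0, 1.0, 0.8), (0.0, 1.0, 0.4), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    rewards = [combined_reward(*scores) for scores in group]
    print(group_relative_advantages(rewards))
```

Because the visualization score is only available after the generated code runs, a reward of this shape can carry post-execution feedback that a token-level SFT loss cannot.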

In science, promotional language ('hype') is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. In this paper, we introduce the task of automatic detection of hype, which we define as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research. We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our findings highlight the linguistic complexity of the task and the potential need for domain knowledge. While some linguistic works address hype detection, to the best of our knowledge, we are the first to approach it as a natural language processing task.
