Morocco

Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks. The code will be made publicly available.

EACL 2026 Main Conference

See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval

poster

#### *Message from the General Chair, Aline Villavicencio*
I’m delighted and honoured to welcome you to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), taking place in the beautiful city of Rabat, in Morocco, in March 24-29, 2026. EACL is the flagship European conference of the Association and EACL 2026 proudly continues our field’s tradition of excellence in scholarship, innovation, and inclusivity. I am deeply grateful to the many volunteers whose dedication, generosity, and tireless efforts have made this conference possible.
For the first time EACL is being hosted in the African continent. This is an important milestone for our community, and we are grateful to our Moroccan hosts for enabling this historic moment by bringing this edition of EACL to Rabat. We are also delighted that the Second Arabic NLP School is co-located with EACL. We hope attendees enjoy this wonderful opportunity to strengthen ties with the Computational Linguistics communities across the African continent. *[Read full message](https://drive.google.com/file/d/14NlmHvwM6fPJuMmOvVh7K0vtQbEyv3SZ/view?usp=sharing)*<br><br>

<html><button style="display: inline-flex; align-items: center; justify-content: center; white-space: nowrap; border-radius: 9999px; font-weight: bold; background: #7c3aed; color: white; font-family: 'Space Grotesk', sans-serif; height: 40px; font-size: 16px; padding: 0 20px; border: none; cursor: pointer" onclick="window.open('https://underline.io/events/522/reception','_blank')">Go to Workshops and Tutorials Program</button></html>
<br><br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to EACL 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://2026.eacl.org/registration/) first.

**Online Registration Form**: https://acl.swoogo.com/eacl2026

Registration Required

Welcome to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). EACL 2026 will be held in Rabat, Morocco, from March 24–29, 2026. 

In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.

Modeling Turn-Taking with Semantically Informed Gestures

Reasoning in language models is difficult to evaluate: natural-language traces are unverifiable, symbolic datasets too small, and most benchmarks conflate heuristics with inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging and comprehensive diagnostic tasks---masked operation prediction and step completion---that directly probe syntactic awareness and process fidelity. FOL-Traces serves as a scalable testbed for rigorously studying how models perform structured logical inference. Systematic experiments with 5 reasoning LLMs show that the dataset remains challenging: models only reach around 45.7% accuracy on masked operation prediction and around 27% on two-step completion.

FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale

Large language models suffer from positional biases like the "Lost in the Middle'" (LiM) phenomenon and recency bias, which undermine the performance of long-context retrieval augmented generation. In this work, we investigate the role of rotary position embeddings (RoPE) in this context. Our empirical study confirms the persistence of these biases in modern language models. Drawing on these findings, we introduce Caliope, a training-free calibration framework for modifying RoPE inputs at inference time. Our calibrators yield substantial improvements in specific model-task configurations, revealing complex performance trade-offs between simple needle-in-a-haystack experiments and more demanding cross-chunk reasoning tasks. This work offers a practical mitigation framework and key insights into the nature of positional biases.

Can Calibration of Positional Encodings Enhance Long Context Utilization?

Modern language models excel at factual reasoning but struggle with value diversity: the multiplicity of plausible human perspectives. Tasks such as hate speech or sexism detection expose this limitation, where human disagreement captures the diversity of perspectives that models need to account for, rather than dataset noise. In this paper, we explore whether multi-perspective in-context learning (ICL) can align large language models (LLMs) with this diversity without parameter updates. We evaluate four LLMs on five datasets across three languages (English, Arabic, Italian), considering three label-space representations (aggregated hard, disaggregated hard, and disaggregated soft) and five demonstration selection and ordering strategies. Our multi-perspective approach outperforms standard prompting on aggregated English labels, while disaggregated soft predictions better align with human judgments in Arabic and Italian datasets.These findings highlight the importance of perspective-aware LLMs for reducing bias and polarization, while also revealing the challenges of applying ICL to socially sensitive tasks. We further probe the model faithfulness using XAI, offering insights into how LLMs handle human disagreement.

Seeing All Sides: Multi-Perspective In-Context Learning for Subjective NLP

No. Whilst Multimodal Large Language Models (MLLMs) have been shown to perform very well on general video data, we systematically show that their performance on movies lags behind. This is surprising as MLLMs are increasingly used for movie understanding. To measure the performance of MLLMs on movies, we explore three pillars of movie mastery: movie knowledge, cinematographic knowledge, and critical analysis. Through a combination of quantitative and in-depth qualitative evaluations we identify where MLLMs show promise and, in particular, where they fail. Our findings show that in small-scale settings involving factual knowledge, MLLMs are able to outperform existing methods. However, once cinematographic and critical analysis is required, MLLMs are insufficiently able to extract meaningful information from the visual modality to be able to provide useful insights.

Are Multimodal LLMs Movie Buffs?

Graphical User Interface (GUI) grounding is critical for effective GUI agents. Despite recent progress, key challenges remain: 1) existing grounding models and benchmarks are skewed toward web and mobile environments, neglecting desktop interfaces (especially windows); and 2) grounding capability is assessed using accuracy on a single "best" instruction per UI element. However, users can refer to a UI element in diverse valid ways -- via visual attributes, spatial relations, etc, and a capable grounding model should produce consistent outputs across such variations. Focusing on desktop environments, we introduce GUI Grounding Sensitivity Benchmark, which investigates the model sensitivity to multiple descriptions of the same UI element. We design an automatic pipeline to generate multiple valid instructions per UI element, and develop nuanced data validation methods, as frontier models even hallucinate to produce a single instruction. Evaluation of 12 models reveals they are reasonably sensitive and their performance on existing benchmarks does not reflect their true ability. Building on the insight that a given grounding model struggles more with certain instructions or relations, we introduce the GUI Grounding Diagnosis Agent, which generates challenging instructions using model feedback and iterative refinement. Our agent reports high success rate (upto 84%) in generating instructions that fail the state-of-the-art GUI grounding models.

Do GUI Grounders Truly Understand UI Elements?

While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we explore the benefits of incorporating entity-related information, such as entity names, spatial coordinates, and visual content, through supervised fine-tuning to enhance the modelХs reasoning abilities. Our experiments demonstrate that interleaving textual explanations with visual tokens extracted from entities relevant to the question improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that this retrieval-based approach effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.

DRIVINGVQA: A Dataset for Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios

Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 292 tasks on 112 interactive dashboards, encompassing 405 question answer pairs overall. These questions span five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task overall for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark's significant difficulty. We release DashboardQA at ..

DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Large Language Models (LLMs) excel at mathematical reasoning in English, but their performance in low-resource languages remains underexplored. This gap is particularly critical for Indonesia, where equitable access to AI-powered education depends on robust multilingual reasoning across diverse local languages. We introduce MATH-IDN, a multilingual benchmark for mathematical problem solving in Indonesian, Javanese, Sundanese, and Buginese, with English as reference. Each subset contains expert-verified problems spanning seven subjects and five difficulty levels. We evaluate eight open-source LLM families, including math-specialized, Southeast-Asian-adapted, and general-purpose models, under a zero-shot chain-of-thought setting. Results show that MATH-IDN presents a challenging and discriminative benchmark, revealing substantial performance gaps in low-resource languagesвЂ”particularly for BugineseвЂ”and highlighting key limitations in current multilingual reasoning capabilities. Anonymous link to the data repository: https://anonymous.4open.science/r/MATH_IDN-EA85

Premium content

Downloads

Next from EACL 2026 Main Conference

Modeling Turn-Taking with Semantically Informed Gestures

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES