China

Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains such as STEM and code, which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.

EMNLP 2025

3LM: Bridging Arabic, STEM, and Code through Benchmarking

stem

arabic

benchmark

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior.

Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.


The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
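The first of the two retrieval strategies named in the abstract above, enforcing equal retrieval from both languages, can be illustrated with a minimal sketch. The `Hit` record, the `balanced_top_k` function, and the final re-sorting by score are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical retriever result: document id, language tag, similarity score."""
    doc_id: str
    lang: str     # "ar" or "en"
    score: float  # higher means more similar to the query


def balanced_top_k(hits: list[Hit], k: int,
                   langs: tuple[str, ...] = ("ar", "en")) -> list[Hit]:
    """Return up to k hits, taking an equal quota from each language.

    Each language contributes its k // len(langs) best-scoring hits, so a
    retriever that systematically scores one language higher cannot crowd
    the other language out of the context window.
    """
    quota = k // len(langs)
    selected: list[Hit] = []
    for lang in langs:
        same_lang = [h for h in hits if h.lang == lang]
        same_lang.sort(key=lambda h: h.score, reverse=True)
        selected.extend(same_lang[:quota])
    # Order the merged list by score (an illustrative choice, not from the source).
    return sorted(selected, key=lambda h: h.score, reverse=True)


# Toy example: English documents outscore every Arabic document.
hits = [
    Hit("en1", "en", 0.9), Hit("en2", "en", 0.8), Hit("en3", "en", 0.7),
    Hit("en4", "en", 0.6), Hit("ar1", "ar", 0.5), Hit("ar2", "ar", 0.4),
]
print([h.doc_id for h in balanced_top_k(hits, k=4)])
```

In this toy input a plain top-4 cut would return only English documents, while the quota keeps two documents per language. The second strategy, translating the query, would instead run retrieval once per language and is not shown here.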

We present an end-to-end, self-evolving adversarial workflow for long-context question-answer (QA) generation in Arabic. By orchestrating multiple specialized Large Vision Language Models (LVLMs), namely a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries for the answer-generator swarm to tackle, and the evaluator assesses the results and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we treat the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic LVLMs. Lastly, we also design a fully automated agentic workflow for long-context Arabic document collection.

A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context QA Generation

Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both without analyzers and with their integration. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios.

Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models

Research on hallucination in large language models (LLMs) has so far focused mainly on English. Despite the growing number of multilingual and Arabic-specific LLMs, hallucination in the Arabic context remains relatively underexplored. This knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs' outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to that of reasoning-based models.

AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Prompt relevance is a critical yet underexplored dimension in Arabic Automated Essay Scoring (AES). We present the first systematic study of binary prompt-essay relevance classification, supporting both AES scoring and dataset annotation. To address data scarcity, we built a synthetic dataset of on-topic and off-topic pairs and evaluated multiple models, including threshold-based classifiers, SVMs, causal LLMs, and a fine-tuned masked SBERT model. For real-data evaluation, we combined QAES with ZAEBUC, creating off-topic pairs via mismatched prompts. We also tested prompt expansion strategies using AraVec, CAMeL, and GPT-4o. Our fine-tuned SBERT achieved 98% F1 on synthetic data and strong results on QAES+ZAEBUC, outperforming SVMs and threshold-based baselines and offering a resource-efficient alternative to LLMs. This work establishes the first benchmark for Arabic prompt relevance and provides practical strategies for low-resource AES.


Evaluating Prompt Relevance in Arabic Automatic Essay Scoring: Insights from Synthetic and Real-World Data
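The threshold-based baseline mentioned in the abstract above can be sketched in a few lines. This is a toy version under stated assumptions: it uses bag-of-words cosine similarity with whitespace tokenization, whereas the abstract mentions embedding resources such as AraVec; the function names and the threshold value of 0.2 are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def is_on_topic(prompt: str, essay: str, threshold: float = 0.2) -> bool:
    """Label an essay on-topic when its similarity to the prompt reaches the threshold.

    Whitespace tokenization is a simplification; Arabic text would normally
    need morphological normalization before comparison.
    """
    similarity = cosine(Counter(prompt.split()), Counter(essay.split()))
    return similarity >= threshold


print(is_on_topic("climate change effects", "climate change harms crops"))
print(is_on_topic("climate change effects", "football match results"))
```

Off-topic evaluation pairs of the kind the abstract describes can be produced by pairing each essay with a prompt it was not written for and checking that the classifier rejects them.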

The Holy Qur'an provides timeless guidance, addressing modern challenges and offering answers to many important questions. The Qur'an QA 2023 shared task introduced the Qur'anic Passage Retrieval (QPR) task, which involves retrieving relevant passages in response to questions written in Modern Standard Arabic (MSA). In this work, we evaluate the ability of seven large language models (LLMs) to retrieve relevant passages from the Qur'an in response to given questions, considering zero-shot and several few-shot scenarios. Our experiments show that the best model, Claude, significantly outperforms the state-of-the-art QPR model by 28 points on MAP and 38 points on MRR, a relative improvement of about 113% and 82%, respectively.

Can LLMs Directly Retrieve Passages for Answering Questions from Qur'an?
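The passage-retrieval results above are reported in MAP and MRR. As a reminder of what those numbers measure, here is a minimal sketch of the textbook definitions. The passage ids are made up, and the shared task's official scorer may differ in details (for example, in how questions with no relevant passage are handled).

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant passage appears."""
    if not relevant:
        return 0.0
    found = 0
    total = 0.0
    for rank, passage_id in enumerate(ranked, start=1):
        if passage_id in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage, or 0 if none is retrieved."""
    for rank, passage_id in enumerate(ranked, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_questions(metric, runs: list[tuple[list[str], set[str]]]) -> float:
    """Average a per-question metric over (ranking, relevant set) pairs: MAP or MRR."""
    return sum(metric(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


# Hypothetical rankings for two questions.
runs = [
    (["p3", "p1", "p7"], {"p1", "p7"}),  # AP = (1/2 + 2/3) / 2, RR = 1/2
    (["p5", "p2"], {"p5"}),              # AP = 1, RR = 1
]
print(mean_over_questions(average_precision, runs))
print(mean_over_questions(reciprocal_rank, runs))
```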

Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, which is 90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.

ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition

Addressing the need for efficient scoring beyond the time-intensive manual process, this work demonstrates that Feature Engineering is not Dead for Arabic Automated Essay Scoring (AES). We introduce a comprehensive set of 816 engineered linguistic features, inspired by the success in both English and Arabic AES, and grouped into five categories: Surface, Lexical, Semantic, Syntactic, and Readability Metrics. Our experiments on the TAQAE dataset using cross-prompt training confirm that these features are essential: they dramatically boost the performance of Hybrid models (like ProTACT and AraBERT), and models that rely on them, like the Feature-based and Hybrid categories, achieve the highest overall average performance, with Random Forest (RF) + feature selection reaching an average QWK of 0.294. This clearly establishes that engineered features remain critical for achieving state-of-the-art results in Arabic AES.

Feature Engineering is not Dead: A Step Towards State of the Art for Arabic Automated Essay Scoring
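The essay-scoring results above are reported as QWK (quadratic weighted kappa). The sketch below follows the standard definition of that metric, which penalizes disagreements by the squared distance between the two scores. The function name and the score range in the example are illustrative assumptions; the abstract does not state the TAQAE score ranges.

```python
def quadratic_weighted_kappa(human: list[int], system: list[int],
                             min_score: int, max_score: int) -> float:
    """Agreement between two integer score lists: 1 is perfect, 0 is chance level."""
    n = max_score - min_score + 1
    total = len(human)
    if n < 2 or total == 0 or total != len(system):
        raise ValueError("need two equal-length, non-empty lists and at least two score levels")

    # observed[i][j]: essays scored i by the human rater and j by the system.
    observed = [[0] * n for _ in range(n)]
    for h, s in zip(human, system):
        observed[h - min_score][s - min_score] += 1

    hist_human = [sum(row) for row in observed]
    hist_system = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_human[i] * hist_system[j] / total  # counts under independence
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    # Both raters gave every essay the same single score: treat as perfect agreement.
    if weighted_expected == 0.0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

The first call shows perfect agreement and the second a fully reversed ranking, which bracket the range within which a cross-prompt average such as 0.294 should be read.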

This paper provides a comprehensive overview of the QIAS 2025 shared task, organized as part of the ArabicNLP 2025 conference and co-located with EMNLP 2025. The task was designed for the evaluation of large language models in the complex domains of religious and legal reasoning. It comprises two subtasks: (1) Islamic Inheritance Reasoning, requiring models to compute inheritance shares according to Islamic jurisprudence, and (2) Islamic Knowledge Assessment, which covers a range of traditional Islamic disciplines. Both subtasks were structured as multiple-choice question answering challenges, with questions stratified by varying difficulty levels. The shared task attracted significant interest, with 44 teams participating in the development phase, from which 18 teams advanced to the final test phase. Of these, 6 teams submitted entries for both subtasks, 8 for Task 1 only, and 2 for Task 2 only. Ultimately, 16 teams submitted system description papers. Herein, we detail the task’s motivation, dataset construction, and evaluation protocol, and present a summary of the participating systems and their results.

Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

BALSAM is a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

