China

Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both without analyzers and with their integration. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios.

EMNLP 2025

Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models

disambiguation

lemmatization

part-of-speech tagging

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)


Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Recent extensive research on hallucination in large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs' outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to that of reasoning-based models.

AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Prompt relevance is a critical yet underexplored dimension in Arabic Automated Essay Scoring (AES). We present the first systematic study of binary prompt-essay relevance classification, supporting both AES scoring and dataset annotation. To address data scarcity, we built a synthetic dataset of on-topic and off-topic pairs and evaluated multiple models, including threshold-based classifiers, SVMs, causal LLMs, and a fine-tuned masked SBERT model. For real-data evaluation, we combined QAES with ZAEBUC, creating off-topic pairs via mismatched prompts. We also tested prompt expansion strategies using AraVec, CAMeL, and GPT-4o. Our fine-tuned SBERT achieved 98% F1 on synthetic data and strong results on QAES+ZAEBUC, outperforming SVMs and threshold-based baselines and offering a resource-efficient alternative to LLMs. This work establishes the first benchmark for Arabic prompt relevance and provides practical strategies for low-resource AES.


Evaluating Prompt Relevance in Arabic Automatic Essay Scoring: Insights from Synthetic and Real-World Data

The Holy Qur'an provides timeless guidance, addressing modern challenges and offering answers to many important questions. The Qur'an QA 2023 shared task introduced the Qur'anic Passage Retrieval (QPR) task, which involves retrieving relevant passages in response to questions written in modern standard Arabic (MSA). In this work, we evaluate the ability of seven large language models (LLMs) to retrieve relevant passages from the Qur'an in response to given questions, considering zero-shot and several few-shot scenarios. Our experiments show that the best model, Claude, significantly outperforms the state-of-the-art QPR model by 28 points on MAP and 38 points on MRR, exhibiting an impressive improvement of about 113% and 82%, respectively.

Can LLMs Directly Retrieve Passages for Answering Questions from Qur'an?

Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, which is 90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.

ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition

Addressing the need for efficient scoring beyond the time-intensive manual process, this work demonstrates that Feature Engineering is not Dead for Arabic Automated Essay Scoring (AES). We introduce a comprehensive set of 816 engineered linguistic features, inspired by the success in both English and Arabic AES, and grouped into five categories: Surface, Lexical, Semantic, Syntactic, and Readability Metrics. Our experiments on the TAQAE dataset using cross-prompt training confirm that these features are essential: they dramatically boost the performance of Hybrid models (like ProTACT and AraBERT), and models that rely on them, like the Feature-based and Hybrid categories, achieve the highest overall average performance, with Random Forest (RF) + feature selection reaching an average QWK of 0.294. This clearly establishes that engineered features remain critical for achieving state-of-the-art results in Arabic AES.

Feature Engineering is not Dead: A Step Towards State of the Art for Arabic Automated Essay Scoring

This paper provides a comprehensive overview of the QIAS 2025 shared task, organized as part of the ArabicNLP 2025 conference and co-located with EMNLP 2025. The task was designed for the evaluation of large language models in the complex domains of religious and legal reasoning. It comprises two subtasks: (1) Islamic Inheritance Reasoning, requiring models to compute inheritance shares according to Islamic jurisprudence, and (2) Islamic Knowledge Assessment, which covers a range of traditional Islamic disciplines. Both subtasks were structured as multiple-choice question answering challenges, with questions stratified by varying difficulty levels. The shared task attracted significant interest, with 44 teams participating in the development phase, from which 18 teams advanced to the final test phase. Of these, 6 teams submitted entries for both subtasks, 8 for Task 1 only, and 2 for Task 2 only. Ultimately, 16 teams submitted system description papers. Herein, we detail the task’s motivation, dataset construction, and evaluation protocol, and present a summary of the participating systems and their results.

Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

BALSAM is a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

BALSAM: A Platform for Benchmarking Arabic Large Language Models

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3) Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic-centric LLMs and applications while providing concrete recommendations for future efforts in Arabic post-training dataset development.

Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

We address the task of reverse dictionary modeling in Arabic, where the goal is to retrieve a target word given its definition. The task comprises two subtasks: (1) generating embeddings for Arabic words based on Arabic glosses, and (2) a cross-lingual setting where the gloss is in English and the target embedding is for the corresponding Arabic word. Prior approaches have largely relied on BERT models such as CAMeLBERT or MARBERT trained with mean squared error loss. In contrast, we propose a novel ensemble architecture that combines MARBERTv2 with the encoder of AraBART, and we demonstrate that the choice of loss function has a significant impact on performance. We apply contrastive loss to improve representational alignment, and introduce structural and center losses to better capture the semantic distribution of the dataset. This multi-loss framework enhances the quality of the learned embeddings and leads to consistent improvements in both monolingual and cross-lingual settings. Our system achieved the best rank metric in both subtasks compared to the previous approaches. These results highlight the effectiveness of combining architectural diversity with task-specific loss functions in representational tasks for morphologically rich languages like Arabic.

Learning Word Embeddings from Glosses: A Multi-Loss Framework for Arabic Reverse Dictionary Tasks

We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset’s utility for instruction tuning. Notably, we show that instruction tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.
