Arabic Word Sense Disambiguation (WSD) remains challenging due to the language's rich morphology, pervasive polysemy, and scarcity of large annotated resources. This study benchmarks four generative Large Language Models (LLMs)—GPT-4o, LLaMA 3.1-8B, Qwen 2.5-7B, and Gemma 2-9B—on two public Arabic WSD datasets under both zero-shot and fine-tuned conditions. Results show that GPT-4o achieves the strongest zero-shot performance (79% accuracy, 66% macro-F1), while parameter-efficient fine-tuning of open models via LoRA closes this gap and surpasses it. Qwen 2.5-7B attains 90.77% accuracy and 83.98% macro-F1 on Dataset A, and LLaMA 3.1-8B reaches 88.51% accuracy and 69.41% macro-F1 on Dataset B. The findings demonstrate that medium-sized open LLMs can serve as competitive, reproducible baselines for Arabic sense-level understanding when modest supervision is applied.
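The abstract reports both accuracy and macro-F1, which can diverge sharply (e.g. 79% vs. 66% in the zero-shot setting) when sense distributions are skewed. A minimal sketch of how these two WSD metrics are computed from gold and predicted sense labels — the sense labels below are purely illustrative, not drawn from the datasets in the study:

```python
# Sketch of the two metrics reported for Arabic WSD: accuracy and macro-F1.
# The sense labels here are hypothetical examples, not from Dataset A or B.

def accuracy(gold, pred):
    """Fraction of instances whose predicted sense matches the gold sense."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-sense F1 scores.

    Macro-F1 gives rare senses the same weight as frequent ones,
    which is why it typically trails accuracy on skewed sense
    distributions, as in the zero-shot results above.
    """
    labels = set(gold) | set(pred)
    f1s = []
    for label in labels:
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical gold/predicted sense labels for four test instances.
gold = ["bank_river", "bank_money", "bank_money", "bank_river"]
pred = ["bank_river", "bank_money", "bank_river", "bank_river"]
print(accuracy(gold, pred))            # 0.75
print(round(macro_f1(gold, pred), 3))  # 0.733
```

Here the minority sense (`bank_money`) is recalled only half the time, so macro-F1 (0.733) falls below accuracy (0.75) — the same pattern, at larger scale, behind the accuracy/macro-F1 gaps in the reported results.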
