China

Addressing the need for efficient scoring beyond the time-intensive manual process, this work demonstrates that feature engineering is not dead for Arabic Automated Essay Scoring (AES). We introduce a comprehensive set of 816 engineered linguistic features, inspired by their success in both English and Arabic AES and grouped into five categories: Surface, Lexical, Semantic, Syntactic, and Readability Metrics. Our experiments on the TAQAE dataset using cross-prompt training confirm that these features are essential: they dramatically boost the performance of Hybrid models (like ProTACT and AraBERT), and the models that rely on them, in the Feature-based and Hybrid categories, achieve the highest overall average performance, with Random Forest (RF) + feature selection reaching an average QWK of 0.294. This establishes that engineered features remain critical for achieving state-of-the-art results in Arabic AES.

EMNLP 2025

Feature Engineering is not Dead: A Step Towards State of the Art for Arabic Automated Essay Scoring

automated essay scoring

feature engineering

large language models

encoder

workshop paper
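The essay-scoring abstract above reports results as QWK (quadratic weighted kappa), an agreement measure between predicted and human scores that penalizes a disagreement by the square of its distance. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name, the integer score range, and the toy inputs are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two integer score lists (1.0 = perfect, 0.0 = chance level)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    num_labels = max_score - min_score + 1
    if num_labels < 2:
        raise ValueError("the score range must contain at least two labels")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) score pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at the maximum distance.
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters were statistically independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        # Only possible when both raters assign one identical score everywhere.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Calling `quadratic_weighted_kappa(human_scores, model_scores, 1, 5)` on two aligned score lists returns a value of at most 1.0; values near 0 indicate agreement no better than chance.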

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

This paper provides a comprehensive overview of the QIAS 2025 shared task, organized as part of the ArabicNLP 2025 conference and co-located with EMNLP 2025. The task was designed for the evaluation of large language models in the complex domains of religious and legal reasoning. It comprises two subtasks: (1) Islamic Inheritance Reasoning, requiring models to compute inheritance shares according to Islamic jurisprudence, and (2) Islamic Knowledge Assessment, which covers a range of traditional Islamic disciplines. Both subtasks were structured as multiple-choice question answering challenges, with questions stratified by varying difficulty levels. The shared task attracted significant interest, with 44 teams participating in the development phase, from which 18 teams advanced to the final test phase. Of these, six teams submitted entries for both subtasks, eight for Task 1 only, and two for Task 2 only. Ultimately, 16 teams submitted system description papers. Herein, we detail the task’s motivation, dataset construction, and evaluation protocol, and present a summary of the participating systems and their results.

Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

BALSAM is a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

BALSAM: A Platform for Benchmarking Arabic Large Language Models

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3) Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic-centric LLMs and applications while providing concrete recommendations for future efforts in Arabic post-training dataset development.

Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

We address the task of reverse dictionary modeling in Arabic, where the goal is to retrieve a target word given its definition. The task comprises two subtasks: (1) generating embeddings for Arabic words based on Arabic glosses, and (2) a cross-lingual setting where the gloss is in English and the target embedding is for the corresponding Arabic word. Prior approaches have largely relied on BERT models such as CAMeLBERT or MARBERT trained with mean squared error loss. In contrast, we propose a novel ensemble architecture that combines MARBERTv2 with the encoder of AraBART, and we demonstrate that the choice of loss function has a significant impact on performance. We apply contrastive loss to improve representational alignment, and introduce structural and center losses to better capture the semantic distribution of the dataset. This multi-loss framework enhances the quality of the learned embeddings and leads to consistent improvements in both monolingual and cross-lingual settings. Our system achieved the best rank metric in both subtasks compared to the previous approaches. These results highlight the effectiveness of combining architectural diversity with task-specific loss functions in representational tasks for morphologically rich languages like Arabic.

Learning Word Embeddings from Glosses: A Multi-Loss Framework for Arabic Reverse Dictionary Tasks
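The abstract above says a contrastive loss is used to align gloss embeddings with target word embeddings but does not give its exact form. As a rough illustration only, the sketch below implements one common choice, an in-batch InfoNCE-style loss over cosine similarities; the function names, the temperature value, and the toy vectors are assumptions, and the paper's structural and center losses are not covered.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def info_nce_loss(gloss_embs, word_embs, temperature=0.1):
    """Mean cross-entropy of selecting each gloss's own word among the batch.

    gloss_embs[i] and word_embs[i] are assumed to form a matching pair;
    every other word in the batch serves as a negative for gloss i.
    """
    if len(gloss_embs) != len(word_embs) or not gloss_embs:
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, gloss in enumerate(gloss_embs):
        logits = [_cosine(gloss, word) / temperature for word in word_embs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_partition - logits[i]
    return total / len(gloss_embs)
```

The loss is small when each gloss embedding is closest to its own word embedding and grows when it is closer to another word in the batch, which is the alignment pressure the abstract describes.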

We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset’s utility for instruction tuning. Notably, we show that instruction tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.

ALARB: An Arabic Legal Argument Reasoning Benchmark

We present Octopus, a first family of modular speech-language models designed for Arabic-English ASR, dialect identification, and speech translation. Built on Whisper-V3 and enhanced with large language models like ALLaM, LLaMA, and DeepSeek, Octopus bridges speech and text through a lightweight projection layer and Q-Former. To broaden its scope beyond speech, Octopus integrates BEATs, a general-purpose audio encoder, allowing it to understand both linguistic and acoustic events. Despite its simplicity, this dual-encoder design supports robust performance across multilingual and code-switched scenarios. We also introduce TinyOctopus, a distilled variant using smaller models (Distil-Whisper + LLaMA3-1B / DeepSeek-1.5B), achieving competitive results with just a fraction of the parameters. Fine-tuning on synthetic code-switched data further boosts its performance. Octopus demonstrates the power of compact, extensible architectures in Arabic-centric speech modeling and sets the stage for unified multilingual audio-language understanding.

Octopus: Towards Building the Arabic Speech LLM Suite

In this paper, we present the Arabic Multimodal Crawl (AMCrawl), the first native-based Arabic multimodal dataset to our knowledge, derived from the Common Crawl corpus and rigorously filtered for quality and safety. Image-text pair datasets are the standard choice for pretraining multimodal large language models. However, they are often derived from image alt-text metadata, which is typically brief and context-poor, disconnecting images from their broader meaning. Although significant advances have been made in building interleaved image-text datasets for English, such as the OBELICS dataset, a substantial gap remains for native Arabic content. Our processing covered 8.6 million Arabic web pages, yielding 5.8 million associated images and 1.3 billion text tokens. The final dataset includes interleaved image-text documents and question-answer pairs, featuring 2.8 million high-quality interleaved documents and 5 million QA pairs. Alongside the dataset, we release the complete pipeline and code, ensuring reproducibility and encouraging further research and development. To demonstrate the effectiveness of AMCrawl, we introduce a publicly available native Arabic Vision Language model, trained with 13 billion parameters. These models achieve competitive results when benchmarked against publicly available datasets. AMCrawl bridges a critical gap in Arabic multimodal resources, providing a robust foundation for developing Arabic multimodal large language models and fostering advancements in this underrepresented area. Code: github.com/shahad-aboukozzana/AMCrawl

AMCrawl: An Arabic Web-Scale Dataset of Interleaved Image-Text Documents and Image-Text Pairs

Conversational question-answering (CQA) plays a crucial role in bridging the gap between human language and machine understanding, enabling more natural and interactive exchanges with AI systems. In this work, we present the first results on open-domain Arabic CQA using deep learning. We introduce AraQReCC, a large-scale Arabic CQA dataset containing 9K conversations with 62K question-answer pairs, created by translating a subset of the QReCC dataset. To ensure data quality, we used COMET-based filtering and manual ratings from large language models (LLMs), such as GPT-4 and LLaMA, selecting conversations with COMET scores, along with LLM ratings of 4 or more. AraQReCC facilitates advanced research in Arabic CQA, improving clarity and relevance through question rewriting. We applied AraT5 for question rewriting and used BM25 and Dense Passage Retrieval (DPR) for passage retrieval. AraT5 is also used for question answering, completing the end-to-end system. Our experiments show that the best performance is achieved with DPR, attaining an F1 score of 21.51% on the test set. While this falls short of the human upper bound of 40.22%, it underscores the importance of question rewriting and quality-controlled data in enhancing system performance.

Open-domain Arabic Conversational Question Answering with Question Rewriting
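The abstract above names BM25 as one of the two passage retrievers. To make the retrieval step concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the authors' pipeline: the function name, the pre-tokenized inputs, the smoothed IDF variant, and the `k1` and `b` values (conventional defaults) are all assumptions, and real Arabic text would need proper tokenization and normalization first.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized passage against a tokenized (rewritten) question."""
    if not docs_tokens:
        raise ValueError("need at least one passage")
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of passages containing each term.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative even for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores
```

Sorting passages by these scores in descending order gives the candidate list passed to the answer-generation step; a dense retriever such as DPR would replace this lexical scoring with embedding similarity.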

Classical Arabic represents a significant era that encompasses the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, which comprises 66,000 high-quality classical Arabic to English translation samples that cover a wide array of topics including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub: https://huggingface.co/datasets/mohamed-khalil/ATHAR

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

For effective use in specific countries, Large Language Models (LLMs) need a strong grasp of local culture and core knowledge to ensure socially appropriate, context-aware, and factually correct responses. Existing Arabic and Saudi benchmarks are limited, focusing mainly on dialects or lifestyle, with little attention to deeper cultural or domain-specific alignment from authoritative sources. To address this gap and the challenge LLMs face with non-Western cultural nuance, this study introduces the Saudi-Alignment Benchmark. It consists of 874 manually curated questions across two core cultural dimensions: Saudi Cultural and Ethical Norms, and Saudi Domain Knowledge. These questions span multiple subcategories and use three formats to assess different goals with verified sources. Our evaluation reveals significant variance in LLM alignment. GPT-4 achieved the highest overall accuracy (83.3%), followed by ALLaM-7B (81.8%) and Llama-3.3-70B (81.6%), whereas Jais-30B exhibited a pronounced shortfall at 21.9%. Furthermore, multilingual LLMs excelled in norms; ALLaM-7B in domain knowledge. Considering the effect of question format, LLMs generally excelled in selected-response formats but showed weaker results on generative tasks, indicating that recognition-based benchmarks alone may overestimate cultural and contextual alignment. These findings highlight the need for tailored benchmarks and reveal LLMs’ limitations in achieving cultural grounding, particularly in underrepresented contexts like Saudi Arabia.
