EMNLP 2025

November 08, 2025

Suzhou, China


North African Arabic dialects pose major NLP challenges due to high lexical variation, script diversity (Arabic/Latin), and frequent French code-switching. We introduce a phoneme-based normalization scheme that harmonizes surface forms across varieties by mapping both Arabic and French into a single simplified Latin representation.
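The core idea — collapsing Arabic-script and Latin-script (French-influenced) spellings of the same word onto one simplified Latin form — can be sketched as below. The character mappings and orthographic folds are illustrative assumptions for the sake of the example, not the paper's actual phoneme inventory.

```python
# Illustrative sketch of phoneme-based normalization: Arabic characters are
# transliterated and French orthographic variants are folded so that
# different surface spellings of the same word converge on one Latin form.
# Both tables below are hypothetical, not the paper's actual mapping.

ARABIC_TO_LATIN = {
    "ب": "b", "ت": "t", "ج": "j", "د": "d", "ر": "r",
    "س": "s", "ش": "c", "ف": "f", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "u", "ي": "i", "ا": "a",
}

# French digraphs and accented letters folded to simple symbols (illustrative).
FRENCH_FOLDS = [
    ("ch", "c"), ("ou", "u"), ("é", "e"), ("è", "e"), ("ç", "s"),
]

def normalize(text: str) -> str:
    """Map a mixed Arabic/Latin string into the simplified Latin space."""
    # 1. Transliterate Arabic characters; pass everything else through.
    out = "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)
    # 2. Lowercase, then fold French orthographic variants.
    out = out.lower()
    for src, dst in FRENCH_FOLDS:
        out = out.replace(src, dst)
    return out

# Latin-script and Arabic-script spellings of the same dialect word
# ("chouf" / "شوف") both normalize to "cuf" under these toy tables.
print(normalize("chouf"))
print(normalize("شوف"))
```

Under a mapping like this, a model pretrained only on normalized standard-language text sees dialectal input in the same reduced alphabet, which is what makes the zero-shot transfer described below plausible.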

We then pretrain BERT models exclusively on normalized Modern Standard Arabic and French, without using any dialectal data. The resulting models are evaluated on Named Entity Recognition (DzNER, DarNER, WikiFANE) and sentiment classification (TwiFil).

Our approach achieves state-of-the-art performance on several North African benchmarks and shows strong zero-shot generalization from MSA to Algerian NER, with our Ar_20k model outperforming dialect-pretrained models. Results demonstrate that standard-only pretraining with normalization is a viable and scalable solution for supporting underserved Arabic dialects.
