Lecture image placeholder

Premium content

Access to this content requires a subscription. You must be a premium user to view this content.

Monthly subscription - $9.99Pay per view - $4.99Access through your institutionLogin with Underline account
Need help?
Contact us
Lecture placeholder background
VIDEO DOI: https://doi.org/10.48448/7z8p-nv76

workshop paper

ACL 2024

August 16, 2024

Bangkok, Thailand

CATT: Character-based Arabic Tashkeel Transformer

keywords:

noisy student

tashkeel

diacritization

pretraining

transformers

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we finetuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art in ATD. In addition, we show that our model outperforms GPT-4-turbo on CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community footnote{ https://github.com/abjadai/catt }.

Downloads

Transcript English (automatic)

Next from ACL 2024

Picking Up Where the Linguist Left Off: Mapping Morphology to Phonology through Learning the Residuals
workshop paper

Picking Up Where the Linguist Left Off: Mapping Morphology to Phonology through Learning the Residuals

ACL 2024

+1Abdelrahim QaddoumiEllen Broselow
Salam Khalifa and 3 other authors

16 August 2024

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Lectures
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2023 Underline - All rights reserved