We tackle diacritic restoration for dialectal Arabic sentences using a multimodal model that combines text and speech. The text stream uses our own pretrained model, CATT, and the speech stream uses the Whisper-base encoder, with a linear classification head for token-level prediction. We integrate the modalities via either Early Fusion or Cross-Attention Fusion, and the system remains robust when speech is absent. On both the official development and test sets, the model outperforms the baseline and the other participants in WER/CER, and it maintains an advantage on challenging pronunciations.
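
To make the Cross-Attention Fusion variant concrete, here is a minimal PyTorch sketch in which text token states (queries) attend to speech frame states (keys/values) before a linear head predicts one diacritic class per token. This is an illustration, not the authors' implementation: the module name `CrossAttentionFusion`, the hidden size, the head count, and the number of diacritic classes are all assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of cross-attention fusion for diacritic restoration:
    text tokens attend to speech frames, then a linear head classifies each token."""

    def __init__(self, d_model=512, n_heads=8, n_diacritics=15):
        # d_model, n_heads, n_diacritics are assumed values, not from the paper.
        super().__init__()
        # Queries come from the text stream; keys/values from the speech stream.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Token-level linear classification head, as described in the abstract.
        self.head = nn.Linear(d_model, n_diacritics)

    def forward(self, text_states, speech_states=None):
        # text_states:   (batch, n_tokens, d_model), e.g. from a CATT-like encoder
        # speech_states: (batch, n_frames, d_model), e.g. from the Whisper encoder, or None
        if speech_states is not None:
            attended, _ = self.cross_attn(text_states, speech_states, speech_states)
            text_states = self.norm(text_states + attended)
        # With speech_states=None the model falls back to text-only prediction,
        # mirroring the claimed robustness when speech is absent.
        return self.head(text_states)  # (batch, n_tokens, n_diacritics)

# Usage with random stand-in encoder outputs (shapes are illustrative):
fusion = CrossAttentionFusion()
text = torch.randn(2, 40, 512)     # text token states, projected to d_model
speech = torch.randn(2, 300, 512)  # speech frame states, projected to d_model
logits_multimodal = fusion(text, speech)
logits_text_only = fusion(text)    # speech-free fallback
```

Early Fusion would instead concatenate or sum the aligned text and speech representations before a shared encoder; the cross-attention route shown here avoids requiring a fixed alignment between tokens and frames.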
