VIDEO DOI: https://doi.org/10.48448/pewt-ca73

workshop paper

ACL 2024

August 16, 2024

Bangkok, Thailand

On the Utility of Pretraining Language Models on Synthetic Data

keywords: natural language understanding (NLU), optical character recognition (OCR), natural language processing (NLP), automatic speech recognition (ASR), pretrained language models, deep learning, text generation, machine translation

The development of pre-trained language models has predominantly relied on large amounts of data. This dependence on abundant data, however, limits the applicability of these models in low-resource settings. In this work, we investigate the utility of exploiting synthetic datasets acquired from different sources to pre-train language models for Arabic. Specifically, we leverage data derived via four methods: optical character recognition (OCR), automatic speech recognition (ASR), machine translation (MT), and generative language models. We use these datasets to pre-train models in three architectures: encoder-only (BERT-Base), encoder-decoder (T5), and decoder-only (GPT-2). We evaluate the resulting models on Arabic natural language understanding (NLU) tasks using the ORCA benchmark. Our results show that models trained on synthetic data can achieve performance comparable to, or even surpassing, that of models trained on gold data. For example, our GPT-2-based model trained on a combined synthetic dataset surpasses the ARBERTv2 baseline. Overall, our models pre-trained on synthetic data demonstrate robust performance across various tasks, highlighting the potential of synthetic datasets for augmenting language model training in low-resource settings.
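The abstract mentions pre-training on a "combined synthetic dataset" drawn from four sources (OCR, ASR, MT, and generative models). As a rough illustration only — the paper's actual pipeline is not reproduced here, and the directory layout and function name below are hypothetical — the corpus-merging step might be sketched as follows:

```python
import random
from pathlib import Path

def combine_synthetic_corpora(source_dirs, out_path, seed=42):
    """Merge line-level text from several synthetic sources
    (e.g. OCR, ASR, MT, and LM-generated output) into a single
    shuffled pretraining corpus.

    Hypothetical helper for illustration; not the paper's code.
    """
    lines = []
    for d in source_dirs:
        # Each source directory is assumed to hold plain-text shards.
        for f in sorted(Path(d).glob("*.txt")):
            lines.extend(
                ln.strip()
                for ln in f.read_text(encoding="utf-8").splitlines()
                if ln.strip()
            )
    # Shuffle so that pretraining batches mix all four sources
    # rather than seeing one source at a time.
    random.Random(seed).shuffle(lines)
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")
    return len(lines)
```

The fixed seed keeps the shuffle reproducible across runs, which matters when comparing models pre-trained on the same combined corpus.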
