

workshop paper
Dolomites@#SMM4H 2024: Helping LLMs "Know The Drill" in Low-Resource Settings: A Study on Social Media Posts
keywords:
mtl-da
entity recognition
large language model
classification
data augmentation
information extraction
The amount of data available to fine-tune LLMs plays a crucial role in the performance of these models on downstream tasks. Consequently, it is not straightforward to deploy these models in low-resource settings. In this work, we investigate two new multi-task learning data augmentation approaches for fine-tuning LLMs when little data is available: "In-domain Augmentation" of the training data, and extracting "Drills" as smaller tasks from the target dataset. We evaluate the proposed approaches in three natural language processing settings drawn from the SMM4H 2024 competition tasks: multi-class classification, entity recognition, and information extraction. The results show that both techniques improve model performance in all three settings, suggesting that knowledge learned in multi-task training transfers positively to the target task.
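To make the two ideas concrete, here is a minimal, purely illustrative sketch of what "drill" extraction and in-domain augmentation could look like for an entity-recognition dataset. The function names, the presence/absence drill, and the lowercasing augmentation are all assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch only -- the drill definition and augmentation strategy
# below are illustrative assumptions, not the method described in the paper.

def make_presence_drill(examples):
    """Derive a simpler auxiliary 'drill' task from a token-labeled dataset:
    binary classification of whether a post mentions any entity at all."""
    return [
        {"text": ex["text"],
         "label": int(any(tag != "O" for tag in ex["tags"]))}
        for ex in examples
    ]

def in_domain_augment(examples):
    """Naive in-domain augmentation: append a lowercased copy of each
    example, reusing its original token tags."""
    augmented = list(examples)
    for ex in examples:
        augmented.append({"text": ex["text"].lower(), "tags": ex["tags"]})
    return augmented

posts = [
    {"text": "Took Aspirin today", "tags": ["O", "B-DRUG", "O"]},
    {"text": "Feeling fine", "tags": ["O", "O"]},
]

drill = make_presence_drill(posts)      # smaller auxiliary task
augmented = in_domain_augment(posts)    # enlarged training set
```

Both outputs could then be mixed into a multi-task fine-tuning run alongside the original target task.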