

workshop paper
Do LLMs Speak Kazakh? A Pilot Evaluation of Seven Models
keywords:
kazakh
llm
large language models
evaluation
We conducted a systematic evaluation of seven large language models (LLMs) on tasks in Kazakh, a Turkic language spoken by approximately 13 million native speakers in Kazakhstan and abroad. We used six datasets covering different tasks: question answering, causal reasoning, middle-school math problems, machine translation, and spelling correction. Three of the datasets were prepared specifically for this study. As expected, the models' quality on the Kazakh tasks is lower than on the parallel English tasks. GPT-4 shows the best results, followed by Gemini and Aya. In general, the LLMs perform better on classification tasks and struggle with generative tasks. Our results provide valuable insights into the applicability of currently available LLMs to Kazakh. We will release the data collected for this study, which can serve as a starting point for a Kazakh-focused LLM benchmark.