Ottoman Turkish, as a historical variant of modern Turkish, suffers from a scarcity of available corpora and NLP models. This paper outlines our pioneering endeavors to address this gap by constructing a clean text corpus of Ottoman Turkish materials. We detail the challenges encountered in this process and offer potential solutions. Additionally, we present a case study wherein the created corpus is employed in continual pre-training of BERTurk, followed by evaluation of the model's performance on the named entity recognition task for Ottoman Turkish. Preliminary experimental results suggest the effectiveness of our corpus in adapting existing models developed for modern Turkish to historical Turkish.

ACL 2024

Towards a Clean Text Corpus for Ottoman Turkish

data cleansing

text mining

corpus

workshop paper

### Welcome!
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. Our Virtual Poster Sessions will take place online Thursday, August 22, 2024.

Euphemisms are a form of figurative language relatively understudied in natural language processing. This research extends the current computational work on potentially euphemistic terms (PETs) to Turkish. We introduce the Turkish PET dataset, the first available of its kind in the field. By creating a list of euphemisms in Turkish, collecting example contexts, and annotating them, we provide both euphemistic and non-euphemistic examples of PETs in Turkish. We describe the dataset and methodologies, and also experiment with transformer-based models on Turkish euphemism detection by using our dataset for binary classification. We compare performances across models using F1, accuracy, and precision as evaluation metrics.

Turkish Delights: a Dataset on Turkish Euphemisms
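The abstract above names F1, accuracy, and precision as the evaluation metrics for binary euphemism detection. As a minimal sketch of how these are computed, assuming label 1 marks a euphemistic use and label 0 a non-euphemistic one (the function name `binary_metrics` and the toy labels are illustrative and do not come from the paper):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only: 1 = euphemistic, 0 = non-euphemistic.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0]
scores = binary_metrics(gold, pred)
print(scores)
```

Recall is included only because F1 is defined from it; the abstract itself lists F1, accuracy, and precision.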

We conducted a systematic evaluation of seven large language models (LLMs) on tasks in Kazakh. Kazakh is a Turkic language spoken by approximately 13 million native speakers in Kazakhstan and abroad. We used six datasets corresponding to different tasks: question answering, causal reasoning, middle school math problems, machine translation, and spelling correction. Three of the datasets were prepared for this study. As expected, the quality of the LLMs on the Kazakh tasks is lower than on the parallel English tasks. GPT-4 shows the best results, followed by Gemini and Aya. In general, LLMs perform better on classification tasks and struggle with generative tasks. Our results provide valuable insights into the applicability of currently available LLMs for Kazakh. We will publish the data collected for this study, which will be a good start for an LLM benchmark focused on Kazakh.

Do LLMs Speak Kazakh? A Pilot Evaluation of Seven Models

This paper presents our work on tools to support the Tatar language, using Revita, a web-based Intelligent Tutoring System for language learning and teaching. The system allows its users, both teachers and learners, to upload arbitrary authentic texts, and automatically creates exercises based on these texts that engage the learners in active production of language. It provides graduated feedback when they make mistakes, and performs continuous assessment, which helps the students maintain their learning pace and helps the teachers monitor their progress.
The paper describes the functionality currently implemented for Tatar, which enables learners who possess basic proficiency beyond the beginner level to improve their competency, using texts of their choice as learning content. Support for Tatar is being developed to increase public interest in learning the language of this important regional minority, as well as to provide tools for improving fluency to "heritage speakers": those who have substantial passive competency, but lack active fluency and need support for regular practice.

An Intelligent Tutor to Support Teaching and Learning of Tatar

This paper reports on the performance of SRCB’s system in the Social Media Mining for Health (#SMM4H) 2024 Shared Task 1: extraction and normalization of adverse drug events (ADEs) in English tweets. We develop a system composed of an ADE extraction module and an ADE normalization module, which further includes a retrieval module and a filtering module. To alleviate the data imbalance and other issues introduced by the dataset, we employ four data augmentation techniques based on Large Language Models (LLMs) across both modules. Our best submission achieves an F1 score of 53.6 (49.4 on the unseen subset) on the ADE normalization task and an F1 score of 52.1 on the ADE extraction task.

SRCB at #SMM4H 2024: Making Full Use of LLM-based Data Augmentation in Adverse Drug Event Extraction and Normalization

This paper presents our approaches for the SMM4H’24 Shared Task 5 on the binary classification of English tweets reporting children’s medical disorders. Our first approach involves fine-tuning a single RoBERTa-large model, while the second approach entails ensembling the results of three fine-tuned BERTweet-large models. We demonstrate that although both approaches exhibit identical performance on validation data, the BERTweet-large ensemble excels on test data. Our best-performing system achieves an F1-score of 0.938 on test data, outperforming the benchmark classifier by 1.18%. Our code is available on GitHub.

LT4SG@SMM4H’24: Tweets Classification for Digital Epidemiology of Childhood Health Outcomes Using Pre-Trained Language Models

In this paper, we present our approach to addressing the binary classification tasks, Tasks 5 and 6, as part of the Social Media Mining for Health (SMM4H) text classification challenge. Both tasks involved working with imbalanced datasets that featured a scarcity of positive examples. To mitigate this imbalance, we employed a Large Language Model to generate synthetic texts with positive labels, aiming to augment the training data for our text classification models. Unfortunately, this method did not significantly improve model performance. Through clustering analysis using text embeddings, we discovered that the generated texts significantly lacked diversity compared to the raw data. This finding highlights the challenges of using synthetic text generation for enhancing model efficacy in real-world applications, specifically in the context of health-related social media data.

UTRad-NLP at #SMM4H 2024: Why LLM-Generated Texts Fail to Improve Text Classification Models

This document describes our system used for the Social Media Mining for Health (SMM4H) 2024 Task 5. The objective of this task was to perform binary classification on the tweets provided in the dataset. The dataset contained two categories of tweets: those reporting medical disorders and those merely mentioning the disease. We tackled this problem by training the RoBERTa-Large model with 5-fold cross-validation. The evaluation results yielded an F1-score of 0.886 on the validation dataset and 0.823 on the test dataset.

PheonixTrio918 at SMM4H 2024: 5 Fold Cross Validation for Classification of tweets reporting children’s disorders
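The 5-fold cross-validation mentioned in the abstract above can be sketched as follows. This is only an illustration of how the data is partitioned: the helper `k_fold_indices`, the shuffling seed, and the sample count are assumptions, and the abstract does not say whether the folds were stratified or how the per-fold models were combined.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, val


# Hypothetical dataset of 23 tweets; each tweet is validated on exactly once.
splits = list(k_fold_indices(23, k=5))
for fold_id, (train, val) in enumerate(splits):
    print(f"fold {fold_id}: {len(train)} train / {len(val)} validation")
```

In practice a model would be fine-tuned on each `train` split and scored on the matching `val` split, so every example contributes to validation once.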

We present our approach to solving the task of identifying the effect of outdoor activities on social anxiety based on Reddit posts. We employed state-of-the-art transformer models enhanced with a combination of advanced loss functions. Data augmentation techniques were also used to address class imbalance within the training set. Our method achieved a macro-averaged F1 score of 0.655 on the test data, exceeding the mean F1 score of the shared task of 0.519. These findings suggest that integrating weighted loss functions improves the performance of transformer models in classifying unbalanced text data, while data augmentation can improve the model’s ability to generalize.

PCIC at SMM4H 2024: Enhancing Reddit Post Classification on Social Anxiety Using Transformer Models and Advanced Loss Functions
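The abstract above credits weighted loss functions with helping on unbalanced data but does not specify which ones were used. One common option, shown here purely as an assumed example, is cross-entropy with inverse-frequency class weights; the function names and toy labels below are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs is a list of dicts mapping class -> predicted probability.
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        weight_sum += w
    return total / weight_sum


# Made-up imbalanced labels: class 1 is the rare class.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
uniform = [{0: 0.5, 1: 0.5} for _ in labels]
loss = weighted_cross_entropy(uniform, labels, weights)
print(weights, loss)
```

With these weights, a mistake on the rare class costs three times as much as a mistake on the frequent class, which is the effect a weighted loss is meant to have on unbalanced data.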

With the widespread increase in the use of social media platforms such as Twitter, Instagram, and Reddit, people are sharing their views on various topics. They have become more vocal on these platforms about their views and opinions on the medical challenges they are facing. This data is a valuable source of medical insights for healthcare research. This paper describes our adoption of transformer-based approaches for tasks 3 and 5. For both tasks, we fine-tuned RoBERTa-large, a BERT-based architecture, and achieved F1 scores of 0.413 and 0.900 in tasks 3 and 5, respectively.

Transformers at #SMM4H 2024: Identification of Tweets Reporting Children’s Medical Disorders And Effects of Outdoor Spaces on Social Anxiety Symptoms on Reddit Using RoBERTa

This paper presents our approach for SMM4H 2024 Task 5, focusing on identifying tweets where users discuss their child’s health conditions of ADHD, ASD, delayed speech, or asthma. Our approach uses a pipeline that combines transformer-based classifiers and GPT-4 large language models (LLMs). We first address data imbalance in the training set using topic modelling and under-sampling. Next, we train RoBERTa-based classifiers on the adjusted data. Finally, GPT-4 refines the classifier’s predictions for uncertain cases (confidence below 0.9). This strategy achieved significant improvement over the baseline RoBERTa models. Our work demonstrates the effectiveness of combining transformer classifiers and LLMs for extracting health insights from social media conversations.
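The final step of the pipeline described above, where predictions with confidence below 0.9 are passed to a second model, can be sketched as a simple routing rule. Only the 0.9 threshold comes from the abstract. The function `refine_predictions` and both stand-in models are hypothetical: `toy_classifier` takes the place of the RoBERTa-based classifier and `toy_fallback` takes the place of the GPT-4 call.

```python
def refine_predictions(texts, classifier, fallback, threshold=0.9):
    """Keep confident classifier labels; re-label the rest with fallback.

    classifier(text) must return (label, confidence);
    fallback(text) must return a label.
    """
    results = []
    for text in texts:
        label, confidence = classifier(text)
        if confidence < threshold:
            # Uncertain case: defer to the second-stage model.
            results.append((fallback(text), "fallback"))
        else:
            results.append((label, "classifier"))
    return results


# Hypothetical stand-ins so the sketch runs without any model or API access.
def toy_classifier(text):
    return (1, 0.95) if "my son has asthma" in text else (0, 0.6)


def toy_fallback(text):
    return 1 if "daughter" in text else 0


tweets = [
    "my son has asthma",
    "my daughter was diagnosed with ADHD",
    "asthma awareness week",
]
out = refine_predictions(tweets, toy_classifier, toy_fallback)
print(out)
```

Routing only the uncertain cases keeps the number of second-stage calls small, since confident predictions never reach the fallback model.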
