Thailand

Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, challenges arise when using these models that stem from their scale, their closed source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this ACL 2024 theme track paper, we introduce DataDreamer, an open source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at: https://github.com/datadreamer-dev/DataDreamer.

ACL 2024

DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

workflows

alignment

training

synthetic data

prompting

reproducibility

poster

### Welcome!
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. Our Virtual Poster Sessions will take place online Thursday, August 22, 2024.

You are required to register for this event. **Please register [here](https://2024.aclweb.org/registration). **

If you have already registered, please check your inbox for an email from Underline granting you access to ACL 2024 content.

Please register!

The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. More information will be announced soon.

Large language models (LLMs) have shown great capabilities in various tasks but also exhibited memorization of training data, raising tremendous privacy and copyright concerns. While prior works have studied memorization during pre-training, the exploration of memorization during fine-tuning is rather limited. Compared to pre-training, fine-tuning typically involves more sensitive data and diverse objectives, thus may bring distinct privacy risks and unique memorization behaviors. In this work, we conduct the first comprehensive analysis to explore language models' (LMs) memorization during fine-tuning across tasks. Our studies with open-sourced and our own fine-tuned LMs across various tasks indicate that memorization presents a strong disparity among different fine-tuning tasks. We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.

Exploring Memorization in Fine-tuned Language Models

We introduce a simple and effective prompting technique called in-context mixing (ICM) for effective in-context learning (ICL) with multilingual large language models (MLLMs). With ICM, we modify the few-shot examples within ICL prompts to be intra-sententially code-mixed by randomly swapping content words in the target languages with their English translations. We observe that ICM prompts yield superior performance in NLP tasks such as disfluency correction, grammar error correction and text simplification that demand a close correspondence between the input and output sequences. Significant improvements are observed mainly for low-resource languages that are under-represented during the pretraining and finetuning of MLLMs. We present an extensive set of experiments to analyze when ICM is effective and what design choices contribute towards its effectiveness. ICM works consistently and significantly better than other prompting techniques across models of varying capacity such as mT0-XXL, BloomZ and GPT-4.

In­-context Mixing (ICM): Code­-mixed Prompts for Multilingual LLMs

Large Language Models (LLMs) show strong instruction understanding ability across multiple languages. However, they 
are easily biased towards English in instruction tuning, and generate English responses even given non-English instructions. In this paper, we investigate the language inconsistent generation problem in monolingual instruction tuning. We find that instruction tuning in English increases the models' preference for English responses. It attaches higher probabilities to English responses than to responses in the same language as the instruction. Based on the findings, we alleviate the language inconsistent generation problem by counteracting the model preference for English responses in both the training and inference stages. Specifically, we propose Pseudo-Inconsistent Penalization (PIP) which prevents the model from generating English responses when given non-English language prompts during training, and Prior Enhanced Decoding (PED) which improves the language-consistent prior by leveraging the untuned base language model. Experimental results show that our two methods significantly improve the language consistency of the model without requiring any multilingual data. Our code, data, and models will be released.

Respond in my Language: Mitigating Language Inconsistency in Response Generation based on Large Language Models

Automated disinformation generation is often listed as one of the risks of large language models (LLMs). The theoretical ability to flood the information space with disinformation content might have dramatic consequences for democratic societies around the world. This paper presents a comprehensive study of the disinformation capabilities of the current generation of LLMs to generate false news articles in English language. In our study, we evaluated the capabilities of 10 LLMs using 20 disinformation narratives. We evaluated several aspects of the LLMs: how well they are at generating news articles, how strongly they tend to agree or disagree with the disinformation narratives, how often they generate safety warnings, etc. We also evaluated the abilities of detection models to detect these articles as LLM-generated. We conclude that LLMs are able to generate convincing news articles that agree with dangerous disinformation narratives.

Disinformation Capabilities of Large Language Models

Large language models (LLMs) show remarkable human-like capability in various domains and languages. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining ~20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tunings, such as LoRA, for language adaptation. Alternatively, we propose the usage of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and showcase that safety in pre-training in one language such as English is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.

Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,046 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

Latxa: An Open Language Model and Evaluation Suite for Basque

Texts written in different languages reflect different culturally-dependent beliefs of their writers. Thus, we expect multilingual LMs (MLMs), that are jointly trained on a concatenation of text in multiple languages, to encode different cultural values for each language. Yet, as the `multilinguality' of these LMs is driven by cross-lingual sharing, we also have reason to belief that cultural values bleed over from one language into another. This limits the use of MLMs in practice, as apart from being proficient in generating text in multiple languages, creating language technology that can serve a community also requires the output of LMs to be sensitive to their biases (Naous et al. 2023). Yet, little is known about how cultural values emerge and evolve in MLMs (Hershcovich et al. 2022). We are the first to study how languages can exert influence on the cultural values encoded for different test languages, by studying how such values are revised during fine-tuning. Focusing on the fine-tuning stage allows us to study the interplay between value shifts when exposed to new linguistic experience from different data sources and languages. Lastly, we use a training data attribution method to find patterns in the fine-tuning examples, and the languages that they come from, that tend to instigate value shifts.

The Echoes of Multilinguality: Tracing Cultural Value Shifts during Language Model Fine-tuning

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts.
Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Argumentation mining (AM) aims to detect the arguments and their inherent relations from argumentative textual compositions. Generally, AM comprises three key challenging subtasks, including argument component type classification (ACTC), argumentative relation identification (ARI), and argumentative relation type classification (ARTC). Prior methods are afflicted by a sequential feature decoding paradigm, wherein they initially address the features of argumentation components (ACs) for the task of ACTC. Then, these features are amalgamated in pairs to tackle the task of ARI. Finally, the AC pairs and ascertained pertinent relations are employed for ARTC. However, the explicit and comprehensive inter-relationship among the three subtasks is neglected. In this paper, we propose a novel method PITA for \underline{P}rompt\underline{I}ng \underline{T}ask inter\underline{A}ction to model the inter-relationships among the three subtasks within a generative framework. Specifically, we employ a dynamic prompt template to indicate all ACs and AC pairs in the three subtasks. Then, from a multi-relational perspective, we construct an undirected heterogeneous graph to capture the various relationships within and between ACs and AC pairs. We apply the Relational Graph Convolutional Network (RGCN) on the graph and inject the task interaction information into the soft prompts with continuous representations. PITA jointly decodes all ACs and AC pairs using the prompt template with task interaction information, which thus explicitly and comprehensively harmonizes the information propagation across the three subtasks. Extensive experiments show PITA achieves state-of-the-art performances on two AM benchmarks.

PITA: Prompting Task Interaction for Argumentation Mining

Concerns regarding Large Language Models (LLMs) to memorize and disclose private information, particularly Personally Identifiable Information (PII), become prominent within the community. Many efforts have been made to mitigate the privacy risks.
However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks to localize specific neurons that account for the memorization of PII in LLMs through adversarial training. Our investigations discover that PII is memorized by a small subset of neurons across all layers, which shows the property of PII specificity. Furthermore, we propose to validate the potential in PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.

Downloads

Next from ACL 2024

Exploring Memorization in Fine-tuned Language Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES