
Orevaoghene Ahia
PhD Student / Graduate Research Assistant @ University of Washington
reinforcement learning
machine translation
low-resource
modularity
interpretability
language modeling
large language models
self-supervised learning
multilinguality
tokenization
alignment
multilingual
data quality
audit
multilingual data
SHORT BIO
I am a Ph.D. student in Computer Science and Engineering at the University of Washington, advised by Noah A. Smith and Yulia Tsvetkov. My research spans multilingual NLP, model interpretability, and model efficiency and fairness. I am interested in designing (1) novel ways to efficiently learn representations fairly across languages with varying linguistic properties, and (2) methods that leverage model interpretability to improve model performance and demote spurious training data artefacts.
Presentations

Teaching LLMs to Abstain across Languages via Multilingual Feedback
Shangbin Feng and 8 other authors

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia and 7 other authors

Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
Everlyn Chimoto and 5 other authors

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz and 4 other authors

Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
Roy Xie and 3 other authors

Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia and 6 other authors

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer and 51 other authors

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation
Orevaoghene Ahia and 2 other authors