Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially in low-resource languages other than English. Yet, a few research studies have studied the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic Wikipedia edition regarding the massive automatic creation of its articles using template-based translation from English to Arabic without human involvement, overwhelming the Egyptian Arabic Wikipedia with articles that do not only have low-quality content but also with articles that do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics through exploratory analysis and building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions and utilize the resulting insights to build multivariate machine learning classifiers leveraging articles’ metadata to detect the template-translated articles automatically. We then publicly deploy and host the best-performing classifier as an online application called ‘Egyptian Wikipedia Scanner’ and release the extracted, filtered, labeled, and preprocessed datasets to the research community to benefit from our datasets and the online, web-based detection system.

[PUBLISHED] Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

Instruction tuning has emerged as a prominent methodology for teaching Large Language Models (LLMs) to follow instructions. However, current instruction datasets predominantly cater to English or are derived from English-dominated LLMs, leading to inherent biases toward Western culture. This bias negatively impacts non-English languages such as Arabic and the unique culture of the Arab region. This paper addresses this limitation by introducing CIDAR, the first open Arabic instruction-tuning dataset culturally aligned by native Arabic speakers. CIDAR contains 10,000 instruction and output pairs that represent the Arab region. We discuss the cultural relevance of CIDAR via the analysis and comparison to a few models fine-tuned on other datasets. Our experiments indicate that models fine-tuned on CIDAR achieve better cultural alignment compared to those fine-tuned on 30x more data.

CIDAR: Culturally Relevant Instruction Dataset For Arabic

Text classification systems have been proven vulnerable to adversarial text examples, modified versions of the original text examples that are often unnoticed by human eyes, yet can force text classification models to alter their classification. Often, research works quantifying the impact of adversarial text attacks have been applied only to models trained in English. In this paper, we introduce the first word-level study of adversarial attacks in Arabic. Specifically, we use a synonym (word-level) attack using a Masked Language Modeling (MLM) task with a BERT model in a black-box setting to assess the robustness of the state-of-the-art text classification models to adversarial attacks in Arabic. To evaluate the grammatical and semantic similarities of the newly produced adversarial examples using our synonym BERT-based attack, we invite four human evaluators to assess and compare the produced adversarial examples with their original examples. We also study the transferability of these newly produced Arabic adversarial examples to various models and investigate the effectiveness of defense mechanisms against these adversarial examples on the BERT models. We find that fine-tuned BERT models were more susceptible to our synonym attacks than the other Deep Neural Networks (DNN) models like WordCNN and WordLSTM we trained. We also find that fine-tuned BERT models were more susceptible to transferred attacks. We, lastly, find that fine-tuned BERT models successfully regain at least 2% in accuracy after applying adversarial training as an initial defense mechanism.

Arabic Synonym BERT-based Adversarial Examples for Text Classification

Wikipedia articles are a widely used source of training data for Natural Language Processing (NLP) research, particularly as corpora for low-resource languages like Arabic. However, it is essential to understand the extent to which these corpora reflect the representative contributions of native speakers, especially when many entries in a given language are directly translated from other languages or automatically generated through automated mechanisms. In this paper, we study the performance implications of using inorganic corpora that are not representative of native speakers and are generated through automated techniques such as bot generation or automated template-based translation. The case of the Arabic Wikipedia editions gives a unique case study of this since the Moroccan Arabic Wikipedia edition (ARY) is small but representative, the Egyptian Arabic Wikipedia edition (ARZ) is large but unrepresentative, and the Modern Standard Arabic Wikipedia edition (AR) is both large and more representative. We intrinsically evaluate the performance of two main NLP upstream tasks, namely word representation and language modeling, using word analogy evaluations and fill-mask evaluations using our two newly created datasets: Arab States Analogy Dataset (ASAD) and Masked Arab States Dataset (MASD). We demonstrate that for good NLP performance, we need both large and organic corpora; neither alone is sufficient. We show that producing large corpora through automated means can be a counter-productive, producing models that both perform worse and lack cultural richness and meaningful representation of the Arabic language and its native speakers.

Performance Implications of Using Unrepresentative Corpora in Arabic Natural Language Processing

Wikipedia articles are a common source of training data for Natural Language Processing (NLP) research, especially as a source for corpora in languages other than English. However, research has shown that not all Wikipedia editions are produced organically by native speakers, and there are substantial levels of automation and translation activities in the Wikipedia project that could negatively impact the degree to which they truly represent the language and the culture of native speakers. To encourage transparency in the Wikipedia project, Wikimedia Foundation introduced the depth metric as an indication of the degree of collaboration or how frequently users edit a Wikipedia edition’s articles. While a promising start, this depth metric suffers from a few serious problems, like a lack of adequate handling of inflation of edits metric and a lack of full utilization of users-related metrics. In this paper, we propose the DEPTH+ metric, provide its mathematical definitions, and describe how it reflects a better representation of the depth of human collaborativeness. We also quantify the bot activities in Wikipedia and offer a bot-free depth metric after the removal of the bot-created articles and the bot-made edits on the Wikipedia articles.

DEPTH+: An Enhanced Depth Metric for Wikipedia Corpora Quality

Wikipedia is a common source of training data for Natural Language Processing (NLP) research, especially as a source for corpora in languages other than English. However, for many downstream NLP tasks, it is important to understand the degree to which these corpora reflect representative contributions of native speakers. In particular, many entries in a given language may be translated from other languages or produced through other automated mechanisms. Language models built using corpora like Wikipedia can embed history, culture, bias, stereotypes, politics, and more, but it is important to understand whose views are actually being represented. In this paper, we present a case study focusing specifically on differences among the Arabic Wikipedia editions (Modern Standard Arabic, Egyptian, and Moroccan). In particular, we document issues in the Egyptian Arabic Wikipedia with automatic creation/generation and translation of content pages from English without human supervision. These issues could substantially affect the performance and accuracy of Large Language Models (LLMs) trained from these corpora, producing models that lack the cultural richness and meaningful representation of native speakers. Fortunately, the metadata maintained by Wikipedia provides visibility into these issues, but unfortunately, this is not the case for all corpora used to train LLMs.

Learning From Arabic Corpora But Not Always From Arabic Speakers: A Case Study of the Arabic Wikipedia Editions

The use of word embeddings is an important NLP technique for extracting meaningful conclusions from corpora of human text. One important question that has been raised about word embeddings is the degree of gender bias learned from corpora. Bolukbasi et al. (2016) proposed an important technique for quantifying gender bias in word embeddings that, at its heart, is lexically based and relies on sets of highly gendered word pairs (e.g., mother/father and madam/sir) and a list of professions words (e.g., doctor and nurse). In this paper, we document problems that arise with this method to quantify gender bias in diachronic corpora. Focusing on Arabic and Chinese corpora, in particular, we document clear changes in profession words used over time and, somewhat surprisingly, even changes in the simpler gendered defining set word pairs. We further document complications in languages such as Arabic, where many words are highly polysemous/homonymous, especially female professions words.

Roadblocks in Gender Bias Measurement for Diachronic Corpora

Please note that this Workshop was not recorded. This workshop focuses on applying NLP techniques to improve Wikipedia content, including automating updates, enhancing accessibility, and improving the quality of articles. Topics include information extraction, fact verification, and bias detection. [Read more](https://meta.wikimedia.org/wiki/NLP_for_Wikipedia_(EMNLP_2024))

W19: The First Workshop on Advancing Natural Language Processing for Wikipedia (WikiNLP)

## Welcome to EMNLP 2024! 
We are excited to welcome you to one of the most prominent conferences in the field of Natural Language Processing. This year, EMNLP 2024 is being held in a hybrid format,
offering both virtual and in-person participation in beautiful Miami. Due to a record-breaking number of submissions, we've expanded the total number of accepted papers to accommodate more cutting-edge research from around the globe.
### [Conference Handbook](https://drive.google.com/file/d/1WPROgxjLAC96AJL7Ugy0tEnYm7dkrbHt/view?usp=sharing)

You are required to register for this event. **Please register [here](https://2024.emnlp.org/registration/).** The EMNLP 2024 event page on Underline will be open to public one week prior to the event.

Please register!

EMNLP 2024

EMNLP 2024 will take place in Miami, Florida from Nov 12th to Nov 16th, 2024, at the Hyatt Regency Miami Hote and on Underline for remote participants.

Papers in this section have not been assigned a presentation time slot.

Virtual Poster Session (asynchronous viewing)

poster

### Welcome!
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. Our Virtual Poster Sessions will take place online Thursday, August 22, 2024.

You are required to register for this event. **Please register [here](https://2024.aclweb.org/registration). **

If you have already registered, please check your inbox for an email from Underline granting you access to ACL 2024 content.

ACL 2024

The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. More information will be announced soon.

Welcome to the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Continuing its mission of expanding and involving the science community of all European countries, EACL has selected the Malta community for the 18th EACL. Considering the importance of physical interaction among researchers, the conference will be held at the Hotel Radisson Blu, St. Julians, in Malta, from 17 to 22 of March, 2024. The conference will also feature the possibility to participate virtually. As the flagship European conference in the field of computational linguistics, EACL welcomes European and international researchers covering a broad spectrum of research areas that are concerned with computational approaches to natural language. 
#### [Conference Handbook](https://drive.google.com/file/d/13Wn38q0yev6U3RTgnf4Tn5T9GuAuZ-xW/view?usp=sharing) *downloadable*

To access the EACL 2024 event page you need to register [**here**](https://acl.swoogo.com/EACL2024). 

EACL 2024

As the flagship European conference in the field of computational linguistics, EACL welcomes European and international researchers covering a broad spectrum of research areas that are concerned with computational approaches to natural language.

**[Workshop Organizers:](https://arabicnlp2023.sigarab.org/organizers)**
 *Coffee Break 10:30 am* 
*Lunch Break 12:30 pm* 
*Coffee Break 3:30 pm* [**Workshop Website**](https://arabicnlp2023.sigarab.org/) The First Arabic Natural Language Processing Conference (ArabicNLP 2023) is organized by the Special Interest Group on Arabic NLP (SIGARAB). ArabicNLP 2023 builds on the seven previous workshop editions, which have been extremely successful drawing in a large active participation in various capacities. This conference is timely given the continued rise in research projects focusing on Arabic NLP. [**Schedule**](https://arabicnlp2023.sigarab.org/program)
 [**Go to Gather**](https://app.gather.town/app/ROH1myZvp9ba7a3C/EMNLP2023?spawnToken=Bd1cMzLKR-W1LhrjFP55)

W17: The First Arabic Natural Language Processing Conference (ArabicNLP 2023)

workshop paper

**Welcome to EMNLP 2023!**

On behalf of the EMNLP 2023 Organizing Committee, I extend a warm and heartfelt welcome to all of you to the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). It is with immense pleasure and excitement that we gather here in Singapore, a vibrant hub of innovation and technological advancement.

The conference program is packed with insightful presentations, thought-provoking workshops, and engaging networking opportunities. In addition to the technical sessions, we have also planned several social events that will provide you with the opportunity to connect with your colleagues.

The conference is held in person at the Resorts World Convention Centre in Singapore, and available on-line with the help of Underline.

For those in Singapore, I hope that you will find time to explore the exciting Sentosa Island, Gardens by the Bay, Singapore Botanic Gardens and the many other attractions unique to Singapore.

Of course, we are grateful to our sponsors and partners for their generous support of EMNLP 2023, whose contributions make it possible for us to host this world-class event.

Yuji Matsumoto (RIKEN AIP) 
EMNLP 2023 General Chair

To access the **EMNLP 2023** event page on Underline, you need to register for the Conference. 
Please follow **[this link](https://2023.emnlp.org/registration/)** for more details.

EMNLP 2023

EMNLP 2023 took place in Singapore from Dec 6th to Dec 10th, 2023.

Recent advances in Natural Language Processing, and the emergence of pretrained Large Language Models (LLM) specifically, have made NLP systems omnipresent in various aspects of our everyday life. In addition to traditional examples such as personal voice assistants, recommender systems, etc, more recent developments include content-generation models such as ChatGPT, text-to-image models (Dall-E), and so on. While these emergent technologies have an unquestionable potential to power various innovative NLP and AI applications, they also pose a number of challenges in terms of their safe and ethical use. To address such challenges, NLP researchers have formulated various objectives, e.g., intended to make models more fair, safe, and privacy-preserving. However, these objectives are often considered separately, which is a major limitation since it is often important to understand the interplay and/or tension between them. For instance, meeting a fairness objective might require access to users’ demographic information, which creates tension with privacy objectives. The goal of this workshop is to move toward a more comprehensive notion of Trustworthy NLP, by bringing together researchers working on those distinct yet related topics, as well as their intersection. 
For more details visit our workshop website: https://trustnlpworkshop.github.io/
#### [Visit GatherTown](https://app.gather.town/app/5OscLswEEuYf2Psb/ACL2023)

W15: The 3rd Workshop on Trustworthy NLP (TrustNLP)

### Welcome to ACL 2023, the 61st Annual Meeting of the Association for Computational Linguistics! 
 The conference will be held in Toronto, Canada, July 9-14, 2023. 
Following the succession of the recent conferences in our field, ACL 2023 will adopt a hybrid format.
While the impact of Covid has considerably diminished in terms of traveling, obtaining visas to Canada
entails a very long process. Moreover, the global economic conditions pose challenges for many individuals to travel to conferences. Recognizing these circumstances, we know many participants may not be
able to attend the conference in person. Therefore, we are committed to providing a great virtual platform
so everyone has the opportunity to interact with other participants and enjoy the conference. Based on the
current registered participants, approxiately 30% have chosen to attend the conference virtually. Whether
you join us in person or virtually, we sincerely hope everyone has a remarkable conference experience. 
This General Chair’s message is where I express my gratitude to the many individuals who have made
enormous contributions to the conference over the past year.

Read [**ACL 2023 General Chair's message**](https://docs.google.com/document/d/1WobYM7norbG4dI48s75HfJoD89qgX5a_F-6U8AteLSA/edit?usp=sharing/) in full.

##### **[Conference Handbook](https://2023.aclweb.org/downloads/acl2023-handbook.pdf)**

ACL 2023

The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational problems involving human language, a field often referred to as either computational linguistics or natural language processing.

**Chairs:** Hend Al-Khalifa, Nadi Tomeh, Houda Bouamor, Kareem Darwish, and Ahmed Abdelali **Organizers:** Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Wajdi Zaghouani, Nadi Tomeh, and Salam Khalifa 
**Invited speaker**: [Prof. Karim Bouzoubaa](https://underline.io/speakers/212906-karim-bouzoubaa) 
Please visit our [**website**](https://sites.google.com/view/wanlp2022/) for more information and detailed schedule.

**RocketChat channel for this workshop:** https://emnlp2022.rocket.chat/channel/workshop-24-wanlp

W24 (WANLP) The Seventh Arabic Natural Language Processing Workshop

## Welcome to EMNLP 2022!
I am delighted to welcome you to EMNLP 2022! I believe this conference has been complicated beyond any precedent. Over the past year, it’s been thrilling to see the organization team approach each new puzzle with creativity and enthusiasm. We hope that those participating in Abu Dhabi as well as those joining remotely will leave the conference feeling newly inspired by the program and newly connected to our ever-growing community. Following EMNLP 2021 and major NLP conferences since, EMNLP 2022 is “hybrid,” serving both virtual and in-person participants.

Our key innovations for EMNLP 2022 include:

* EMNLP 2022 is “hybrid” in a second sense, as well: we allowed both direct and rolling review paper submissions, building on the pilot experiment of EMNLP 2021, which considered a small number of ARR submissions. 
* Familiar from NAACL but new to EMNLP, we’ve added an industry track.
* During the conference, “portals” will link virtual poster sessions to in-person conference participants during poster sessions each day.
* The first ACL-family conference in the United Arab Emirates.

 *Message from Noah A. Smith, University of Washington and Allen Institute for AI, Seattle, Washington, USA* 
***EMNLP 2022 General Chair***
 
[![](https://assets.underline.io/uploads/markdown_image/1/image/9eec7d4a287ee18c278b08229290aa83.png)](https://drive.google.com/file/d/1OlPv6QBeo62VVTughj2jkiLeyHd1WnUt/view)
 
[![](https://assets.underline.io/uploads/markdown_image/1/image/a3db7a768409f05192210d98601edb25.png)](https://emnlp2022.rocket.chat/)

To access this site you need to register. Please register [here](https://2022.emnlp.org/registration/).

Register here

EMNLP 2022

Welcome!
EMNLP 2022 will take place in Abu Dhabi from December 7th to December 11th, 2022. And it will be held in hybrid mode, both online and offline.

**Organizers:** Nina Tahmasebi, Lars Borin, Simon Hengchen, Syrielle Montariol, Haim Dubossarsky,
Andrey Kutuzov **Description:** This workshop explores state-of-the-art computational methodologies, theories and digital text resources on exploring the
time-varying nature of human language. The aim of this workshop is three-fold. First, we want to provide pioneering
researchers who work on computational methods, evaluation, and large-scale modelling of language change an outlet for
disseminating cutting-edge research on topics concerning language change. We want to utilize this proposed workshop as
a platform for sharing state-of-the-art research progress in this fundamental domain of natural language research. Second,
in doing so we want to bring together domain experts across disciplines. by connecting researchers in historical linguistics
with those that develop and test computational methods for detecting semantic change and laws of semantic change; and
those that need knowledge (of the occurrence and shape) of language change, for example, in digital humanities and
computational social sciences where text mining is applied to diachronic corpora subject to e.g., lexical semantic change.
Third, the detection and modelling of language change using diachronic text and text mining raise fundamental theoretical
and methodological challenges for future research. **Please visit our [website](https://languagechange.org/events/2022-acl-lchange/)**

(LChange) 3rd International Workshop on Computational Approaches to Historical Language Change (LChange’22)

# Welcome everyone to ACL 2022!

The 60th Annual Meeting of the Association for Computational Linguistics is taking place May 22-27, 2022 as a hybrid event, in Dublin and online. We are happy to welcome all of you to this anniversary edition with an almost 50-50 in-person and virtual participation. 
The main conference program features oral presentations, in-person and virtual posters and demo sessions, a plenary session for our best paper presentations and awards, three amazing keynote events and two new initiatives of invited talks: Spotlight Talks for Young Rising Stars (STIRS) and The Next Big Idea Talks. Posters (including Findings of ACL 2022) and demos are grouped by areas for both the in-person and the virtual sessions. For the virtual component, the talks will be on Zoom and the posters and the demos will be in GatherTown. The Student Research Workshop will have an oral session and a poster session as part of Poster Session 1. The program also features eight Tutorials and 28 Workshops. 

 
We wish you a wonderful conference! 
[**The ACL 2022 Organizing Committee**](https://www.2022.aclweb.org/organisers)
 
[**Conference Handbook**](https://drive.google.com/file/d/1_BUCMfhMVrjG9E2e71aHdHeE28KSje0l/view?usp=sharing) 
[**Mini Handbook**](https://drive.google.com/file/d/1qlBKl0wzmlVF1oCeMQl3BahLd9nLP5Ce/view?usp=sharing) 
[**Posters and Demo guides**](https://drive.google.com/file/d/1UucMAoCNncIOaH1rMMDa0owuG9qgvJTG/view?usp=sharing)

ACL 2022

The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational problems involving human language, a field often referred to as either computational linguistics or natural language processing (NLP). 

Saied Alshahrani

7

2

SHORT BIO

Presentations

[PUBLISHED] Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

CIDAR: Culturally Relevant Instruction Dataset For Arabic

Arabic Synonym BERT-based Adversarial Examples for Text Classification

Performance Implications of Using Unrepresentative Corpora in Arabic Natural Language Processing

DEPTH+: An Enhanced Depth Metric for Wikipedia Corpora Quality

Learning From Arabic Corpora But Not Always From Arabic Speakers: A Case Study of the Arabic Wikipedia Editions

Roadblocks in Gender Bias Measurement for Diachronic Corpora

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES