Thailand

Transliterating Sumerian is a key step in understanding Sumerian texts, but remains a difficult and time-consuming task. With more than 100,000 known texts and comparatively few specialists, manually maintaining up-to-date transliterations for the entire corpus is impractical. While many transliterations have been published online thanks to the dedicated effort of previous projects, the lack of a comprehensive, easily accessible dataset that pairs digital representations of source glyphs with their transliterations has hindered the application of natural language processing (NLP) methods to this task.

To address this gap, we present SumTablets, the largest collection of Sumerian cuneiform tablets structured as Unicode glyph-transliteration pairs. Our dataset comprises 91,606 tablets (totaling 6,970,407 glyphs) with associated period and genre metadata. We release SumTablets as a Hugging Face Dataset.

To construct SumTablets, we first preprocess and standardize publicly available transliterations. We then map them back to a Unicode representation of their source glyphs, retaining parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens.
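The mapping step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the reading table, the special-token names (`<SURFACE>`, `<BROKEN>`, `<UNK>`), and the function name are all hypothetical, chosen only to show how readings could be mapped back to Unicode glyphs while structural markers pass through unchanged.

```python
import unicodedata

# Hypothetical reading -> Unicode sign-name table (assumption for illustration;
# a real sign list would contain thousands of readings).
READING_TO_SIGN = {
    "an": "CUNEIFORM SIGN AN",
    "dingir": "CUNEIFORM SIGN AN",  # one glyph can carry several readings
    "lugal": "CUNEIFORM SIGN LUGAL",
    "ki": "CUNEIFORM SIGN KI",
}

# Hypothetical special tokens standing in for structural information.
SPECIAL_TOKENS = {"<SURFACE>", "<BROKEN>"}
UNKNOWN = "<UNK>"


def transliteration_to_glyphs(lines):
    """Map transliterated lines to glyph strings, keeping line structure."""
    glyph_lines = []
    for line in lines:
        glyphs = []
        for word in line.split():
            if word in SPECIAL_TOKENS:
                # Structural markers are copied through unchanged.
                glyphs.append(word)
                continue
            # Signs within a word are joined by hyphens in transliteration.
            for reading in word.split("-"):
                name = READING_TO_SIGN.get(reading)
                glyphs.append(unicodedata.lookup(name) if name else UNKNOWN)
        glyph_lines.append("".join(glyphs))
    # Newlines in the output mirror line breaks on the tablet.
    return "\n".join(glyph_lines)


example = transliteration_to_glyphs(["<SURFACE>", "lugal an-ki", "<BROKEN> dingir"])
```

Note that the mapping is many-to-one: both "an" and "dingir" resolve to the same glyph, which is exactly why the reverse direction (glyph to reading) is the hard part of transliteration.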

We leverage SumTablets to implement and evaluate two transliteration approaches: (1) weighted sampling from a glyph's possible readings and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the potential use of deep learning methods in Assyriological research.
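The first approach, weighted sampling, can be sketched as follows. The abstract does not say how the weights are derived, so this sketch assumes they are simply the frequencies of each reading observed in training pairs; the function names and the `<UNK>` fallback are hypothetical.

```python
import random
from collections import Counter, defaultdict


def count_readings(pairs):
    """Count how often each reading is observed for each glyph."""
    counts = defaultdict(Counter)
    for glyph, reading in pairs:
        counts[glyph][reading] += 1
    return counts


def sample_transliteration(glyphs, counts, rng):
    """Sample one reading per glyph, weighted by observed frequency."""
    output = []
    for glyph in glyphs:
        options = counts.get(glyph)
        if not options:
            # Glyph never seen in training: emit a placeholder (assumption).
            output.append("<UNK>")
            continue
        readings = list(options)
        weights = [options[r] for r in readings]
        output.append(rng.choices(readings, weights=weights, k=1)[0])
    return output


AN = "\U0001202D"  # CUNEIFORM SIGN AN
KI = "\U000121A0"  # CUNEIFORM SIGN KI

training_pairs = [(AN, "an"), (AN, "an"), (AN, "dingir"), (KI, "ki")]
reading_counts = count_readings(training_pairs)
prediction = sample_transliteration([AN, KI], reading_counts, random.Random(0))
```

Because each glyph is sampled independently of its neighbours, a baseline like this ignores context entirely, which is the gap a fine-tuned language model is meant to close.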


ACL 2024

SumTablets: A Transliteration Dataset of Sumerian Tablets

cuneiform

sumerian

glyph

transliteration

low-resource


workshop paper

### Welcome!
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. Our Virtual Poster Sessions will take place online Thursday, August 22, 2024.


The Machine-Actionable Ancient Text (MAAT) Corpus is a new resource providing training and evaluation data for restoring lacunae in ancient Greek, Latin, and Coptic texts. Current text restoration systems require large amounts of data for training and task-relevant means for evaluation. The MAAT Corpus addresses this need by converting texts available in EpiDoc XML format into a machine-actionable format that preserves the most textually salient aspects needed for machine learning: the text itself, lacunae, and textual restorations. Structured test cases are generated from the corpus that align with the actual text restoration task performed by papyrologists and epigraphists, enabling more realistic evaluation than the synthetic tasks used previously. The initial 1.0 beta release contains approximately 134,000 text editions, 178,000 text blocks, and 750,000 individual restorations, with Greek and Latin predominating. This corpus aims to facilitate the development of computational methods to assist scholars in accurately restoring ancient texts.

A new machine-actionable corpus for ancient text restoration

The Ancient Egyptian (AE) writing system was characterised by widespread use of graphemic classifiers: silent (unpronounced) hieroglyphic signs clarifying the meaning or indicating the pronunciation of the host word. The study of classifiers has intensified in recent years with the launch and quick growth of the iClassifier project, which provided a web-based platform for annotation and analysis of classifiers in hieroglyphic, cuneiform, and ancient Chinese texts. Thanks to the data contributed by the project participants, it is now possible to formulate the identification of classifiers in AE texts as an NLP task. In this paper, we make first steps towards solving this task by implementing a series of sequence-labelling neural models, which achieve promising performance despite the modest amount of training data. We discuss tokenisation and operationalisation issues arising from tackling AE texts and contrast our approach with frequency-based baselines.

Classifier identification in Ancient Egyptian as a low-resource sequence-labelling task

In this paper we present a deep learning pipeline for automatically dating ancient Greek papyrus fragments based solely on fragment images. The overall pipeline consists of several stages, including handwritten text recognition (HTR) to detect and classify characters, filtering and grouping of detected characters, 24 character-level date prediction models, and a fragment-level date prediction model that utilizes the per-character predictions. A new dataset (containing approximately 7,000 fragment images and 778,000 character images) was created by scraping papyrus databases, extracting fragment images with known dates, and running them through our HTR models to obtain labeled character images. Transfer learning was then used to fine-tune separate ResNets to predict dates for individual characters, which are then used, in aggregate, to train the fragment-level date prediction model. Experiments show that even though the average accuracies of the character-level dating models are low, between 35% and 45%, the fragment-level model can achieve up to 79% accuracy in predicting a broad, two-century date range for fragments with many characters. We then discuss the limitations of this approach and outline future work on improving temporal resolution and testing on additional papyri. This image-based deep learning approach has great potential to assist scholars in the palaeographical analysis and dating of ancient Greek manuscripts.

A deep learning pipeline for the palaeographical dating of ancient Greek papyrus fragments

We present a novel approach to extracting recurring narrative patterns, or type-scenes, in Biblical Hebrew and Biblical Greek with an information retrieval network. We use cross-references to train an encoder model to create similar representations for verses linked by a cross-reference. We then query our trained model with phrases informed by humanities scholarship and designed to elicit particular kinds of narrative scenes. Our models can surface relevant instances in the top-10 ranked candidates in many cases. Through manual error analysis and discussion, we address the limitations and challenges inherent in our approach. Our findings contribute to the field of Biblical scholarship by offering a new perspective on narrative analysis within ancient texts, and to computational modeling of narrative with a genre-agnostic approach for pattern-finding in long, literary texts.



Detecting Narrative Patterns in Biblical Hebrew and Greek

In this paper, we present a study of Named Entity Recognition (NER) as applied to Ancient Greek texts, with an emphasis on identifying individuals. Recent research shows that, while the task remains difficult, the use of transformer models results in significant improvements. In the first part of the paper, we therefore compare the performance of four transformer models on the task of NER for the categories of people, locations and groups, and add an out-of-domain test set to the existing datasets. Results on this set highlight the shortcomings of the models when confronted with a random sample of sentences. Hence, in the second part of the paper, we narrow down our approach to the category of people, to be able to include domain knowledge. First, we simplify the task to a binary PERS/MISC classification on the token level, starting from capitalised words. Next, we test the use of domain and linguistic knowledge to improve the results. We found that including simple gazetteer information as a binary mask has a marginally positive effect on newly annotated data and that treebanks can be used to help identify multi-word individuals if they are scarcely or inconsistently annotated in the available training data. We conclude with a qualitative error analysis that identifies further areas of improvement.


"Gotta catch 'em all!": Retrieving people in Ancient Greek texts combining transformer models and domain knowledge

This paper explores the possibility of exploiting different Pretrained Language Models (PLMs) to assist in a manual annotation task that consists of assigning the appropriate sense to verbal predicates in a Latin text. This is a crucial step when annotating data according to the Uniform Meaning Representation (UMR) framework, designed to annotate the semantic content of a text in a cross-linguistic perspective. We approach the study as a Word Sense Disambiguation task, with the primary goal of assessing the feasibility of leveraging available resources for Latin to streamline the labor-intensive annotation process. Our methodology revolves around the exploitation of contextual embeddings to compute token similarity, under the assumption that predicates sharing a similar sense would also share their context of occurrence. We discuss our findings, emphasizing the applicability and limitations of this approach in the context of Latin, for which the limited amount of available resources poses additional challenges.

Predicate Sense Disambiguation for UMR Annotation of Latin: Challenges and Insights

Beginning with the discovery of the cuneiform writing system in 1835, there have been numerous grammars published illustrating the complexities of the Sumerian language. However, the one thing they have in common is their omission of dependency rules for syntax in Sumerian linguistics. For this reason, we are working toward a better understanding of Sumerian syntax by means of dependency grammar in the Universal Dependencies (UD) framework. In this study, we therefore articulate the methods and engineering techniques that can address the difficulties of annotating dependency relationships in the Sumerian texts in transliteration from the Electronic Text Corpora of Sumerian (ETCSUX).

UD-ETCSUX: Toward a Better Understanding of Sumerian Syntax

This paper presents a research project on the application of machine learning to the edition of ancient Greek inscriptions. More specifically, it implements Ithaca, a deep neural network architecture, for elaborating a new critical edition of the enquiries to the oracle of Zeus and Dione which are preserved on the lead tablets discovered in the sanctuary of Dodona, in northern Greece.
The goal of the project is twofold: first, it constitutes an attempt to incorporate a new technological tool into the classical epigraphist’s specialized workflow, mainly state-of-the-art machine learning; second, it is conceived as a case study for evaluating the performance of deep learning for editing a corpus of Greek inscriptions which presents a high level of complexity.

Application of Machine Learning to the Critical Edition of Ancient Greek Inscriptions: Ithaca and the Corpus of Oracular Inscriptions of Dodona

We investigate the problem of restoring Mycenaean Linear B clay tablets, dating from about 1400 B.C. to roughly 1200 B.C., by using text infilling methods based on machine learning models. Our goals are threefold: first, to improve on the results of the methods used in the related literature by focusing on the characteristics of the Mycenaean Linear B writing system (series D); second, to examine the same problem for the first time on series A&B; and finally, to investigate transfer learning using series D as source and the smaller series A&B as target. We obtain promising results in the supervised learning tasks, while further investigation is needed to better exploit the merits of transfer learning.

Restoring Mycenaean Linear B 'A&B' series tablets using supervised and transfer learning

Natural language processing for Greek and Latin, inflectional languages with small corpora, requires special techniques. For morphological tagging, transformer models show promising potential, but the best approach to use these models is unclear. For both languages, this paper examines the impact of using morphological lexica, training different model types (a single model with a combined feature tag, multiple models for separate features, and a multi-task model for all features), and adding linguistic constraints. We find that, although simply fine-tuning transformers to predict a monolithic tag may already yield decent results, each of these adaptations can further improve tagging accuracy.
