Methods and Plan of Execution – Machine-Readable Texts for Egyptologists

2021 Machine-Readable Texts for Egyptologists (Finnish Cultural Foundation)

The production of digital texts for the use of future Egyptologists must begin somewhere: the aim of the first year of the project was to manually produce encoded hieroglyphic texts from original sources. An encoded version is necessary to enable automatic transliteration and other data science methods that operate on the original hieroglyphs. There are several methods for encoding hieroglyphic texts so that the positions of the hieroglyphs relative to each other are preserved. Unicode version 12.0, released in spring 2019, introduced control characters that can be used to arrange hieroglyphs into groups. Unicode seems to be the most reasonable and most stable solution for encoding, so that the texts can also be used in other projects in Egyptology. However, producing hieroglyphic texts in Unicode is not easy, since the hieroglyphic characters lie outside the Basic Multilingual Plane and thus exceed the four-digit limit of the Unicode hex input.

In MaReTE, I have chosen to produce machine-readable texts with JSesh. From JSesh it is possible to copy out texts as pictures, as Manuel de Codage (MDC) code, and as Unicode. All texts processed in the project will be published as JSesh-compatible .gly files, as MDC, and as Unicode. During the first year, I developed a tool called gly2mdc for cleaning the .gly files. To get the MDC code out of JSesh, one has to use the Edit menu to choose which form to copy to the clipboard and then paste it into a file. The tool saves time, as the MDC can be written to a new file with one command.

With the newly published data corpus of the Ramses automated transliteration software, the future plans had to be modified. In the corpus, the encoded hieroglyphic sentences are not aligned with their respective transliterations, so a list of encoded words mapped to their transliterations is under preparation.
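The Unicode input problem described above can be illustrated with a short Python sketch. It is only an illustration and not part of the project's tooling: the helper name describe is hypothetical, and the two signs were chosen simply because they are the first two code points of the Egyptian Hieroglyphs block. Each character lies above U+FFFF, so its code point has five hexadecimal digits and needs two UTF-16 code units.

```python
# Two signs from the Egyptian Hieroglyphs block (U+13000-U+1342F).
SIGN_A1 = chr(0x13000)          # EGYPTIAN HIEROGLYPH A001
SIGN_A2 = chr(0x13001)          # EGYPTIAN HIEROGLYPH A002
VERTICAL_JOINER = chr(0x13430)  # format control added in Unicode 12.0


def describe(text):
    """Return (code point label, UTF-16 code units) for each character.

    Hypothetical helper, used here only for illustration.
    """
    return [
        (f"U+{ord(ch):04X}", len(ch.encode("utf-16-le")) // 2)
        for ch in text
    ]


# A1 stacked above A2, encoded as sign + joiner + sign.
group = SIGN_A1 + VERTICAL_JOINER + SIGN_A2

for code_point, units in describe(group):
    print(code_point, "UTF-16 code units:", units)
```

All three characters of the group, including the joiner, are outside the Basic Multilingual Plane, which is why a four-digit hex input method cannot produce them.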

2022-2024 From Sherds of Pottery to Open Egyptological Data (Kone Foundation)

A transliteration tool that uses artificial neural networks and an encoder-decoder method was published recently, but the tool has turned out to be extremely slow and to work only on short sentences. Hieroglyphic writing does not indicate sentence boundaries in any way, and one has to understand the text to find them. In the administrative and personal texts from ancient Egypt, sentences, and sometimes even words, generally run from one line to the next. Therefore, all the lines of a document must be transliterated as one, often long, string, and a new method is needed.

For training and validating the method developed in this project, the Ramses corpus of hieroglyphic sentences is used. The corpus contains the development, validation, and test sets of hieroglyphic sentences with transliterations (Rosmorduc 2021). The texts from the Thesaurus Linguae Aegyptiae (TLA) were recently released as JSON files (Schweitzer 2021). For some of the words in the TLA dataset, the hieroglyphic signs are included and can be used for a word list.

The method developed in this project will be built using an iterative development model. After each enhancement, the method will be tested against the validation set of the Ramses corpus. As word boundaries are not indicated in a hieroglyphic text, the first task is word segmentation. Since I have had excellent experiences with statistical language modelling, I will generate language models of words and sign n-grams from both the Ramses and the TLA datasets.
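As a minimal sketch of what word segmentation with a language model of words can look like, the following Python example finds the most probable split of a sign sequence under a simple unigram model using dynamic programming. It is an assumption-laden illustration, not the method of the project: the function name segment, the parameters max_len and unknown_logp, and the toy counts are all hypothetical, and real counts would come from the Ramses and TLA datasets.

```python
import math


def segment(signs, word_counts, max_len=6, unknown_logp=-20.0):
    """Split a list of sign codes into the most probable word sequence.

    word_counts maps tuples of sign codes to corpus frequencies.
    A single sign that is not in word_counts is allowed as an unknown
    word with a fixed penalty (an assumption made for this sketch).
    """
    total = sum(word_counts.values())
    n = len(signs)
    # best[i] = (best log-probability of signs[:i], start of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = tuple(signs[start:end])
            if candidate in word_counts:
                logp = math.log(word_counts[candidate] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text to the beginning.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(tuple(signs[start:end]))
        end = start
    return words[::-1]


# Toy frequencies, invented for illustration only.
toy_counts = {("G17",): 5, ("O1", "Z1"): 3, ("O1",): 1, ("Z1",): 1}
print(segment(["G17", "O1", "Z1"], toy_counts))
```

With these toy counts the sequence is split into G17 followed by the group O1 Z1, because the two-sign word is more frequent than its parts taken separately. A model that also uses sign n-grams, as planned above, would replace the fixed penalty for unknown signs with a proper estimate.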

The transliteration method will, in its basic form, use the language model of words. However, the word forms found in the text to be transliterated might not be in the language model, as Egyptian words tend to be written in multiple ways. Therefore, a token-based backoff scheme and clustering will also be used.
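One possible reading of such a backoff is sketched below in Python: the written form is first looked up as a whole word, and only if that fails does the lookup back off to the values of the individual signs. This is a hypothetical illustration under stated assumptions, not the scheme that will be implemented; the function transliterate_word, both tables, and their contents are invented for the example, and clustering is not covered.

```python
def transliterate_word(signs, word_table, sign_table):
    """Return (transliteration, level) for one written word.

    Level is "word" for an exact match, "sign" when the result was
    assembled from single-sign values, and "unknown" otherwise.
    """
    key = tuple(signs)
    if key in word_table:
        return word_table[key], "word"
    # Back off: try to assemble the reading from single-sign values.
    values = [sign_table.get(sign) for sign in signs]
    if all(value is not None for value in values):
        return "".join(values), "sign"
    return None, "unknown"


# Toy tables, invented for illustration only.
toy_words = {("G17",): "m", ("O1", "Z1"): "pr"}
toy_signs = {"G17": "m", "D21": "r", "N35": "n"}

print(transliterate_word(["O1", "Z1"], toy_words, toy_signs))
print(transliterate_word(["D21", "N35"], toy_words, toy_signs))
print(transliterate_word(["X99"], toy_words, toy_signs))
```

Returning the level alongside the result makes it possible to flag backed-off readings for manual checking later.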

No automatic transliteration method can be expected to produce a perfect transliteration, as even human experts disagree. The proposed method will be sufficient for releasing large corpora of transliterated hieroglyphic text in a raw format, but transliterated texts must be checked and corrected before they are published. Once a prototype of the transliteration tool is ready, I will build software for improving the automatically produced transliteration. In this tool, one will be able to see the original hieroglyphic text side by side with the transliteration. New words can also be added through the program to the language model of the local implementation of the transliteration method. This process will be iterative: the transliteration method will be improved based on the evaluation of the automatically produced texts and on the new words added to the language model.

The project thus produces both manually encoded and automatically transliterated hieroglyphic texts for the use of Egyptologists. All texts produced will be freely available and reusable. All data will be stored on Zenodo, from where it can be freely downloaded.