Background

With the help of data science, researchers in the humanities can examine large amounts of data at once and find regularities that they might otherwise miss. For example, the long-term borrowing of newspaper items has traditionally been studied by examining a particular era or the repetition chain of a few news items. Using digital methods, a joint project of the Universities of Turku and Helsinki examined the borrowing of items in digitized newspapers published in Finland over a period of 140 years and found about 13.8 million repetitions.

In order to use digital methods, the texts to be examined must be in machine-readable form. For large modern languages, large amounts of machine-readable text are readily available, for example online or from digitized journal archives, and there are ready-made language technology tools for these languages, such as parsers and morphological analyzers, for analyzing and annotating texts. The situation is more difficult for small languages that use special writing systems. Most digital tools handle text written in Latin letters more easily than, for example, hieroglyphs or cuneiform. Without automated methods that make use of language technology, studying something like the meaning of words would be very time-consuming and would require good language proficiency as well as years, if not decades, of experience in reading texts in the language in question. Assyriologists are increasingly publishing transliterated and annotated texts online, which has made it possible to study them using digital methods. In Egyptology, on the other hand, there is no established way of publishing hieroglyphic texts in machine-readable form. Only a few ancient Egyptian text corpora can be read online, and only some of those can be stored as local copies for data science purposes. Digital Egyptology is therefore in dire need of openly available machine-readable texts.

Transliteration refers to the conversion of text written in one writing system into another writing system. When translating and analyzing ancient Egyptian texts, Egyptologists usually transliterate texts written in hieroglyphs into words written in Latin letters with diacritical marks. The conversion is more complex than for many other languages, because the writing system uses characters that correspond to more than one Latin letter or that have several alternative equivalents. The language of ancient Egypt belongs to the Afroasiatic languages. As in the Semitic languages (e.g., Akkadian), its words are built on consonantal roots, and the script records only these consonants. Hieroglyphic texts are a mixture of phonetic signs, logograms, and other sign types. Some characters have multiple consonant values, and sometimes characters repeat consonants already written by other characters. One hieroglyph can correspond to one to three consonants in transliteration, and some hieroglyphs have no equivalent at all. Groups of characters written in the same way can have several meanings, and they therefore often end in a so-called determinative, which specifies which word the group represents. In addition, the same word can be spelled in many different ways. Because many hieroglyphs, and in particular the groups formed from them, can be transliterated in many ways, manual transliteration of a hieroglyphic text requires consulting dictionaries and drawing conclusions from them, which makes it a slow process.
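
The interplay of multi-consonant signs, repeated consonants, and determinatives can be illustrated with a minimal Python sketch. It assumes a tiny hand-picked table of Gardiner sign list codes and one common reading for each sign. The table, the function name `transliterate`, and the simple rule for skipping repeated consonants are illustrative assumptions, not a description of any existing tool, and the rule ignores the ambiguities described above.

```python
# Toy sign table (assumption: one common reading per sign, Gardiner codes).
# An empty string marks a determinative, which has no phonetic equivalent.
SIGN_VALUES = {
    "F35": "nfr",  # three consonants
    "O1": "pr",    # two consonants
    "I9": "f",     # one consonant
    "D21": "r",
    "G17": "m",
    "N35": "n",
    "D54": "",     # walking legs, determinative of motion
}


def transliterate(codes):
    """Turn a sequence of sign codes into a consonant string.

    Single-consonant signs that merely repeat the trailing consonants of
    the preceding multi-consonant sign are skipped (hypothetical heuristic).
    """
    output = ""
    pending = ""  # consonants that a following sign may repeat
    for code in codes:
        value = SIGN_VALUES[code]
        if not value:
            continue  # determinative: contributes no consonants
        if len(value) == 1 and pending.startswith(value):
            pending = pending[1:]  # repeated consonant, already written
            continue
        output += value
        pending = value[1:] if len(value) > 1 else ""
    return output


print(transliterate(["F35", "I9", "D21"]))  # nfr
print(transliterate(["O1", "D21", "D54"]))  # pr
```

Under these assumptions three signs collapse into the three consonants nfr, because the second and third signs only repeat f and r. A realistic system cannot rely on such a rule alone: a following single-consonant sign may also be a genuine additional consonant, and most signs have more than one possible reading, which is why dictionary lookup is needed.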

Transliterated text in digital format (e.g., as a text or XML file) is machine-readable data. To apply the methods of modern computer science, a large amount of such machine-readable text is needed, and automatic transliteration would be needed to speed up its production. Egyptian documents, unlike monumental texts written in hieroglyphs, are usually written in so-called hieratic writing, which is a cursive version of hieroglyphs. One way of producing machine-readable ancient Egyptian text would be to derive it from published images of original hieratic texts or from the hieroglyphic versions of them published by Egyptologists. However, to be successful, optical character recognition requires a lot of text written in the same handwriting. In addition, the automatic creation of transliterations from hieroglyphic versions is hampered by copyright law: an Egyptologist's interpretation of an original hieratic text is protected unless it has been published under an open license.

Automatic transliteration of hieratic and hieroglyphic text requires an intermediate form in which each hieroglyph corresponds to a code and in which the positions of the hieroglyphs relative to each other are also marked. The same kind of encoded format is needed when hieroglyphic texts are typeset for printing. Since the 1980s, several standards for encoding signs have been developed in Egyptology. However, only a small number of encoded hieroglyphic texts are openly available, because the encoded texts used for printed books are, firstly, owned by publishers and, secondly, have been regarded only as an intermediate stage in the publication process.
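
As an illustration of such an intermediate form, the following Python sketch parses a short encoded string into sign groups. It assumes a notation in the style of the Manuel de Codage, one of the encoding conventions used in Egyptology, in which `-` separates consecutive groups, `:` stacks signs vertically, and `*` places signs side by side. Only these three operators are handled; the function names `parse_groups` and `reading_order` are hypothetical, and the full convention contains many more features.

```python
def parse_groups(encoded):
    """Parse an encoded line into groups -> rows -> sign codes.

    Assumed operators (simplified, Manuel de Codage style):
      '-' separates consecutive groups
      ':' stacks rows vertically within a group
      '*' juxtaposes signs horizontally within a row
    """
    return [
        [row.split("*") for row in group.split(":")]
        for group in encoded.split("-")
    ]


def reading_order(groups):
    """Flatten the structure into one sign sequence.

    Assumes groups are read in sequence, rows top to bottom, and signs
    within a row in the order written.
    """
    return [sign for group in groups for row in group for sign in row]


# F35 on its own, followed by I9 stacked above D21.
groups = parse_groups("F35-I9:D21")
print(groups)                 # [[['F35']], [['I9'], ['D21']]]
print(reading_order(groups))  # ['F35', 'I9', 'D21']
```

The nested lists keep the positional information (which signs are stacked and which stand side by side), while the flattened sequence is the kind of input that a transliteration step could work on.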