Guest Post: Digitization of Historical Material and the Impact on Digital Humanities

The following post has been written by Julia Poutanen, who recently got a BA in French Philology, and is now continuing with Master’s studies in the same field. She wrote this essay as part of the Digital Humanities studies and was gracious enough to allow us to publish it here. Thank you!


Essay on the Digitization of Historical Material and the Impact on Digital Humanities

In this essay, I will discuss the things that need to be taken into consideration when historical materials are digitized for further use. The essay is based on the presentation given on the 8th of December 2015 at the symposium Conceptual Change – Digital Humanities Case Studies and on the study by Järvelin et al. (2015). First, I will give the basic information about the study. Then I will move on to reflecting on the results of the study and their impact. Finally, I will consider the study from the point of view of digital humanities and the sort of impact it has on the field.

The study

In order to understand what the study is about, I will first briefly present the background related to the retrieval of historical data, as it is given in the study itself. The data used is OCRed, which means that it has been transformed into digital form via Optical Character Recognition. OCR is very prone to errors with historical texts for reasons like paper quality, typefaces, and print. Another problem comes from the language itself: it differs from the language used nowadays because of the lack of standardization. Geographical variants were normal, and in general there was no consensus on how to spell words properly, which can be seen both in the variation between newspapers and in the variation of language use. In a certain way, historical document retrieval is thus a “cross-language information retrieval” problem. As for the methods, the study used s-gram matching and frequent case generation together with topic-level analysis. S-gram matching is based on the idea that the same concept is represented by different morphological variants and that similar words have similar meanings. These variants can then be used to query for words. Frequent case generation, on the other hand, is used to decide which morphological forms are the most significant ones for the query words. This is done by studying the frequency distributions of case forms. In addition, the newspapers were categorized by topics.
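To make s-gram matching a little more concrete, here is a minimal Python sketch. It is only an illustration and not the implementation used by Järvelin et al.: the skip distances (0 and 1), the Jaccard overlap used as the similarity score, the threshold, and the function names `s_grams`, `similarity` and `rank_candidates` are all assumptions made for this example. The pair vanha/wanha illustrates the kind of historical spelling variation discussed above.

```python
def s_grams(word, skips=(0, 1)):
    """Return character pairs taken `skip` characters apart, tagged by skip.

    skip=0 gives ordinary adjacent bigrams; skip=1 pairs characters
    that have exactly one character between them.
    """
    grams = set()
    for skip in skips:
        step = skip + 1
        for i in range(len(word) - step):
            grams.add((skip, word[i] + word[i + step]))
    return grams


def similarity(a, b, skips=(0, 1)):
    """Jaccard overlap of the two s-gram sets (an assumed scoring choice)."""
    grams_a, grams_b = s_grams(a, skips), s_grams(b, skips)
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_candidates(query, vocabulary, threshold=0.3):
    """Return vocabulary words similar enough to the query, best first."""
    scored = [(similarity(query, word), word) for word in vocabulary]
    return [word for score, word in sorted(scored, reverse=True)
            if score >= threshold]


# A modern query word against a tiny, made-up index vocabulary.
print(rank_candidates("vanha", ["wanha", "kana", "vanhan", "talo"]))
# → ['vanhan', 'wanha']
```

In this toy vocabulary both the inflected form (vanhan) and the historical spelling (wanha) end up as variant candidates, while unrelated words fall below the threshold.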

The research questions are basically related more to computer science than to the humanities. The main purpose is to see whether string matching methods are good for retrieving material, in this case old newspapers, in a highly inflectional language like Finnish. Looking more closely, they try to measure how well s-gram matching captures the actual variation in the data. They also try to see which is better for retrieval: better coverage of all the word forms or high precision? Topics were also used to analyze whether query expansion would lead to better performance. All in all, the research questions belong to computer science, but the humanities lie “under” these questions. How does Finnish work when retrieving material? How has the language changed? How is the morphological evolution visible in the data, and how does this affect the retrieval? All these things have to be taken into consideration in order to answer the “real” research questions.

The data used in this research is a large corpus of historical newspapers in Finnish, which belongs to the archives of the National Library of Finland. In this specific research, however, they only used a subset of the archive, which includes digitized newspapers from 1829 to 1890 and contains 180 468 documents. The quality of the data varies a lot, which makes pretty much everything difficult. People can access the same corpus on the web in three ways (through the National Library web pages, through the corpus service Korp, and by downloading the data as a whole), but not all of it is available to everyone. On the National Library web pages, one can run queries with words from the data, but the results are page images. On Korp, the given words are shown in a KWIC view. The downloadable version of the data includes, for instance, metadata, and can be hard to handle on one’s own, but for people who know what they’re doing, it can be a really good source of research material.

Since the word forms differ from the contemporary ones, it is difficult to form a solid list of query words. In order to handle this, certain measures have to be taken. The methods used fall roughly into three steps. First, they used string matching to obtain lists of words in the newspaper collections from the 1800s that are close to their contemporary counterparts. This was done quite easily since in 19th-century Finnish the variants, whether inflectional, historical or OCR-related, follow patterns with the same substitutions, insertions, deletions, and their combinations. Left padding was also used to emphasize the stability of the beginnings of words. This way, it becomes really clear that Finnish is a language of suffixes which mainly uses inflected forms to express linguistic relations, so that the endings of the words determine the meaning and thus, in this case, the similarity of the strings. To handle the noisiness of the data, as they call it, they used a combination of frequent case generation and s-grams to cover the inflectional variation and to reduce the noise. This creates overlap among the s-gram variants, which is why each variant was added to the query only once; this enhances query performance, although the number of variants is lower. Second, they manually categorized these s-gram variants according to how far they were from the actual words. Then, they categorized how relevant the variant candidates were and analyzed the query word length and the s-gram frequencies. Some statistical measures were helpful in the analysis. It was important to see the relevance of these statistics, since irrelevant documents naturally aren’t very useful for users. That is why they also evaluated the results at the topic level, which helped them see how well topics and query keys perform.
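The query-building step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure from the study: the coverage cut-off, the `#` padding character, the function names, the form counts and the OCR-like variants (`taio`, `ta1o`) are all invented for the example.

```python
from collections import Counter


def frequent_case_forms(form_counts, coverage=0.75):
    """Pick the most frequent inflected forms until they cover
    the given share of all occurrences (assumed cut-off rule)."""
    total = sum(form_counts.values())
    chosen, covered = [], 0
    for form, count in Counter(form_counts).most_common():
        if total == 0 or covered / total >= coverage:
            break
        chosen.append(form)
        covered += count
    return chosen


def left_padded_bigrams(word, pad="#"):
    """Pad only the beginning, so the first character gets an extra
    bigram and word beginnings weigh more than word endings."""
    padded = pad + word
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_query(forms, variants_by_form):
    """Merge the variant lists of all forms, adding each variant once."""
    query = {}  # dict keys keep insertion order and drop duplicates
    for form in forms:
        for variant in [form, *variants_by_form.get(form, [])]:
            query.setdefault(variant, None)
    return list(query)


# Hypothetical counts of case forms of "talo" (house) in a corpus sample.
counts = {"talo": 50, "talon": 30, "talossa": 15, "taloa": 5}
forms = frequent_case_forms(counts)            # ['talo', 'talon']

# Hypothetical s-gram variant candidates found for each chosen form.
variants = {"talo": ["talo", "taio", "ta1o"], "talon": ["talon", "taio"]}
print(build_query(forms, variants))
# → ['talo', 'taio', 'ta1o', 'talon']
```

The variant `taio` is proposed for both forms but appears in the final query only once, which is the overlap handling described in the text.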

The people who participated in this project were from the Department of Information Sciences at the University of Tampere. In addition, one of the participants came from the National Library of Finland, the Centre for Preservation and Digitisation. In total, five people were involved. I have to admit that at first I was quite surprised that all but one of the participants were from Information Sciences, until I checked that this department includes the disciplines of Computer Sciences, Statistics and Information Studies, which seems like a great combo for research like this.

The results and their impact

The findings are a collection of little things. For example, some OCR errors were more common than the actual words or their modern counterparts. The length of the query word affects what kind of variant candidates are generated. The study shows that the shorter the word is, the more noise there is, since in a short word even the slightest change alters a large part of the word. On the other hand, longer words give more words related to the query: longer words are more often compounds and don’t necessarily appear more than once in the data. With s-gram matching, however, the longer queries were often damaged. It became very clear that morphological processing is required for inflectional languages in order to handle morphological variation. Finally, they found that query expansion gave better results than using inflectional forms, which means that covering different forms of variation is more important than precision in one type of variation. They also suggested some ways to improve quality and, as usual, the combination of re-OCRing and automatic post-correction is the best way to go. Human correction is out of the question since there are already too many suspicious words.
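The point about short words can be illustrated with a small calculation. The sketch below is my own toy example, not something taken from the study: it counts how large a share of a word's adjacent character pairs changes when a single character is misread, using made-up OCR-like errors (`l` read as `i`) and a hypothetical helper name.

```python
def bigrams(word):
    """Adjacent character pairs, in order of appearance."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def share_of_bigrams_changed(original, variant):
    """Share of bigram positions that differ between two
    equal-length words (substitution errors only)."""
    if len(original) != len(variant) or len(original) < 2:
        raise ValueError("words must have the same length of at least 2")
    pairs = list(zip(bigrams(original), bigrams(variant)))
    return sum(a != b for a, b in pairs) / len(pairs)


# One misread character in a short word and in a long compound.
print(share_of_bigrams_changed("talo", "taio"))                # 2 of 3 pairs
print(share_of_bigrams_changed("sanomalehti", "sanomaiehti"))  # 2 of 10 pairs
```

A single substitution disturbs two of the three character pairs in talo but only two of the ten in sanomalehti, so a short query word has far less intact material left to tell a true variant from noise.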

All in all, this research is very important for the study of historical document retrieval, since not much research has been done on the subject yet. In order to preserve cultural heritage, it is vital to get material from our world into digitized form so that it can be studied now and in the future. This study also differs in that it revealed things not just on the string level, but also about conceptual, vocabulary and syntactic variation. Now we have perhaps slightly better ways to digitize materials, which can help us understand the world a whole lot better. So is the project ambitious? I think it is pretty clear that the answer is yes, definitely. This project has relevance in many fields, and not just academic ones: how to retrieve documents as accurately as possible is relevant almost everywhere. As far as realism is concerned, the project is fairly realistic in its research questions. The point is to test a method that would help historical document retrieval in an inflectional language. We’re talking about an institution that specialises in historical documents, the National Library, which means that they have special knowledge of retrieving documents from across the centuries and that they know what they’re doing. Moreover, they explained the things that didn’t go so well and the things that are still unsolved. Nevertheless, I feel that even for them, the promising results are something fairly big. In my opinion, the whole study was conducted in a precise manner, so that everything essential was covered. This study, yet again, shows how difficult something like getting a picture into digitized form can be. So many aspects have to be taken into consideration, and even the smallest processes take time and demand pre-processing and a lot of thinking, and you still won’t get perfect accuracy.

How is this related to digital humanities?

To my mind, this research represents digital humanities at its finest. I really paid attention when they said in the presentation that “collections and tools should be designed more collaboratively”. That again shows the demand for multidisciplinarity, which should be the basis of almost any research, but at least of research done in digital humanities. And yes, it applies to listening to the users of the web pages, because they are obviously the ones for whom the whole thing is made. To improve the experience, users should be reminded in some way that the data they’re working with may not be totally accurate and may not work the way they think it does. I think not everyone knows about these OCR problems; at least I didn’t before hearing about this!

To come back to the relationship with digital humanities, I’d say, based on what I’ve learned during this course, that this study is what you call digital humanities 1.0. It is very quantitative and clearly focuses on retrieval. Yet it is very innovative in itself, using new tools for old questions. I wouldn’t present this as a black-and-white truth, but if I have to make a clear cut, I’d choose this one. On the other hand, as an article about the current situation puts it, there are two tendencies in digital humanities, and one of them highlights the “making of things”. This research is truly more about making concrete things and programming than about “theorizing things” and getting a wider perspective on digitized culture (which is how the other tendency is defined). According to the article, there has been a lot of discussion over what digital technology is: an object or an instrument? I don’t see any point in making these kinds of distinctions. Whatever the research is about, it is already clear enough to say whether it is related to digital humanities. It is actually better to let people have their own subjective opinions on the real truth behind this field of study, because that enables more things. Yet in order for the digital humanities to gain a foothold, some kind of boundaries have to be drawn so that people actually know what they’re doing.

This leads to something else I came to think about. I’m not sure exactly what the participants were experts in, but it seems like the project was not multidisciplinary enough. All of the participants came from the same department of the same university, apart from the one person from the National Library, whose participation I believe is obligatory in this case. However, I still want to point out one thing: how many of these people know that they’re doing digital humanities? There was basically no mention of it. It is true that the field isn’t really well known in this corner of the world, nor is it a very old and established field of study, but it still strikes me. It seems like most of the people doing digital humanities don’t know that they’re doing digital humanities. To me, multidisciplinarity would make this clearer in the sense that people would realise what they’re really doing, and perhaps it would increase motivation towards it, too. Multidisciplinarity is a virtue in itself, and nowadays very much in demand.

When it comes to making historical retrieval better, I started to wonder whether it would be possible to take advantage of the metadata. With metadata they could perhaps categorize the newspapers in a different way to see, for example, which papers followed certain kinds of conventions, and perhaps categorize the papers by their date of publication. This would help to see where the problems start, and it might make it easier to see what the problems are specifically. As it is a project of the National Library, it should be easy for them to use metadata, especially now that they know a bit more about methods that would improve the retrieval. And now that ideas are being thrown in the air, I guess visualization would be something to at least try out. Of course I don’t know whether they already used visualization as part of their study, but it might let them (and us!) spot the problems and see how things evolve.

To give my last take on this subject: as open science is nowadays becoming more and more important, it is good that the methods they used are explained in a really detailed manner. This way other people can use these methods, too. (However, not everything is explained in the study. Where can these methods be found?) Still, the process was described so well that it could be reproduced by anyone, which shows it’s valid at least on that level.

 

References

[1] Järvelin, A., Keskustalo, H., Sormunen, E., Saastamoinen, M. and Kettunen, K. (2015). Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. Journal of the Association for Information Science and Technology. http://dx.doi.org/10.1002/asi.23379

[2] http://www.ennenjanyt.net/2015/02/koodaamisen-ja-kirjoittamisen-vuoropuhelu-mita-on-digitaalinen-humanistinen-tutkimus/


