Guest Post: Digitization of Historical Material and the Impact on Digital Humanities

The following post has been written by Julia Poutanen, who recently completed a BA in French Philology and is now continuing with Master’s studies in the same field. She wrote this essay as part of her Digital Humanities studies and was gracious enough to allow us to publish it here. Thank you!


Essay on the Digitization of Historical Material and the Impact on Digital Humanities

In this essay, I will discuss the things that need to be taken into consideration when historical materials are digitized for further use. The essay is based on the presentation given on the 8th of December 2015 at the symposium Conceptual Change – Digital Humanities Case Studies and on the study by Järvelin et al. (2015). First, I will give basic information about the study. Then I will move on to reflecting on the results of the study and their impact. Finally, I will try to consider the study from the point of view of digital humanities and the sort of impact it has on the field.

The study

In order to understand what the study is about, I will give a short presentation of the background of historical data retrieval, as given in the study itself. The data used is OCRed, which means that it has been transformed into digital form via Optical Character Recognition. OCR is very error-prone with historical texts, for reasons such as paper quality, typefaces, and print quality. Another problem comes from the language itself: it differs from the language used nowadays because of the lack of standardization. Geographical variants were therefore normal, and in general there was no consensus on how to spell all kinds of words, which can be seen both in the variation between newspapers and in the variation of language use. In a certain way, historical document retrieval is thus a “cross-language information retrieval” problem. As for the methods of this study, s-gram matching and frequent case generation were used together with topic-level analysis. S-gram matching is based on the idea that the same concept is represented by different morphological variants and that similar words have similar meanings; these variants can then be used to query for words. Frequent case generation, on the other hand, is used to decide which morphological forms are the most significant ones for the query words; this is done by studying the frequency distributions of case forms. In addition, the newspapers were categorized by topic.
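
To make the idea of s-gram matching more concrete, below is a minimal sketch of character s-grams and a simple overlap score. It is only an illustration of the general technique, not the implementation used by Järvelin et al. (2015), and the example words are my own.

```python
# A toy illustration of s-gram matching: character pairs are collected both
# from adjacent positions (skip 0) and across one skipped character (skip 1),
# and two strings are compared by the overlap of their s-gram sets.

def s_grams(word, skips=(0, 1)):
    """Collect character pairs whose distance is allowed by `skips`."""
    grams = set()
    for skip in skips:
        for i in range(len(word) - skip - 1):
            grams.add(word[i] + word[i + skip + 1])
    return grams

def s_gram_similarity(a, b, skips=(0, 1)):
    """Jaccard overlap of the two words' s-gram sets (0.0 to 1.0)."""
    ga, gb = s_grams(a, skips), s_grams(b, skips)
    return len(ga & gb) / len(ga | gb) if ga or gb else 0.0

# Morphological or OCR-induced variants of a query word score much higher
# than unrelated words, so they can be added to the query as expansions.
print(s_gram_similarity("sanomalehti", "sanomalehdet"))  # close variant
print(s_gram_similarity("sanomalehti", "rautatie"))      # unrelated word
```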

The research questions are basically related more to computer science than to the humanities. The main purpose is to see whether string-matching methods are good for retrieving material – in this case old newspapers – in a highly inflectional language like Finnish. Looked at more closely, the authors try to measure how well s-gram matching captures the actual variation in the data. They also try to see which works better for retrieval: broader coverage of all the word forms or high precision. Topics were also used to analyze whether query expansion would lead to better performance. All in all, the research questions as such belong to computer science, but the humanities lie “under” these questions. How does Finnish behave in retrieval? How has the language changed? How is the morphological evolution visible in the data, and how does this affect retrieval? All these things have to be taken into consideration in order to answer the “real” research questions.

The data used in this research is a large corpus of historical newspapers in Finnish, and it belongs to the archives of the National Library of Finland. In this specific research, however, only a subset of the archive was used: digitized newspapers from 1829 to 1890, comprising 180,468 documents. The quality of the data varies a lot, which makes pretty much everything difficult. People can access the same corpus on the web in three ways – through the National Library web pages, through the corpus service Korp, or by downloading the data as a whole – but not all of it is available to everyone. On the National Library web page, one can query the data with words, but the results are page images. In Korp, the given words are shown in a KWIC view. The downloadable version of the data includes, for instance, metadata and can be hard to handle on one’s own, but for people who know what they are doing, it can be a really good source of research material.

Since the word forms differ from contemporary ones, it is difficult to form a solid list of query words, and certain measures have to be taken to handle this. The methods used divide roughly into three steps. First, string matching was used to obtain lists of words that are close to their contemporary counterparts in the newspaper collections from the 1800s. This was fairly straightforward, since in 19th-century Finnish the variants – whether inflectional, historical or OCR-related – follow patterns with the same substitutions, insertions, deletions, and combinations of these. Left padding was also used to emphasize the stability of the beginnings of words. This makes it very clear that Finnish is a language of suffixes, which mainly uses inflected forms to express linguistic relations, leaving the endings of the words to determine the meaning – and thus, in this case, the similarity of the strings. To handle the noisiness of the data, as the authors call it, a combination of frequent case generation and s-grams was used to cover the inflectional variation and to reduce the noise. This creates overlap among the s-gram variants, which is why each variant was added to the query only once; this improves query performance, although the number of variants is lower. Second, these s-gram variants were categorized manually according to how far they were from the actual words. Then the relevance of the variant candidates was categorized, and the query word lengths and s-gram frequencies were analyzed; some statistical measures were helpful in the analysis. It was important to assess the relevance of these statistics, since irrelevant documents are naturally not very useful for users. That is also why the results were evaluated at the topic level, which helped to see how well the topics and query keys perform.
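
As a rough illustration of the frequent case generation idea mentioned above – keeping only the inflected forms that cover most of a word’s occurrences – here is a small sketch. The forms and counts are invented for the example and the cut-off is arbitrary, so this should not be read as the procedure used in the study.

```python
# A toy version of frequent case generation: keep only the most frequent
# inflected forms of a query word until they cover enough of its occurrences.
from collections import Counter

# Hypothetical corpus counts for inflected forms of "rautatie" (railway).
observed_forms = Counter({
    "rautatie": 120,     # nominative
    "rautatien": 95,     # genitive
    "rautatiellä": 40,   # adessive
    "rautatieksi": 3,    # translative (rare, gets dropped below)
})

def frequent_forms(form_counts, coverage=0.9):
    """Return the smallest set of frequent forms covering `coverage` of all hits."""
    total = sum(form_counts.values())
    selected, covered = [], 0
    for form, count in form_counts.most_common():
        selected.append(form)
        covered += count
        if covered / total >= coverage:
            break
    return selected

print(frequent_forms(observed_forms))  # ['rautatie', 'rautatien', 'rautatiellä']
```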


Utilizing digital newspapers in research

Existing users of the http://digi.kansalliskirjasto.fi web service might have noticed a few new functions, which were launched at the end of last year. There has already been an option to copy a handy reference to Wikipedia, but now we experimented a bit with taking it to the next level by adding support for RefWorks. RefWorks is quite commonly used, for example at the University of Helsinki, for managing references of all kinds.

If this sounds useful or even fun, then follow these instructions.

  • First, find the page or clipping that contains the information you want to cite
  • Then find the ” quote icon and click it to open the reference view
  • Then go to the RefWorks tab and copy the text of the field to your clipboard

You are done on digi’s side.

Digi.kansalliskirjasto.fi and using references

Then in RefWorks, you can import the copied text to your reference list.

Creating a new reference in RefWorks

  1. Log in to RefWorks
  2. In RefWorks, click Import on the right, and then select ‘From text’ in the popup that opens
  3. Paste the citation info from the clipboard into the field and click Import. The new citation is stored and viewable in your list, and you can use it in your articles.

What about BibTeX?
BibTeX is another alternative for references: you can open the reference via the same ” quote tool and copy the entry to your bibliography.

API, which and where?

In order to improve access to the digital materials, one of the things users have been asking for is an API (application programming interface). Basically, via an API you can access the digital data you are interested in – or, as defined by the National Digital Library vocabulary in Finnish: “Ohjelmisto tai ohjelmistokomponentti, jolla eri ohjelmistot voivat vaihtaa tietoja keskenään” (“software or a software component through which different programs can exchange data with each other”).
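
To make it concrete what “access the data” could mean in practice, here is a purely hypothetical sketch of a client fetching search results through such an API. The endpoint, parameters and response fields are invented for illustration; nothing like this exists yet.

```python
# Hypothetical example of using a JSON API from a researcher's script.
# The URL, parameters and field names are invented; this is not a real digi API.
import requests

response = requests.get(
    "https://example.org/digi-api/v1/search",          # invented endpoint
    params={"query": "paalutuskartta", "year": 1910},  # invented parameters
)
response.raise_for_status()

for hit in response.json().get("results", []):         # invented response shape
    print(hit.get("title"), hit.get("date"), hit.get("url"))
```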

Paalutuskartta, 01.11.1910 Rakennustaito : Suomen rakennusmestariliiton ammattilehti no 22 s. 20

Reading the feedback we get about the digital collections of the National Library of Finland at http://digi.kansalliskirjasto.fi, most of it concerns the usage and functions of the user interface. However, we recently got feedback mentioning an API. So how would an API respond to user needs, and what kind of API would work best?

API Jam as a catalyst

With the APIOps people we have already been able to think about APIs on both a meta and a practical level in a couple of sessions. In the last one we discussed the overall lifecycle of an API and at which point to involve the users.

For example, it was good to get brief “lessons learnt” comments about API development, which were, in short:

  1. Pick a user story, but just one!
  2. Don’t over-design
  3. Versioning

At the core, the first one is the key: as in other software development, it is important to get some quick wins so that the ball starts rolling. The second point relates to that as well, since it is very easy to get carried away with all the technical gadgetry and details, but in the end the user just wants the data, so serve that use case. Versioning then fits in nicely, because you can always ‘freeze’ one version but continue development on another; that way users have the option to move to the new one when they are ready for the new features, while avenues stay open for implementing new development and ideas.
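
As a minimal sketch of what that versioning point can mean in practice, the snippet below keeps a frozen v1 endpoint alongside a newer v2 one. It uses Flask only as a convenient example; it is not a description of any planned implementation, and the routes and fields are invented.

```python
# A minimal illustration of API versioning: the old v1 route stays frozen
# while new fields and behaviour are added only to v2.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1/items/<item_id>")
def item_v1(item_id):
    # Frozen contract: existing clients can rely on exactly these fields.
    return jsonify({"id": item_id, "title": "example"})

@app.route("/api/v2/items/<item_id>")
def item_v2(item_id):
    # New development happens here without breaking v1 users.
    return jsonify({"id": item_id, "title": "example", "ocr_text": "..."})

if __name__ == "__main__":
    app.run()
```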

What next?

Keep an eye on this page for developments, or send us ideas and feedback via the http://digi.kansalliskirjasto.fi feedback functionality about what you would be interested in opening up for discussion and what kinds of things we could experiment with together.

Brief intro to Tesseract

13.01.1844 Maamiehen Ystävä no 2 s. 4

If you sometimes wonder about all the tools related to text recognition of, for example, digitized newspapers, you might have heard of a tool called Tesseract. Tesseract is an open-source tool for Optical Character Recognition (OCR).

In this post we briefly go through how to set up and use Tesseract.
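
As a quick taste before the details, here is a minimal sketch of calling Tesseract from Python through the pytesseract wrapper. It assumes the tesseract binary and the Finnish language data (“fin”) are installed, and the file name is just an example.

```python
# Minimal example: OCR one scanned page image with Tesseract via pytesseract.
from PIL import Image
import pytesseract

page = Image.open("newspaper_page.png")               # example input image
text = pytesseract.image_to_string(page, lang="fin")  # 'fin' = Finnish model
print(text)
```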


Tools to support research

Mikko Tolonen, professor of digital resources, was interviewed by the Mikkeli University Consortium (MUC) and said that work is underway to develop methods and tools with which digitized old newspapers can be read as error-free as possible.

We aim to improve the quality of the material so that the number of errors decreases.

emphasizes Mikko Tolonen. And he continues:

This is a completely exceptional collection, which can be searched with any search terms, or materials can even be looked up by locality. The features of the service are being developed all the time.

For researchers and developers

Improving the materials helps all user groups, from researchers to software developers and citizens, even making it possible for new services to emerge when different datasets are combined.

Halley’s Comet; 26.09.1909 Helsingin Sanomat no 222

You can find the full interview on the MUC website.

You can find the National Library’s digitized materials at http://digi.kansalliskirjasto.fi , where the material up to 1910 is freely available to read and search online.