Digital Humanities at Liber2016

This week Helsinki hosted the Liber 2016 conference, the annual conference of LIBER, the main network for research libraries in Europe. This year it was organized at Paasitorni in Helsinki, in co-operation between the National Library of Finland, Helsinki University Library and the Finnish research libraries. There was a record number of attendees and submitted proposals, and the participants came from a wide variety of countries. The theme of the conference was “Libraries Opening Paths to Knowledge”, which allowed for a wide range of talks while still keeping the presentations linked to each other.

Digital humanities, eh?

Digital humanities was visible at Liber as a recognized field, one that requires libraries everywhere to adjust their services towards those who want to work with materials via algorithms in addition to close reading. It also came up in the discussions after sessions and in side conversations during the breaks. Especially on Thursday there were sessions on service design, describing conscious efforts to respond to the new needs of library clients via service design, service modeling and net promoter score metrics, tools that might be more familiar from the private sector. This shows that libraries are opening new paths to knowledge, as the conference tagline goes, in whatever ways are available; even small things, like having people to guide students, can have a huge impact on the daily highs and lows of an individual researcher.

Plenary: Digital Humanities in and of the Library

The plenary session by Peter Leonard of the Yale University Library Digital Humanities Lab was particularly interesting, as he went through, both in theory and in practice, how a library can help in digital humanities. He covered not only the collections but also the research projects that utilize the data.

One of the aspects discussed was special collections and DH, where user engagement, which some might call crowdsourcing, is a way both to enrich the data and to increase awareness of the materials. The first example was mainly about the fascinating Transcribe project, where volunteers who know the Cherokee language can transcribe it using a unique font set. As mentioned on Twitter, this approach of targeted crowdsourcing is similar to the well-known Fenno-Ugrica project.


Enriching options

The second example of enrichment was a kind of crowdsourced tagging of the original materials. In the digitized materials, for example books, volunteers draw bounding boxes around significant terms, e.g. characters, writers, or titles. One of the demos shown then presented a network analysis of works, artists and their content.
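As a rough illustration of how such tags could feed a network analysis, here is a minimal Python sketch using networkx; the annotation format and the entity names are hypothetical, not the lab's actual data model.

```python
# A minimal sketch: turn crowdsourced bounding-box tags into a co-occurrence
# network. The annotation fields and entities below are hypothetical examples.
import itertools
import networkx as nx

annotations = [
    {"page": "p1", "entity": "Artist A"},
    {"page": "p1", "entity": "Work X"},
    {"page": "p2", "entity": "Artist A"},
    {"page": "p2", "entity": "Writer B"},
]

graph = nx.Graph()
for _, boxes in itertools.groupby(
    sorted(annotations, key=lambda a: a["page"]), key=lambda a: a["page"]
):
    entities = [b["entity"] for b in boxes]
    # Connect every pair of entities tagged on the same page.
    for left, right in itertools.combinations(entities, 2):
        weight = graph.get_edge_data(left, right, {}).get("weight", 0)
        graph.add_edge(left, right, weight=weight + 1)

print(graph.edges(data=True))
```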

Analyzing materials

Then, for a different kind of example, the final case was the Vogue magazine, where the content itself was restricted, but the lab had sought ways to still utilize the materials through image-based analysis. For example, we saw how the front page of the magazine changed over the years and how its hue and brightness vary. From the Vogue materials, frequency counts of terms were also made and visualized on a timeline. Via topic modelling it was then possible to let the content speak to the researchers: the method created groups of terms that tend to appear close to each other, and these groupings of terms, themes, researchers can then dig deeper into. Access controls were also put to use, as the system provides the full text of articles to those who have access to the restricted content, while others work on the aggregate level. This is a bit similar to the goals of the NLF’s Aviisi project, where access control has been a key thing to develop in order to increase the availability of more recent materials.
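As a rough idea of what such image-based analysis can look like, here is a minimal Python sketch that computes the average brightness and hue of cover images with Pillow; the file layout is an assumption, and the lab's actual pipeline is certainly more sophisticated.

```python
# A minimal sketch of image-based analysis: average brightness and hue of
# cover images per year. The covers/<year>.jpg layout is a hypothetical assumption.
import colorsys
from pathlib import Path
from PIL import Image

def average_brightness_and_hue(path):
    """Return mean brightness (0-255) and a naive mean hue (0-1) of an image."""
    image = Image.open(path).convert("RGB").resize((64, 64))  # downsample for speed
    pixels = list(image.getdata())
    brightness = sum(sum(rgb) / 3 for rgb in pixels) / len(pixels)
    # Naive arithmetic mean of hue; circular statistics would be more accurate.
    hues = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[0] for r, g, b in pixels]
    return brightness, sum(hues) / len(hues)

for cover in sorted(Path("covers").glob("*.jpg")):
    brightness, hue = average_brightness_and_hue(cover)
    print(f"{cover.stem}: brightness={brightness:.1f}, hue={hue:.2f}")
```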

Programming

Library people understood that “writing software” is one of the key skills a researcher might need. In the discussion after Yale’s plenary session there were considerations of how goal-oriented learning those skills should be. The comparison given was to language courses: they might be taken at some point, and even if the learner doesn’t end up with passable language skills for everyday conversation, the learning can still be seen as valid, since knowledge of the culture, among other things, is gained.

Especially in programming, the software stacks, languages and different toolings rise and fall quite rapidly, and one of the core skills could simply be to juggle many possible programming languages when selecting the most suitable tool for a job. As prof. Tolonen and Dr Leo Lahti mentioned at Liber 2015, there should be reproducible workflows from data to code to results.

For example, the open science portal recently added a reminder that software and computer programs should be cited, which is discussed in more detail, for example, in the Force11 software citation principles. Personally, I’ve always liked it when a GitHub repository contains the citing instructions directly, because then it is relatively convenient to just copy the given format. For example, Mike Jackson lists ways to get a citation from various tools, e.g. from R modules via citation('modulename'), and the final recommendation, to add a command-line option for citation, is actually a very implementable suggestion. In a way, generating a URN for a blog post would also be possible, but would it be useful? Especially if blogging via GitHub, it could be possible to get a URN for a post automatically; it would happen behind the scenes but still be available.
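As a rough sketch of how implementable that recommendation is, here is what a --citation command-line option could look like in Python; the tool name and the citation text are made up for illustration.

```python
# A minimal sketch of a command-line --citation option, in the spirit of the
# recommendation above. The tool name and citation string are hypothetical.
import argparse

CITATION = (
    "Doe, J. (2016). exampletool: A hypothetical analysis tool (v1.0). "
    "https://github.com/example/exampletool"
)

def main():
    parser = argparse.ArgumentParser(prog="exampletool")
    parser.add_argument(
        "--citation",
        action="store_true",
        help="print how to cite this software and exit",
    )
    args = parser.parse_args()
    if args.citation:
        print(CITATION)
        return
    # ... the actual work of the tool would go here ...

if __name__ == "__main__":
    main()
```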

Training DH

On Friday there were talks about training DH, where e.g. Maija Paavolainen from the University of Helsinki told about the current digital humanities course of Digital Humanities Helsinki and what students, content providers, and trainers had learnt along the way. The National Library of the Netherlands also talked about the library’s role in DH: it all starts with (digitized) data, but there was also a good reminder to stay critical towards all tools and towards digital sources in general.

All data, tools and methods have their uses, so knowing what suits where is another area where new competences need to be built. It is also constant work, as the field evolves all the time.

Presentation materials and posters from Liber 2016 can be found via the conference website, and discussions during the conference can be found via the #liber2016 hashtag.


Guest Post: Digitization of Historical Material and the Impact on Digital Humanities

The following post was written by Julia Poutanen, who recently received a BA in French Philology and is now continuing with Master’s studies in the same field. She wrote this essay as part of her Digital Humanities studies and was gracious enough to allow us to publish it here. Thank you!


Essay on the Digitization of Historical Material and the Impact on Digital Humanities

In this essay, I will discuss the things one needs to take into consideration when historical materials are digitized for further use. The essay is based on the presentation given on the 8th of December 2015 at the symposium Conceptual Change – Digital Humanities Case Studies and on the study by Järvelin et al. (2015). First, I will give the basic information about the study. Then I will move on to reflecting on the results of the study and their impact. Finally, I will consider the study from the point of view of digital humanities and the sort of impact it has on the field.

The study

In order to understand what the study is about, I will give a short presentation of the background related to the retrieval of historical data, as also given in the study itself. The data used has been OCRed, which means that it has been transformed into digital form via Optical Character Recognition. With historical texts, OCR is very prone to errors for reasons like paper quality, typefaces, and print quality. Another problem comes from the language itself: the language is different from the one used nowadays because of the lack of standardization. Geographical variants were normal, and in general there was no consensus on how to properly spell all kinds of words, which can be seen both in the variation between newspapers and in the variation of language use. In a certain way, historical document retrieval is thus a “cross-language information retrieval” problem. As for the methods of the study, s-gram matching and frequent case generation were used together with topic-level analysis. S-gram matching is based on the idea that the same concept is represented by different morphological variants and that similar words have similar meanings; these variants can then be used to query for words. Frequent case generation, on the other hand, is used to decide which morphological forms are the most significant ones for the query words, which is done by studying the frequency distributions of case forms. In addition, the newspapers were categorized by topics.
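To make the idea concrete, here is a simplified Python sketch of s-gram matching: character pairs formed with different skip lengths, compared with a Jaccard-style overlap. The skip lengths and example words are illustrative assumptions, not the exact settings of the study.

```python
# A simplified sketch of s-gram matching: character pairs that allow skipped
# characters in between, compared via set overlap. Skip lengths and example
# words are illustrative, not the study's actual parameters.
def s_grams(word, skips=(0, 1, 2)):
    """Character pairs formed by skipping 0..n characters in between."""
    grams = set()
    for skip in skips:
        for i in range(len(word) - skip - 1):
            grams.add(word[i] + word[i + skip + 1])
    return grams

def s_gram_similarity(a, b):
    """Jaccard overlap of the two words' s-gram sets."""
    grams_a, grams_b = s_grams(a), s_grams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# A modern query word compared against spelling/OCR variants in a collection.
query = "sanomalehti"
for candidate in ["sanomalehti", "sanomalehden", "sanomalehtj", "wanhuus"]:
    print(candidate, round(s_gram_similarity(query, candidate), 2))
```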

The research questions are basically related more to computer science than to humanities. The main purpose is to see whether string-matching methods are good for retrieving material, in this case old newspapers, in a highly inflectional language like Finnish. More closely, they try to measure how well the s-gram matching captures the actual variation in the data. They also try to see which is better for retrieval: broad coverage of all the word forms or high precision? Topics were also used to analyze whether query expansion would lead to better performance. All in all, the research questions are phrased in terms of computer science, but the humanities lie “under” these questions. How does Finnish behave when retrieving material? How has the language changed? How is the morphological evolution visible in the data, and how does this affect the retrieval? All these things have to be taken into consideration in order to answer the “real” research questions.

The data used in this research is a large corpus of historical newspapers in Finnish, and it belongs to the archives of the National Library of Finland. In this specific study, however, only a subset of the archive was used: digitized newspapers from 1829 to 1890, comprising 180,468 documents. The quality of the data varies a lot, which makes pretty much everything difficult. People can access the same corpus on the web in three ways – via the National Library web pages, via the corpus service Korp, and by downloading the data as a whole – but not all of it is available to everyone. On the National Library web page, one can query the data with words, but the results are page images. In Korp, the given words are shown in a KWIC view. The downloadable version of the data includes, for instance, metadata, and can be hard to handle on one’s own, but for people who know what they’re doing it can be a really good source of research material.

Since the word forms vary from the contemporary ones, it is difficult to form a solid list of query words, and certain measures have to be taken to handle this. The methods used are divided into roughly three steps. First, they used string matching to obtain lists of words in the newspaper collections from the 1800s that are close to their contemporary counterparts. This was relatively straightforward, since in 19th-century Finnish the variants, whether they are inflectional, historical or OCR-related, follow patterns with the same substitutions, insertions, deletions, and their combinations. Left padding was also used to emphasize the stability of the beginnings of the words. This makes it clear that Finnish is a language of suffixes, which mainly uses inflected forms to express linguistic relations, leaving the endings of the words to determine the meaning – and thus, in this case, the similarity of the strings. To handle the noisiness of the data – as they call it – they used a combination of frequent case generation and s-grams to cover the inflectional variation and to reduce the noise. This creates overlap between s-gram variants, which is why each variant was added only once into the query; this enhances query performance, although the number of variants is lower. Second, they categorized these s-gram variants manually according to how far away they were from the actual words. Then they categorized how relevant the variant candidates were and analyzed the query word length and the frequencies of the s-grams; some statistical measures were helpful in the analysis. It was important to assess the relevance of these statistics, since irrelevant documents naturally aren’t very useful for the users. That is also why they evaluated the results at the topic level, which helped them see how well topics and query keys perform.
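To illustrate the flavour of these steps, here is a simplified Python sketch that keeps only the most frequent case forms of a query word (frequent case generation), pads variants on the left, and adds each variant to the query only once; all frequencies and variant lists below are invented for the example.

```python
# A simplified sketch of building an expanded query: pick the most frequent
# case forms of a query word, take their s-gram variant candidates from the
# collection, left-pad them, and add each variant only once. All data here is
# invented for illustration; the real study works on corpus statistics.
from collections import Counter

def left_pad(word, pad="##"):
    # Padding the beginning emphasizes the (stable) start of Finnish words.
    return pad + word

def frequent_case_forms(case_form_counts, top_n=3):
    """Keep only the most frequent inflected forms of a query word."""
    return [form for form, _ in Counter(case_form_counts).most_common(top_n)]

# Hypothetical frequency distribution of inflected forms of one query word.
case_form_counts = {
    "sanomalehti": 120, "sanomalehden": 95, "sanomalehteä": 40, "sanomalehdessä": 12,
}

# Hypothetical variant candidates found in the collection via s-gram matching.
s_gram_variants = {
    "sanomalehti": ["sanomalehti", "sanomalehtj"],
    "sanomalehden": ["sanomalehden", "sanomalehden"],
    "sanomalehteä": ["sanomalehtea"],
}

query_terms, seen = [], set()
for form in frequent_case_forms(case_form_counts):
    for variant in s_gram_variants.get(form, [form]):
        padded = left_pad(variant)
        if padded not in seen:  # add each variant only once
            seen.add(padded)
            query_terms.append(padded)

print(query_terms)
```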
