Digitalia at the summer school of document analysis & recognition (SSDA2018-TC10/TC11)

The Digitalia project participated in the #SSDA2018 summer school, whose aim was to gather researchers who work on document analysis of various materials. This time the seminar consisted of roughly 40 people from all over the world.

The summer school was opened by Jean-Marc Ogier, who set the table with interesting examples of the past and present of document analysis and of how changes in media cause new research fields to appear: from digitized materials to born-digital ones, from physical signatures to online signatures and so forth, which also enable capturing new kinds of information. He also presented the International Association for Pattern Recognition (IAPR) working groups: IAPR TC-10 is the Technical Committee on Graphics Recognition and IAPR TC-11 covers Reading Systems. Finland is part of the IAPR via the Pattern Recognition Society of Finland.

There was also an introductory tour of the L3i laboratory, where various researchers gave us a bit of their time and presented their work via demos. We got a glimpse of the equipment used in the lab and of the various research done there.

One demo was about segmentation and text capture in comics, i.e. recognition of individual panels, then of the text, and also of the characters who are speaking in the comic. The material for which it was developed consisted of various online comics and some digitized ones. It was pattern recognition all the way: first detecting the contours of the panels, then the characters, and also the speech balloons, which were then processed in order to extract the text from them.
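As a rough, hypothetical illustration of such a pipeline, the sketch below finds panel-sized contours on a comic page with OpenCV; the thresholds, the Otsu binarization and the file name are my own assumptions, not the demo's actual implementation.

```python
# Minimal sketch of comic panel detection with OpenCV (not the L3i system).
# Assumes panels are large, roughly rectangular regions separated by gutters.
import cv2

def detect_panels(image_path, min_area_ratio=0.02):
    page = cv2.imread(image_path)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    # Invert + Otsu threshold so panel borders and content become foreground.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # OpenCV 4 signature: (contours, hierarchy).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    page_area = page.shape[0] * page.shape[1]
    panels = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h >= min_area_ratio * page_area:   # keep only panel-sized boxes
            panels.append((x, y, w, h))
    # Read panels roughly in Western order: top-to-bottom, then left-to-right.
    panels.sort(key=lambda box: (box[1], box[0]))
    return panels

if __name__ == "__main__":
    for box in detect_panels("comic_page.png"):   # hypothetical input file
        print(box)
```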

The second demo was about transcribing video: the system picked one frame from each second of the given input video and then transcribed both the content of the captured image and the audio. With the help of a search engine it was then possible to locate a given passage simply by searching for its text.
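A minimal sketch of the one-frame-per-second sampling step is shown below; pytesseract stands in only as a placeholder for whatever image transcription the demo actually ran, and the audio side is omitted.

```python
# Sketch: sample one frame per second from a video with OpenCV, then hand
# each frame to an OCR step (pytesseract used purely as a stand-in).
import cv2
import pytesseract

def transcribe_frames(video_path):
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unknown
    transcripts = []
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % int(round(fps)) == 0:    # roughly one frame per second
            second = frame_index / fps
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray).strip()
            transcripts.append((second, text))
        frame_index += 1
    capture.release()
    return transcripts

if __name__ == "__main__":
    for second, text in transcribe_frames("lecture.mp4"):  # hypothetical input
        print(f"{second:7.1f}s  {text[:60]}")
```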

One demo was an experiment in monitoring traffic in the harbour. The system detected ships coming and going: it picked the boats out of the moving image, which enabled statistics on how many boats arrived and how many left. This worked in real time, highlighting each boat in the video feed but also plotting it on a map of the harbour area, which could be very useful in a dashboard. A second live video feed system recognized faces and aimed to capture sentiments, for example surprised or content. There were six basic emotions, for which the system showed 'bars' based on their probabilities.

Bart Lamiroy from the University of Lorraine talked about a familiar concern in "Honest, Reproducible Reporting of Results is Difficult … an Analysis and some Good Practices": how to do good research and how to spot it when you see it. The familiar issues are there: which software and which modules were used, and how fast things change, make it harder for the next researcher to reproduce the results. There are many tiny details which the publication might not include, and the tenet of "publish or perish" can put pressure on researchers that leaves them unable to spend enough time on documentation, which could be invaluable for the next reader.

Andreas Fischer from the University of Fribourg, Switzerland then introduced the participants to structural methods of handwriting analysis. His research team had worked e.g. with George Washington's letters from the Library of Congress. Their DIVAServices platform (http://divaservices.unifr.ch/data) attempts to provide a place where a researcher can share data and methods for other researchers to use, reducing the amount of software that needs to be installed on one's own machine. DIVAServices enables remote processing via specified APIs: a researcher first uploads documents (i.e. page images) to the service, then calls the API of a method and gets the results back via that API. A very nifty idea, and one more implementation of such analysis systems for researchers. At the moment it is suited to small experiments, as the service runs on a single machine, but that was enough for one participant to use its methods to distinguish between different Japanese script characters, which in the end won her the excellence award of the summer school.
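To give an idea of the upload-then-invoke pattern, here is a hedged sketch of calling such a remote service with Python's requests library; the endpoint paths and JSON field names are hypothetical placeholders, not the documented DIVAServices API.

```python
# Sketch of the upload-then-invoke pattern of a remote document analysis
# service. Endpoint paths and JSON field names below are hypothetical.
import time
import requests

BASE_URL = "http://divaservices.unifr.ch/api"   # placeholder base URL

def run_remote_method(image_path, method_path):
    # 1. Upload a page image to the service.
    with open(image_path, "rb") as image_file:
        upload = requests.post(f"{BASE_URL}/collections",
                               files={"file": image_file})
    upload.raise_for_status()
    collection = upload.json()["collection"]        # hypothetical field name

    # 2. Call the chosen method on the uploaded data.
    job = requests.post(f"{BASE_URL}/{method_path}",
                        json={"data": [{"inputImage": collection}]})
    job.raise_for_status()
    result_url = job.json()["results"][0]["resultLink"]   # hypothetical

    # 3. Poll until the asynchronous job has finished, then return the results.
    while True:
        result = requests.get(result_url).json()
        if result.get("status") == "done":           # hypothetical status field
            return result
        time.sleep(2)

if __name__ == "__main__":
    print(run_remote_method("page_image.jpg", "binarization/otsu/1"))
```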

Poster session

The poster session was also eye-opening – so many different kinds of problems where machine learning was used when possible. There were OCR issues, very fragile materials like palm leaves, and scripts and languages for which no Unicode fonts even exist. Team sizes varied, too: some researchers were alone in their university, while one lab had 25 people doing research. In any case, hearing about research done elsewhere was interesting, and some things will need to be followed up later on.

 

Deep Neural Networks and data

Seiichi Uchida from Kyushu University (Japan) walked us into deep neural networks in document analysis and actively promoted thinking that goes beyond 100% accuracy. His slides can be viewed below:

 

Vincent Poulain-d'Andecy from YOOZ (France) gave the industrial talk, with a multitude of examples of all the document types and formats we actually live with. For example, IAPR's TC10 and TC11 maintain a number of datasets which researchers can use to check the capabilities of their own algorithms against a given dataset, thus providing a joint baseline and benchmark that helps in evaluating how well a specific algorithm works.

David Doermann from the University at Buffalo (USA) talked about document forensics in document analysis and about the current situation, where images, videos and audio can be created in such a way that the real thing can be impossible to distinguish from a fake. He defined the terminology and gave various examples from different centuries of how forgeries and tampering have been attempted, and on the other hand how they are and were being prevented. A comforting thought, anyway, was that:

“It only takes one piece of inconsistent evidence to prove something is not authentic”

 

He also highlighted how unique each person's handwriting actually is, as writers have a hard time changing ingrained habits: letter design, spacing and proportions rarely change, and in suspect cases there are properties which experts can follow.

Marçal Rusiñol from CVC, Universitat Autònoma de Barcelona, Spain went into the details of large-scale document indexing, digging a bit deeper into how the indexing actually works. He had good points on how to make things work for the end user, i.e. how to improve human-computer interaction, and how to rank documents by semantic similarity to a query, which goes a bit beyond just the search word given by the user.

Dimosthenis Karatzas from CVC, Universitat Autònoma de Barcelona, Spain then talked about the urban world and scene text understanding – how to detect text in a noisy and vibrant environment, which is important since, as he said, over 50% of urban imagery contains text in some form. It turns out the trick is, in a way, in the context: text has typical features like alignment and character shapes, which make it stand out even if the observer does not necessarily know the language in question. Many machine learning algorithms are suitable for word and object spotting, but the desire was to get to a higher level, to true information spotting, where the idea is that a machine could answer questions like "what is the price of one litre of milk" in any shop. For this they had generated a simulated 3D traffic environment in which autonomous cars can be immersed in various weather and traffic simulations, and the idea now was to create a simulated "supermarket" on which the aforementioned information spotting could be trained.

Jean-Yves Ramel, LIFAT, University of Tours (France) gave a talk titled "Interactive approaches and techniques for Document Image Analysis", comparing computer vision and document image analysis problems and solutions, which turned out to have quite a few similarities. He highlighted the service aspect of a tool or algorithm: it would be beneficial to let domain experts fine-tune the algorithms and define the parameters that are meaningful in a certain context. Bringing in the user, who in fact is another researcher, would enable them to do more, in a limited time, with a given dataset. In short, researchers should take part in defining the services that are meant for them, to ensure that the features and data are available as needed.

Jean-Christophe Burie from the University of La Rochelle then gave the last talk and a workshop where the idea was to experiment with comics and their segmentation into speech bubbles. This had been one of the demos on the tour of the university's L3i laboratory, so it was interesting to experiment with it even for a short while. Python and OpenCV were used for the image segmentation, and Tesseract for then extracting the text from the speech balloons in the comics.
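In the same spirit as the workshop exercise (Python, OpenCV and Tesseract), a minimal sketch might look like the following; the white-blob heuristic and the threshold values are my own simplifications rather than the workshop's reference solution.

```python
# Sketch in the spirit of the workshop: find white speech-balloon blobs with
# OpenCV and read their text with Tesseract. Thresholds are illustrative only.
import cv2
import pytesseract

def read_speech_balloons(page_path, min_area=2000):
    page = cv2.imread(page_path)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    # Speech balloons are typically near-white regions with dark outlines.
    _, white = cv2.threshold(gray, 230, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(white, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    texts = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h < min_area:            # skip small highlights and noise
            continue
        balloon = gray[y:y + h, x:x + w]
        text = pytesseract.image_to_string(balloon).strip()
        if text:
            texts.append(((x, y, w, h), text))
    return texts

if __name__ == "__main__":
    for box, text in read_speech_balloons("comic_page.png"):  # hypothetical file
        print(box, text)
```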

 

At the DHN18 conference

Digitalia was well represented at the Digital Humanities i Norden (DHN18) conference, which was held on 7–9 March 2018 in Helsinki.

Results and work of Digitalia were presented, for example, in a short paper called "Creating and using ground truth OCR sample data for Finnish historical newspapers and journals" (paper and slides). This paper was one of the few given the label of distinguished short paper. The announcement of the ground truth material also drew some interest on social media, so perhaps others interested in improving OCR methods will experiment with the base data. We also got some ideas on how to improve the data packages, which we can probably act on down the line.

Ongoing work was also visible in the poster session, with a poster describing the groundwork aimed at making the next steps in processing the materials easier. The poster was titled "Research and development efforts on the digitized historical newspaper and journal collection of The National Library of Finland". Based on the number of people who are using newspapers, and especially Finnish newspapers, there is an opportunity here for improvements that will benefit many researchers along the way.

All in all, the whole conference was full of interesting topics and researchers from multiple fields, which shows the vitality of the "digital humanities scene" in Finland, but also in the Nordics as a whole. There were also many papers of note relating to the National Library of Finland via its newspapers or other listed data:

  • Semantic National Biography of Finland  (paper)
  • Digitised newspapers and the geography of the nineteenth-century “lingonberry rush” in Finland. (paper)
  • Sculpting Time: Temporality in the Language of Finnish Socialism, 1895–1917 (paper)
  • Two cases of meaning change in Finnish newspapers, 1820-1910 (paper)
  • Geocoding, Publishing, and Using Historical Places and Old Maps in Linked Data Applications (paper)
  • A long way? Introducing digitized historic newspapers in school, a case study from Finland, which continued from an earlier project (paper)
  • Local Letters to Newspapers – Digital History Project (paper)

Potentially interesting for further development were, for example, the paper about Sentimentator, full title "Sentimentator: Gamifying Fine-grained Sentiment Annotation" (paper), which enables easy creation of training data: sentences annotated with sentiments from a number of predefined sentiment categories. The talk "The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data" (paper) utilized social media data and was preparing an online tool for access while still keeping generic open data needs in mind. The recently started Wikidocumentaries project (paper) is an interesting case, as it could act as a bridge between local history endeavours and citizen or local scientists.

All in all, a very thought-provoking conference, and it was exciting that it was held in Helsinki, making it easy to attend for Finnish DH people and naturally for those from the other Nordic countries. Discussions during the breaks were lively and gave more background detail on the various papers and presentations.

 

1918-1929 Finnish newspapers readable for 2018

The National Library of Finland has come to an agreement with the copyright organization Kopiosto on opening the newspapers and journals digitized by the National Library for use for the duration of the year 2018. Read more in the announcement (in Finnish).

Bundle of lion cubs / Source: Joulukukka : lasten joululehti, 01.12.1924, p. 17 https://digi.kansalliskirjasto.fi/aikakausi/binding/1354292?page=17

A hint on using the search with the newer materials as well: if you are looking for a specific newspaper from a specific date, use the dedicated search fields.

Free text search is not optimal for finding a specific newspaper, as it can lead to typos and errors in dates.

Instead, use the separate search fields if you want to target a certain specific newspaper and you even know the date. You can use one or several search options within one search; more options can be found behind the yellow ?-mark on the search page of Digi.
Pick the date and the newspaper if you want to target that specific issue. Free text search and all of its options are useful if you have a specific search idea but want to see, across newspapers (or journals), what turns up.

So have fun, and share the news of the opening with anyone who would like to read about the news, weather, sports, politics or any other topics the newspapers covered back in the day. There might be surprises ahead, as things have changed since then.

+Edit: The Kopiosto news item with statements from both parties is now also online.

Heldig 2017 Summit

Digitalia was also presented at the Heldig Summit on 18.10.2017, which had 85 presentations and a bit over 230 participants from universities and cultural heritage organizations.

“Developing the Digital World Together”

Heldig director, professor Eero Hyvönen, opened the summit by explaining the basis of Heldig and its growth both in personnel and in the Facebook group. A Digital Humanities forum is also being formed for the life-long learning of researchers, where online learning materials and MOOCs are being developed; if there are short tutorials to be shared, they would be interested.

Session 1 was about researcher use. Professor Hannu Toivonen explained data science and the focus of the Data Science master's programme. There is also the Helsinki Centre for Data Science, a multi-disciplinary effort coordinated from Kumpula (more info at Data Science MSc -> "Contact"). The University of Helsinki has lots of interesting research, such as web-scale monitoring of the news media (http://newswebs.cs.helsinki.fi) or https://blogs.helsinki.fi/methodology, which is for example building an R environment for analysing text-based Suomi24 discussion forum data. Everyone may already have heard of Citizen Mindscapes, which was presented through the unique interests of the researchers working in the collective. The future generation of digital humanists advances by learning; one example is the introduction to open data science, which professor Hyvönen feels is important for DH and for UH at large. Legal issues were presented by the Legal Tech Lab, and the Growing Mind project, which has just received Academy of Finland funding (2018-2023), concerns the digitization of schools. FIN-CLARIN was there to describe its infrastructure for DH, mentioning e.g. the FIN-CLARIN corpora, 18 GW of data in over 650 databases, which can be found via Kielipankki (both the corpora and the tools). The Bank of Finnish Terminology in Arts and Sciences is also a unique research infrastructure, intended as a continuous database for all research done in Finland.

The cultural heritage institutions' role

After the first break the National Library of Finland presented its services: Finna (in a way a metadata aggregator for several cultural heritage institutions), Finto.fi for ontologies, and finally digi.kansalliskirjasto.fi with its extensive newspaper collection, which is being used e.g. in the Digitalia and Comhis projects. The National Archives talked about arkisto.fi/df (and mentioned a seminar on 24.11).

Kotus presented its materials via video, which is available online:

Then SKS presented its scholarly open access monographs at https://oa.finlit.fi, offering open access research books, as well as Codices Fennici at https://www.codicesfennici.fi. Elias Lönnrot Letters Online offers XML, and the service as a whole aims for total shareability, so that anyone can use the tools developed by the digital humanists and the data can be downloaded by collection or as a whole.

In the afternoon the rapid fire of presentations continued with a CSC and Ministry of Finance presentation, where the focus was on building up a common infrastructure which researchers could also use. Senior researcher Toni Ryynänen from the Ruralia Institute presented how the digitized newspapers, across their full timeline, can give insight into how specific discussions have evolved.

 

The Helsinki University Library talk was about linked data, with a demo also available during the evening session. https://opensubtitles.org was mentioned as a data source, used in http://opus.lingfil.uu.se to create parallel corpora where the same text is available in multiple languages, which in a way visualizes the differences but also the points where translations into different languages agree. The Department of Modern Languages of UH also showed examples of how they had utilized TensorFlow for analyzing datasets, ending with the conclusion that speech and DH are a perfect match.

Researcher world

One theme of the summit was also the concern of how to make the complex world easier for researchers. Mietta Lennes presented http://chipster.csc.fi and Mylly (the Mill, https://www.kielipankki.fi/support/mylly), which take care of the tool environment: with just a CSC account it is possible to log in and access the resources, select data, pick a tool to run, and after a while the results appear and can be viewed in the user interface, downloaded as an Excel file, or shared with a colleague. Based on the brief run-through at the summit, definitely something researchers should take a look at!

Professor Timo Honkela was also present, talking about the various fields which digital humanities combine. His new book is now in print and will be available soon; the hope is that the Peace Machine activities will lead to better lives from now on and even until 2117.

The digital Russia studies presentation from the Aleksanteri Institute covered Digi-Pravda and the various ways digitalisation is visible. Daria Gritsenko also asked anyone interested in Women in Tech to check out the website at http://www.digitalicons.org (Studies in Russian, Eurasian and Central European New Media). Professor Parvinen told about the mixed reality user laboratory. The next topics covered networks of different types, including Aalto's work with family research and social networks – what the networks tell us and why they have formed the way they have. Digital cultural history and WarSampo were also part of the presentation pool.

Besides Digitalia, the Comhis project was also mentioned as one of the research cases using the digitized newspapers as source material.

Final thoughts

All in all, it seems that digital humanities in Finland has a good network of people, and let's hope there will be possibilities to collaborate later on, too, now that we have gained an initial glimpse of each other's work. Unique materials, unique methods and also unique ideas: these will help current research but also serve as a start for the next wave of data scientists, or so-called digital humanists, for whom dedicated training programmes are also forming.

All of the materials presented can be found on the Heldig pages, and even more via the publications of each research team. Let's see: if the summit returns next year, it could beat its record of 85 presentations and aim for even 100, as there seems to be so much happening all around.

 


Guest Post: Digitization of Historical Material and the Impact on Digital Humanities

The following post was written by Julia Poutanen, who recently received a BA in French Philology and is now continuing with Master's studies in the same field. She wrote this essay as part of her Digital Humanities studies and was gracious enough to allow us to publish it here. Thank you!


Essay on the Digitization of Historical Material and the Impact on Digital Humanities

In this essay, I will discuss the things that need to be taken into consideration when historical materials are digitized for further use. The essay is based on the presentation given on the 8th of December 2015 at the symposium Conceptual Change – Digital Humanities Case Studies and on the study by Järvelin et al. (2015). First, I will give the basic information about the study. Then I will move on to reflecting on the results of the study and their impact. Finally, I will try to consider the study in relation to digital humanities and the sort of impact it has on the field.

The study

In order to understand what the study is about, I will give a little presentation of the background related to the retrieval of historical data, as also given in the study itself. The data used is OCRed, which means that the data has been transformed into a digital form via Optical Character Recognition. In this situation, the OCR is very prone to errors with historical texts for reasons like paper quality, typefaces, and print. Another problem comes from the language itself: the language is different from the one used nowadays because of the lack of standardization. This is why geographical variants were normal, and in general, there was no consensus on how to properly spell all kinds of words, which can be seen both in the variation between newspapers and in the variation of language use. In a certain way, historical document retrieval is thus a "cross-language information retrieval" problem. As for the methods in this study, s-gram matching and frequent case generation were used along with topic-level analysis. S-gram matching is based on the fact that the same concept is represented by different morphological variants and that similar words have similar meanings. These variants can then be used in order to query for words. Frequent case generation, on the other hand, is used to decide which morphological forms are the most significant ones for the query words. This is done by studying the frequency distributions of case forms. In addition, the newspapers were categorized by topics.
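To make the idea of s-gram matching concrete, the sketch below generates character s-grams (bigrams that may skip characters) for two word forms and compares them with Jaccard overlap, so that an inflected or OCR-damaged variant can still score high against its contemporary counterpart. The skip settings and example words are illustrative, not the exact configuration of Järvelin et al.

```python
# Illustrative sketch of character s-gram matching (skip-grams of characters):
# a historical/OCR variant scores high against its contemporary counterpart
# even when individual characters differ. Parameters are illustrative only.
def s_grams(word, skips=(0, 1)):
    """Character bigrams with the given skip distances, e.g. 'kala' ->
    skip 0: ka, al, la ; skip 1: kl, aa."""
    grams = set()
    for skip in skips:
        for i in range(len(word) - skip - 1):
            grams.add((skip, word[i] + word[i + skip + 1]))
    return grams

def s_gram_similarity(word_a, word_b):
    """Jaccard overlap of the two words' s-gram sets."""
    grams_a, grams_b = s_grams(word_a), s_grams(word_b)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

if __name__ == "__main__":
    query = "suomalainen"                  # contemporary form
    candidates = ["suomalaisen",           # inflected variant
                  "suomalaincn",           # OCR confusion e -> c
                  "ruotsalainen"]          # a different word
    for candidate in candidates:
        print(candidate, round(s_gram_similarity(query, candidate), 3))
```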

The research questions are basically related more to computer science than to the humanities. The main purpose is to see whether string matching methods work well for retrieving material, which in this case is old newspapers, in a highly inflectional language like Finnish. Looked at more closely, they try to measure how well the s-gram matching captures the actual variation in the data. They also try to see which is better when it comes to retrieval: broader coverage of all the word forms or high precision? Topics were also used to analyze whether query expansion would lead to better performance. All in all, the research questions are related only to computer science, but the humanities lie "under" these questions. How does Finnish behave when retrieving material? How has the language changed? How is the morphological evolution visible in the data, and how does this affect retrieval? All these things have to be taken into consideration in order to answer the "real" research questions.

The data used in this research is a large corpus of historical newspapers in Finnish, and it belongs to the archives of the National Library of Finland. In this specific research, however, they only used a subset of the archive, which includes digitized newspapers from 1829 to 1890 with 180 468 documents. The quality of the data varies a lot, which makes pretty much everything difficult. People can access the same corpus on the web in three ways – the National Library web pages, the corpus service Korp, and the data can be downloaded as a whole – but not all of it is available to everyone. On the National Library web page, one can run queries with words from the data, but the results are page images. In Korp, the given words are shown in a KWIC view. The downloadable version of the data includes, for instance, metadata, and can be hard to handle on one's own, but for people who know what they are doing, it can be a really good source of research material. Since the word forms vary from the contemporary ones, it is difficult to form a solid list of query words. In order to handle this, certain measures have to be taken. The methods used are divided into roughly three steps. First, they used string matching in order to obtain lists of words that are close to their contemporary counterparts in the newspaper collections from the 1800s. This was done quite easily, since in 19th-century Finnish the variants, whether inflectional, historical or OCR-related, follow patterns with the same substitutions, insertions, deletions, and their combinations. Also, left padding was used to emphasize the stability of the beginnings of the words. This way, it becomes really clear that Finnish is a language of suffixes, which mainly uses inflected forms as a way to express linguistic relations, leaving the endings of the words to determine the meaning – and thus, in this case, the similarity of the strings. But to handle the noisiness of the data – as they call it – they used a combination of frequent case generation and s-grams to cover the inflectional variation and to reduce the noisiness. This creates overlap among the s-gram variants, which is why each variant was added only once to the query; this enhances query performance, although the number of variants is lower. Second, they categorized these s-gram variants manually according to how far away they were from the actual words. Then they categorized how relevant the variant candidates were and analyzed the query word length and the frequencies of the s-grams. Some statistical measures helped in the analysis. It was important to assess the relevance of these statistics, since irrelevant documents naturally are not very useful for the users. That is why they also evaluated the results at the topic level, which helped them see how well the topics and query keys perform.
