This time, the Digitization Project of Kindred Languages goes west and is attending the DATeCH 2014 conference and Digitisation Days in Madrid, 19-20 May. The Digitisation Days event is organised by the IMPACT Centre of Competence and Succeed, both funded by the European Union. Our team is attending this event in order to learn about the current state of the field and to discuss various aspects of the text enhancement of digitized material. See the full programme here. You may follow the comments made by conference attendees on Twitter with the hashtag #digidays. This is our recap of the presentations that took place on day 1.
Generally speaking, my thoughts after day 1 are more than positive. Reflecting on the day's presentations against the work done in our project, especially when it comes to the OCR editor, I can proudly say that we are on the right track. Project-wise, it seems that we have definitely solved some serious issues that are still haunting others, but it is good to see that we still have plenty to learn from the kindred projects presented today. It is a pity that we did not have the chance to contribute to the discussion by presenting a paper, but there will be other chances to do so.
The day took off with brief addresses by the hosting institutions and was followed directly by Session 1: Document Analysis & OCR.
The first paper was given by David Hébert of the University of Rouen/LITIS. He showcased a non-OCR-based solution for article extraction from old digitized newspapers. Recognition operates on the pixels of the digitized items with the help of a Conditional Random Field model. Basically, this means that the algorithm is able to assign entities to their correct, logical place and present the result as a logical structure.
The second speaker, Christian Clausner (PRImA), spoke on precise region description for document layout detection. According to Clausner, there is a need for precise region descriptions, and I do not disagree, as state-of-the-art OCR systems like ABBYY and Tesseract cannot tackle the problem of overlapping entities, such as pictures and columns. Clausner and colleagues have created a solution for this called polygonal fitting: first you create bitmasks, then you fill the wanted gaps or exclude the unwanted regions, then you trace the contour, and finally you create a polygon. It is a pity that the project website is down for maintenance, so additional information cannot be retrieved and linked here.
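The bitmask-fill-trace-polygon pipeline can be sketched in a few lines of plain Python. To be clear: this is my own simplified illustration of the idea, not PRImA's implementation, and a convex hull stands in for proper contour tracing.

```python
# Toy sketch of the polygonal-fitting pipeline: start from a bitmask,
# close small gaps (dilation + erosion), collect the region's cells,
# and fit a polygon. Convex hull used here as a crude stand-in for
# real contour tracing -- an assumption, not PRImA's actual method.

def dilate(mask):
    """3x3 binary dilation on a list-of-lists bitmask."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if 0 <= y + dy < h and 0 <= x + dx < w):
                out[y][x] = 1
    return out

def erode(mask):
    """3x3 binary erosion (keep a cell only if all neighbours are set)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if all(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if 0 <= y + dy < h and 0 <= x + dx < w):
                out[y][x] = 1
    return out

def close_gaps(mask):
    """Morphological closing: dilation followed by erosion fills gaps."""
    return erode(dilate(mask))

def convex_hull(points):
    """Andrew's monotone chain; returns the hull counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def fit_polygon(mask):
    """Close gaps, collect filled cells, and fit a polygon around them."""
    closed = close_gaps(mask)
    cells = [(x, y) for y, row in enumerate(closed)
             for x, v in enumerate(row) if v]
    return convex_hull(cells)
```

For a solid 4x4 region the fitted polygon collapses to the four corner points, which is exactly the compact description one wants instead of storing every pixel.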
Sajid Saleem was the hard-core text recognition man of the day. He presented a paper on the recognition of degraded ancient characters based on dense SIFT. For an old Slavist like me, who hated studying the Glagolitic alphabet during his studies, this was a breathtaking moment. Not only because the Missale Sinaiticum was brought into daylight character by character, but also because the SIFT-based "character trackback" is an impressive example of what can be done with really noisy data. They essentially need to reconstruct the characters pixel by pixel. During the first coffee break, I had a chance to ask Sajid whether they use any morphological analysis in the character (text) reconstruction, but he indicated that it is not possible with this material, since the characters blend too noisily into the background.
Session 2 after the coffee break was opened by Klaus Schulz of CIS, the Centre for Information and Language Processing at the University of Munich, who held a presentation on "Automated Assignment of Topics to OCRed Historical Texts". They have taken on the challenge of automatically computing all topics and fields that adequately describe the contents of a given document, and of arranging those topics into a hierarchical order. TopicZoom is the brand name, and additional information can be found here. This presentation reminded me that we have a similar case with our material and clients. Could researchers of kindred languages benefit from this tool when searching the semantics of Mordvinic newspapers of the 1920s, or could the Kirjallisuuspankki (The Finnish Literature Bank) project get something out of this? I am convinced that this method could gain wider audiences, among literature studies in particular, despite the fact that, according to Schulz, the primary use is to create better metadata on digitized content. A pity that it is available for German and English material only. At the moment.
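To make the idea of hierarchically ordered topics concrete, here is a deliberately naive sketch: topics matched by keyword, with ancestors pulled in so the result respects the hierarchy. The topic tree and keywords below are invented for illustration; TopicZoom's actual method is far richer than this.

```python
# Toy hierarchical topic assignment. TOPICS is an invented example
# hierarchy, not TopicZoom's taxonomy: each topic has keywords and a
# parent, and a matched topic always brings its ancestors along.

TOPICS = {
    "science":     {"keywords": {"experiment", "theory"},      "parent": None},
    "linguistics": {"keywords": {"language", "orthography"},   "parent": "science"},
    "ocr":         {"keywords": {"recognition", "scan"},       "parent": "linguistics"},
}

def assign_topics(text):
    """Assign every topic whose keywords occur, plus its ancestors."""
    words = set(text.lower().split())
    hits = {t for t, spec in TOPICS.items() if spec["keywords"] & words}
    result = set()
    for t in hits:
        while t is not None:       # walk up to the root
            result.add(t)
            t = TOPICS[t]["parent"]
    return sorted(result)
```

A document mentioning "scan" and "orthography" would thus be tagged with ocr, linguistics and science, giving the hierarchical description of content the paper is after.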
Petar Mitankin's paper followed Schulz's. "An approach to unsupervised historical text normalisation" was the title, and he presented a statistical machine translation system that enables translation from a historical form of a language to the modern one. Something similar is required in our own project, in which we need to build search capabilities that allow searching within content in old orthography as well. For us, quite typical examples are Russian characters of the old orthography, but similar cases are found in other kindred languages. Could the normalisation handle transliteration as well and convert text from Cyrillic to Latin letters according to the ISO 9 transliteration scheme? That could be beneficial for library data at least.
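The transliteration side of that wish is actually the easy half, since ISO 9 is a one-to-one character mapping. Below is a minimal sketch with a partial mapping table of my own (including a couple of pre-reform letters such as the yat); a complete ISO 9 implementation would cover the full alphabet.

```python
# Minimal ISO 9 style Cyrillic-to-Latin transliteration. The table is
# partial and written from memory for illustration -- verify against
# the ISO 9:1995 standard before using it on real library data.

ISO9 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "ш": "š", "щ": "ŝ", "ы": "y", "э": "è", "ю": "û", "я": "â",
    "ё": "ë",
    # pre-1918 orthography, typical of our Russian-era material
    "ѣ": "ě",   # yat
    "і": "ì",   # decimal i
}

def transliterate(text: str) -> str:
    """Map each Cyrillic character to its Latin equivalent, keeping case."""
    out = []
    for ch in text:
        low = ch.lower()
        if low in ISO9:
            lat = ISO9[low]
            out.append(lat.upper() if ch.isupper() else lat)
        else:
            out.append(ch)  # pass through anything unmapped
    return "".join(out)
```

Because the mapping is reversible, the same table could serve search in both directions: a query in modern Latin transliteration could be matched against old-orthography Cyrillic text and vice versa.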
Arianna Ciula and Øyvind Eide performed the first duet of DATeCH 2014. Their presentation grasped the area of ontologies, TEI and CRM in particular. Once again, I started to think about Finnish parallels to this problem area, and of course one was found in the National Digital Library project, which has improved the availability and usability of the electronic materials of the GLAM sector.
Session 3 focused on post-correction, and for me this session was the highlight of the day. Evershed from Australia presented their logic for better OCR correction of newspapers. Their concise web page says it all, but in brief, OverProof also uses a visual representation model to evaluate the likelihood of various OCR errors.
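The core intuition, as I understood it, is that some character confusions are visually far more plausible than others ("rn" misread as "m", "l" as "1"), so candidate corrections should be ranked accordingly. OverProof's real model is not mine to reproduce; the sketch below just illustrates the idea with an invented confusion table plugged into an edit-distance DP.

```python
# Toy confusion-weighted correction ranking. The CONFUSIONS table and
# its costs are invented for illustration: visually plausible OCR
# confusions get a low cost, generic edits cost 1.0.

CONFUSIONS = {
    ("rn", "m"): 0.2, ("m", "rn"): 0.2,
    ("l", "1"): 0.3, ("1", "l"): 0.3,
    ("c", "e"): 0.5, ("e", "c"): 0.5,
}

def confusion_distance(obs: str, cand: str) -> float:
    """Edit distance where known OCR confusions are cheap substitutions."""
    INF = float("inf")
    n, m = len(obs), len(cand)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == INF:
                continue
            if i < n and j < m:  # match or generic substitution
                step = 0.0 if obs[i] == cand[j] else 1.0
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + step)
            if i < n:            # deletion
                d[i + 1][j] = min(d[i + 1][j], cur + 1.0)
            if j < m:            # insertion
                d[i][j + 1] = min(d[i][j + 1], cur + 1.0)
            for (a, b), cost in CONFUSIONS.items():  # cheap confusion
                if obs[i:i + len(a)] == a and cand[j:j + len(b)] == b:
                    d[i + len(a)][j + len(b)] = min(
                        d[i + len(a)][j + len(b)], cur + cost)
    return d[n][m]

def best_correction(observed: str, lexicon):
    """Pick the lexicon word with the cheapest confusion-weighted path."""
    return min(lexicon, key=lambda w: confusion_distance(observed, w))
```

With this weighting, "rnodern" corrects to "modern" at cost 0.2, whereas a plain edit distance would see two full edits and rank it no better than other two-edit candidates.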
Günther Mühlberger presented a paper on new approaches to the correction of noisy OCR. Their approach is to let users correct the misrecognized content themselves. As stated in the abstract of his paper, the interface offers two main features for this task: to improve precision, it provides a view of the word snippets matching a specific search string and the possibility of validating each word snippet with a simple yes/no decision.
Christoph Ringlstetter presented PoCoTo, an open-source system for efficient interactive post-correction, and I found several parallels between their tool and our OCR editor. Most of all, I was impressed by the feature that automatically corrects repeated mistakes in batches. As the abstract puts it: "Language technology used in the background takes orthographic variation in historical language into account. Using this knowledge, the tool visualizes possible OCR errors and series of similar possible OCR errors in a given input document. Error series can be corrected in one shot." This is a typical case with our material as well: no one is interested in correcting the word 'dog' for the 12,345th time. We have got to implement this one in ours too, so Christoph, don't forget to drop me a line by e-mail.
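The "one shot" part is what saves the volunteer's sanity, and the mechanics are simple enough to sketch. This is my own minimal illustration of the idea, not PoCoTo's implementation: group identical suspicious tokens across the document, let the user confirm a fix once, then apply it everywhere.

```python
# Minimal sketch of one-shot batch correction of recurring OCR errors:
# count tokens missing from the lexicon, then fix each error series
# with a single confirmed replacement. My own illustration, not PoCoTo.

from collections import Counter

def group_errors(tokens, lexicon):
    """Count suspicious tokens (not in the lexicon) across the document."""
    return Counter(t for t in tokens if t.lower() not in lexicon)

def apply_batch_fix(tokens, wrong, right):
    """Replace every occurrence of one recurring error in a single pass."""
    return [right if t == wrong else t for t in tokens]
```

So if "dog" has been misrecognized as "d0g" 12,345 times, `group_errors` surfaces it once with its frequency, and a single yes from the user fixes all 12,345 occurrences via `apply_batch_fix`.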
The last session of the day dealt with the intellectual property rights (chiefly copyright) of digitised content. The major contribution of the speakers was related to the European directive on orphan works. See the EIFL guide on orphan works here.
Hey, that's it! I will call it a day and get some sleep in order to be keen, well-rested and fresh to tackle Day 2 of DATeCH 2014 and the Digitisation Days tomorrow. In this text, I have referred from time to time to the proceedings of the conference, which can be found here.