This time, the Digitization Project of Kindred Languages goes west and is attending the DATeCH 2014 conference and Digitisation Days in Madrid, 19-20 May. The Digitisation Days event is organised by the IMPACT Centre of Competence and Succeed, both funded by the European Union. Our team is attending this event in order to learn about the current state of the field and to discuss various aspects of the text enhancement of digitized material. See the full programme here. You may follow the comments made by conference attendees on Twitter with the hashtag #digidays. This is our recap of the presentations that took place on day 1.
Generally speaking, my thoughts after day 1 are more than positive. Reflecting on the day's presentations against the work done in our project, especially when it comes to the OCR editor, I can proudly say that we are on the right track. Project-wise, it seems that we have definitely solved some serious issues that are still haunting others, but it is good to see that we still have plenty to learn from the kindred projects presented today. It is a pity that we did not have the chance to contribute to the discussion by presenting a paper, but there will be other chances to do so.
The day took off with brief addresses by the hosting institutions and was followed directly by Session 1: Document Analysis & OCR.
The first paper was given by David Hébert of the University of Rouen/LITIS. He showcased a non-OCR-based solution for article extraction from old digitized newspapers. Recognition operates on the pixels of the digitized items with the help of a Conditional Random Field model. Basically, this means that the algorithm is able to assign entities to their correct, logical place and present the result as a logical structure.
The second speaker, Christian Clausner (PRImA), spoke on precise region description for document layout detection. According to Clausner, there is a need for precise region descriptions, and I do not disagree, as state-of-the-art OCR systems like ABBYY and Tesseract cannot tackle the problem of overlapping entities, such as pictures and columns. Clausner and colleagues have created a solution for this called polygonal fitting: first you create bitmasks, then you fill the wanted gaps or exclude the unwanted regions, then you trace the contour, and finally you create a polygon. It is a pity that the project website is down for maintenance, so additional information cannot be retrieved and linked here.
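The bitmask-fill-trace-polygon pipeline can be sketched in a few lines of plain Python. To be clear: this is my own simplified illustration of the idea, not PRImA's implementation, and a convex hull stands in for proper contour tracing.

```python
# Toy sketch of the polygonal-fitting pipeline: start from a bitmask,
# close small gaps (dilation + erosion), collect the region's cells,
# and fit a polygon. Convex hull used here as a crude stand-in for
# real contour tracing -- an assumption, not PRImA's actual method.

def dilate(mask):
    """3x3 binary dilation on a list-of-lists bitmask."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if 0 <= y + dy < h and 0 <= x + dx < w):
                out[y][x] = 1
    return out

def erode(mask):
    """3x3 binary erosion (keep a cell only if all neighbours are set)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if all(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if 0 <= y + dy < h and 0 <= x + dx < w):
                out[y][x] = 1
    return out

def close_gaps(mask):
    """Morphological closing: dilation followed by erosion fills gaps."""
    return erode(dilate(mask))

def convex_hull(points):
    """Andrew's monotone chain; returns the hull counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def fit_polygon(mask):
    """Close gaps, collect filled cells, and fit a polygon around them."""
    closed = close_gaps(mask)
    cells = [(x, y) for y, row in enumerate(closed)
             for x, v in enumerate(row) if v]
    return convex_hull(cells)
```

For a solid 4x4 region the fitted polygon collapses to the four corner points, which is exactly the compact description one wants instead of storing every pixel.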
Sajid Saleem was the hard-core text recognition man of the day. He presented a paper on the recognition of degraded ancient characters based on dense SIFT. For an old Slavist like me, who hated studying the Glagolitic alphabet during his studies, this was a breathtaking moment. Not only because the Missale Sinaiticum was brought into daylight character by character, but also because the SIFT-based "character trackback" is an impressive example of what can be done with really noisy data. They essentially need to reconstruct the characters pixel by pixel. During the first coffee break, I had a chance to ask Sajid whether they use any morphological analysis in the character (text) reconstruction, but he indicated that it is not possible with this material, since the characters blend too noisily into the background.
Session 2 after the coffee break was opened by Klaus Schulz of CIS, the Centre for Information and Language Processing at the University of Munich, who held a presentation on "Automated Assignment of Topics to OCRed Historical Texts". They have taken on the challenge of automatically computing all topics and fields that adequately describe the contents of a given document, and of arranging those topics into a hierarchical order. TopicZoom is the brand name, and additional information can be found here. This presentation reminded me that we have a similar case with our material and clients. Could researchers of kindred languages benefit from this tool when searching the semantics of Mordvinic newspapers of the 1920s, or could the Kirjallisuuspankki (The Finnish Literature Bank) project get something out of this? I am convinced that this method could gain wider audiences, among literature studies in particular, despite the fact that, according to Schulz, the primary use is to create better metadata on digitized content. A pity that it is available for German and English material only. At the moment.
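To make the idea of hierarchically ordered topics concrete, here is a deliberately naive sketch: topics matched by keyword, with ancestors pulled in so the result respects the hierarchy. The topic tree and keywords below are invented for illustration; TopicZoom's actual method is far richer than this.

```python
# Toy hierarchical topic assignment. TOPICS is an invented example
# hierarchy, not TopicZoom's taxonomy: each topic has keywords and a
# parent, and a matched topic always brings its ancestors along.

TOPICS = {
    "science":     {"keywords": {"experiment", "theory"},      "parent": None},
    "linguistics": {"keywords": {"language", "orthography"},   "parent": "science"},
    "ocr":         {"keywords": {"recognition", "scan"},       "parent": "linguistics"},
}

def assign_topics(text):
    """Assign every topic whose keywords occur, plus its ancestors."""
    words = set(text.lower().split())
    hits = {t for t, spec in TOPICS.items() if spec["keywords"] & words}
    result = set()
    for t in hits:
        while t is not None:       # walk up to the root
            result.add(t)
            t = TOPICS[t]["parent"]
    return sorted(result)
```

A document mentioning "scan" and "orthography" would thus be tagged with ocr, linguistics and science, giving the hierarchical description of content the paper is after.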
Petar Mitankin's paper followed Schulz's. "An approach to unsupervised historical text normalisation" was the title, and he presented a statistical machine translation system that enables translation from a historical form of a language to the modern one. Something similar is required in our own project, in which we need to build search capabilities that allow searching within content in old orthography as well. For us, quite typical examples are Russian characters of the old orthography, but similar cases are found in other kindred languages. Could the normalisation handle transliteration as well and convert text from Cyrillic to Latin letters according to the ISO 9 transliteration scheme? That could be beneficial for library data at least.
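The transliteration side of that wish is actually the easy half, since ISO 9 is a one-to-one character mapping. Below is a minimal sketch with a partial mapping table of my own (including a couple of pre-reform letters such as the yat); a complete ISO 9 implementation would cover the full alphabet.

```python
# Minimal ISO 9 style Cyrillic-to-Latin transliteration. The table is
# partial and written from memory for illustration -- verify against
# the ISO 9:1995 standard before using it on real library data.

ISO9 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "ш": "š", "щ": "ŝ", "ы": "y", "э": "è", "ю": "û", "я": "â",
    "ё": "ë",
    # pre-1918 orthography, typical of our Russian-era material
    "ѣ": "ě",   # yat
    "і": "ì",   # decimal i
}

def transliterate(text: str) -> str:
    """Map each Cyrillic character to its Latin equivalent, keeping case."""
    out = []
    for ch in text:
        low = ch.lower()
        if low in ISO9:
            lat = ISO9[low]
            out.append(lat.upper() if ch.isupper() else lat)
        else:
            out.append(ch)  # pass through anything unmapped
    return "".join(out)
```

Because the mapping is reversible, the same table could serve search in both directions: a query in modern Latin transliteration could be matched against old-orthography Cyrillic text and vice versa.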
Arianna Ciula and Øyvind Eide performed the first duet of DATeCH 2014. Their presentation grasped the area of ontologies, TEI and CRM in particular. Once again, I started to think about Finnish parallels to this problem area, and of course one was found in the National Digital Library project, which has improved the availability and usability of the electronic materials of the GLAM sector.
Session 3 focused on post-correction, and for me this session was the highlight of the day. Evershed from Australia presented their logic for better OCR correction of newspapers. Their concise web page says it all, but in brief, OverProof also uses a visual representation model to evaluate the likelihood of various OCR errors.
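The core intuition, as I understood it, is that some character confusions are visually far more plausible than others ("rn" misread as "m", "l" as "1"), so candidate corrections should be ranked accordingly. OverProof's real model is not mine to reproduce; the sketch below just illustrates the idea with an invented confusion table plugged into an edit-distance DP.

```python
# Toy confusion-weighted correction ranking. The CONFUSIONS table and
# its costs are invented for illustration: visually plausible OCR
# confusions get a low cost, generic edits cost 1.0.

CONFUSIONS = {
    ("rn", "m"): 0.2, ("m", "rn"): 0.2,
    ("l", "1"): 0.3, ("1", "l"): 0.3,
    ("c", "e"): 0.5, ("e", "c"): 0.5,
}

def confusion_distance(obs: str, cand: str) -> float:
    """Edit distance where known OCR confusions are cheap substitutions."""
    INF = float("inf")
    n, m = len(obs), len(cand)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == INF:
                continue
            if i < n and j < m:  # match or generic substitution
                step = 0.0 if obs[i] == cand[j] else 1.0
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + step)
            if i < n:            # deletion
                d[i + 1][j] = min(d[i + 1][j], cur + 1.0)
            if j < m:            # insertion
                d[i][j + 1] = min(d[i][j + 1], cur + 1.0)
            for (a, b), cost in CONFUSIONS.items():  # cheap confusion
                if obs[i:i + len(a)] == a and cand[j:j + len(b)] == b:
                    d[i + len(a)][j + len(b)] = min(
                        d[i + len(a)][j + len(b)], cur + cost)
    return d[n][m]

def best_correction(observed: str, lexicon):
    """Pick the lexicon word with the cheapest confusion-weighted path."""
    return min(lexicon, key=lambda w: confusion_distance(observed, w))
```

With this weighting, "rnodern" corrects to "modern" at cost 0.2, whereas a plain edit distance would see two full edits and rank it no better than other two-edit candidates.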
Günther Mühlberger presented a paper on new approaches to the correction of noisy OCR. Their approach is to let users correct the misrecognized content themselves. As stated in the abstract of his paper, the interface offers two main features for this task: to improve precision, it provides a view of the word snippets matching a specific search string and the possibility of validating each word snippet with a simple yes/no decision.
Christoph Ringlstetter presented PoCoTo, an open-source system for efficient interactive post-correction, and I found several parallels between their tool and our OCR editor. Most of all, I was impressed by the feature that automatically corrects repeated mistakes in batches. As the abstract puts it: "Language technology used in the background takes orthographic variation in historical language into account. Using this knowledge, the tool visualizes possible OCR errors and series of similar possible OCR errors in a given input document. Error series can be corrected in one shot." This is a typical case with our material as well: no one is interested in correcting the word 'dog' for the 12,345th time. We have got to implement this one in ours too, so Christoph, don't forget to drop me a line by e-mail.
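The "one shot" part is what saves the volunteer's sanity, and the mechanics are simple enough to sketch. This is my own minimal illustration of the idea, not PoCoTo's implementation: group identical suspicious tokens across the document, let the user confirm a fix once, then apply it everywhere.

```python
# Minimal sketch of one-shot batch correction of recurring OCR errors:
# count tokens missing from the lexicon, then fix each error series
# with a single confirmed replacement. My own illustration, not PoCoTo.

from collections import Counter

def group_errors(tokens, lexicon):
    """Count suspicious tokens (not in the lexicon) across the document."""
    return Counter(t for t in tokens if t.lower() not in lexicon)

def apply_batch_fix(tokens, wrong, right):
    """Replace every occurrence of one recurring error in a single pass."""
    return [right if t == wrong else t for t in tokens]
```

So if "dog" has been misrecognized as "d0g" 12,345 times, `group_errors` surfaces it once with its frequency, and a single yes from the user fixes all 12,345 occurrences via `apply_batch_fix`.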
The last session of the day dealt with the intellectual property rights (chiefly copyright) of digitised content. The major contribution of the speakers was related to the European directive on orphan works. See the EIFL guide on orphan works here.
Hey, that's it! I will call it a day and get some sleep in order to be keen, well-rested and fresh to tackle Day 2 of DATeCH 2014 and the Digitisation Days tomorrow. In this text, I have referred from time to time to the proceedings of the conference, which can be found here.