Stage One Completed, Second Round to Begin

The continuation phase of the Digitization Project of Kindred Languages (2014-2015) took off in January 2014. Since then, we have conducted the copyright clearance for all the material that will be digitized and published by the end of this year. Naturally, we have also signed the necessary agreements with the National Library of Russia on the digitization of the material. Not to mention, a great deal of work has been done behind the scenes: the development of our OCR editor has taken a step forward, and plenty of time has been spent on the post-production of the material here in Helsinki. By mid-July, we had published exactly 400 new monographs in Khanty, Mansi, Hill and Meadow Mari, Nenets, Selkup, Komi-Permyak, Komi-Zyrian and Udmurt in our Fenno-Ugrica collection. We are more than glad to have completed the first round of releases.

The next stage is just around the corner, and more action is to come: at the beginning of August we will start releasing the biggest set of material so far. This set is colloquially called the Parallel Titles and consists of more than 128 books in various translations into Uralic languages. All in all, the set comprises more than 620 items and 50 000 pages relating to 25 different subjects, ranging from the natural sciences to politics and from technology to children’s literature. Most of the material was originally published in the 1920s and 1930s, and it offers a great view of the history of the Kindred Peoples: their education, society, linguistic development and so on. The full reference list is available here. The majority of the digitization work in Saint Petersburg has already been completed, but the post-production and cataloguing tasks will take some time – we anticipate that the first of the Parallel Titles will be available in Fenno-Ugrica in early August and that the work will be completed by the end of September.

We are also taking some new steps with our services. In addition to releasing searchable PDFs, we have started to offer page-level XML files for each item to those who are interested; see the example here. The post-correction of the material has also begun: the Ingrian material is first in line, and we are using crowdsourcing, or perhaps even nichesourcing, to enhance the OCR’d text for the benefit of linguistic research. For more details on our crowdsourcing/nichesourcing methods, see my presentation here or read my article (in Finnish) in the latest Kansalliskirjasto magazine here (pp. 38-40).
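
For those who would like to process the page-level XML programmatically, a minimal sketch along the following lines may be a useful starting point. It assumes ALTO-style OCR XML, in which the recognized words are stored as String elements with a CONTENT attribute; the file name and the exact schema are assumptions on my part, so please check them against the files you actually download.

    # Minimal sketch: pull the plain text out of one page-level XML file.
    # Assumes ALTO-style OCR XML (words as <String CONTENT="..."> inside <TextLine>);
    # the file name below is purely illustrative.
    import xml.etree.ElementTree as ET

    def page_text(path):
        tree = ET.parse(path)
        lines = []
        for elem in tree.iter():
            # Compare local names only, since the ALTO namespace URI varies by version.
            if elem.tag.rsplit('}', 1)[-1] == 'TextLine':
                words = [child.get('CONTENT', '')
                         for child in elem
                         if child.tag.rsplit('}', 1)[-1] == 'String']
                lines.append(' '.join(words))
        return '\n'.join(lines)

    if __name__ == '__main__':
        print(page_text('page_0001.xml'))  # hypothetical file name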

If you have any questions about our project or the materials that have been or will be published, don’t hesitate to contact us by e-mail: write to kk-fennougrica@helsinki.fi for consultation. Also keep an eye on our VKontakte page for news flashes in Russian and on the project manager’s Twitter account for news in English.

Wishing you all a pleasant and sunny summer,

Jussi-Pekka

2 thoughts on “Stage One Completed, Second Round to Begin”

  1. Having so many books is great. But why is the quality so terrible?
    For example, I digitized this book 3 years ago: http://fennougrica.kansalliskirjasto.fi/handle/10024/67159
    My version can be found here: http://libgen.org/book/index.php?md5=42195827209e36836a25dc2f24d26d82
    This Nenets dictionary on your site is quite readable because it uses a big font and doesn’t have diacritic marks.
    But the other books? Books using diacritics are completely unreadable.
    You should change the compression algorithm and somehow improve the raw scans. For this I use free software called ScanKromsator. I understand this would take much more time. But if it is technically impossible, why don’t you upload the raw scans in perfect quality together with a lighter PDF version, for future thorough processing? This technique is used on https://archive.org/details/texts, where large, best-quality JPEG versions of the books are also available.
    Thanks for your attention.

    • Thanks for your question, and greetings to Murmansk,

      To some extent, I do agree with your criticism. Some of the items could have better OCR text available, but in general we don’t consider our quality terrible. Let me explain the process.

      The items have been scanned as greyscale TIFFs in order to provide better quality for OCR. The OCR’d texts are created with the help of state-of-the-art software, i.e. ABBYY FineReader, based either on its built-in dictionaries or on custom-made regular expressions. If we rely on the built-in dictionaries, there will obviously be quite a few errors in the text, especially for the languages that were written in old orthographies. In some cases the regular expressions can provide better quality, but many mistakes will still remain. We are well aware of this, and we could improve the OCR quality in ABBYY FineReader by correcting the text word by word, chapter by chapter, page by page and so on. However, that would be far too time-consuming for us.
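
      To give a rough idea of what the regular-expression approach looks like in practice, here is a small illustrative sketch. The substitution patterns below are invented for the sake of the example and are not the actual correction rules we use with ABBYY FineReader.

          # Illustrative sketch of regex-based OCR post-correction.
          # The patterns are hypothetical examples of typical OCR confusions,
          # not the project's actual, language-specific rules.
          import re

          CORRECTIONS = [
              (re.compile(r'(?<=\d)O'), '0'),   # letter O misread inside a number
              (re.compile(r'(?<=\d)l'), '1'),   # lowercase l misread inside a number
              (re.compile(r'\bteh\b'), 'the'),  # transposed letters
          ]

          def correct(text):
              for pattern, replacement in CORRECTIONS:
                  text = pattern.sub(replacement, text)
              return text

          print(correct('Printed in 193O, teh first edition had 2l4 pages.'))
          # -> 'Printed in 1930, the first edition had 214 pages.'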

      Instead of spending too much time on diacritics and the like at this stage, we focus on enhancing the text in another way. With the help of our OCR editor, volunteers and others can edit the text in order to provide better word-level quality for linguistic research. For the OCR editor, see my colleague’s blog entry here: http://blogs.helsinki.fi/fennougrica/2014/02/21/ocr-text-editor/ and see my recent presentation on the project and text editing (crowdsourcing) here: http://www.doria.fi/handle/10024/99373

      Yours &c.,
      Jussi-Pekka Hakkarainen
      Project Manager
