More Data, Improved Quality

As of March 2016, we have started to release revised versions of the monograph items in the Fenno-Ugrica collection. Based on the vocabulary gathered in the proofreading campaign during summer 2015, we have enhanced our OCR (optical character recognition) procedures and managed to improve the quality of the monographs.

In addition to the revised PDFs, we are releasing the datasets of the digitized monograph items. Alongside the PDF copy, you will find a ZIP package that contains all the TIFF, XML, TXT and CSV files of each individual item. The TXT and CSV files are quite reliable when it comes to word accuracy, but unfortunately the XML files are not equally good. This is due to a dual procedure at our end: the XML files are created with a different programme than the other files, and its vocabularies cannot be customized to meet the standards. This problem occurs especially with texts originally written in the Latin alphabet in Khanty, Mansi and Komi. The Cyrillic versions, however, are fairly good. If you are unhappy with the quality of the data, you may find the TIFF files useful for your work.
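For readers who want to check what a downloaded package contains before working with it, here is a minimal sketch in Python. It only assumes that the package is an ordinary ZIP archive; the function name `group_package_files` and the file name `monograph.zip` are made up for illustration, and the actual file names and folder layout inside our packages may differ (TIFF images, for example, may carry either a `.tif` or a `.tiff` extension).

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_package_files(package):
    """Group the member names of a ZIP package by upper-cased file extension.

    `package` may be a path to a ZIP file or an open binary file object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(package) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                # Directory entries carry no data, so skip them.
                continue
            suffix = PurePosixPath(name).suffix.lstrip(".").upper()
            groups[suffix or "(none)"].append(name)
    return dict(groups)


if __name__ == "__main__":
    # "monograph.zip" is a hypothetical file name, not one of our real packages.
    for extension, names in sorted(group_package_files("monograph.zip").items()):
        print(f"{extension}: {len(names)} file(s)")
```

Grouping by extension makes it easy to pick out the TXT and CSV files, which are the more reliable ones, and to set the XML files aside.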

At the moment, the enhanced PDFs and data packages are available in Khanty, Mansi, Nenets and Selkup. The improved PDFs and datasets for the other languages will be made available in April 2016.

Yours &c.,
Jussi-Pekka
