As of March 2016, we have started to release revised versions of the monograph items in the Fenno-Ugrica collection. Based on the vocabulary gathered in the proofreading campaign during summer 2015, we have enhanced our OCR (optical character recognition) procedures and managed to improve the quality of the monographs.
In addition to the revised PDFs, we are releasing the datasets of the digitized monograph items. Alongside the PDF copy, you will find a ZIP package that contains the TIFF, XML, TXT and CSV files of each individual item. The TXT and CSV files are quite reliable when it comes to word accuracy, but unfortunately the XML files aren't equally good. This is due to a dual procedure at our end: the XML files are created with a different programme than the other files, and its vocabularies cannot be customized to meet the standards. This problem occurs especially with texts that were originally written in the Latin alphabet in Khanty, Mansi and Komi. The Cyrillic versions, however, are somewhat good. If you are unhappy with the quality of the data, you may find the TIFF files useful for your work.
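For readers who want to work with the ZIP packages programmatically, here is a minimal Python sketch that lists the contents of one package grouped by file type. The internal layout and file naming of the packages are not described here, so the function name `group_members_by_type` and the assumption that the files can be told apart by their extensions (with `.tif` treated the same as `.tiff`) are ours, not a documented part of the collection.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_members_by_type(zip_source):
    """Map each lowercase file extension (e.g. 'tiff') to the member names in the ZIP.

    zip_source may be a path or a binary file-like object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            suffix = PurePosixPath(info.filename).suffix.lower().lstrip(".")
            # Assumption: .tif and .tiff both denote TIFF page images.
            if suffix == "tif":
                suffix = "tiff"
            groups[suffix].append(info.filename)
    return dict(groups)
```

Given the path to a downloaded package, the returned dictionary lets you pick, for example, only the `txt` and `csv` entries if word accuracy matters, or the `tiff` entries if you prefer to run your own recognition on the page images.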