Anis Moubarik, an information system specialist at the National Library and a member of the DPKL team, will walk you through what happens to a digitized book in our post-production process. During the project, Anis has been in charge of creating both the OCR’ed PDFs that are available in our Fenno-Ugrica collection and the ALTO XML files for each book, which are made available for editing in Revizor, our text editor for enhancing the data.
Our materials have mainly come from the National Library of Russia, located in St. Petersburg. Our Russian colleagues have done the work of digitizing the Finno-Ugric material to a lossless TIFF format. Getting terabytes of data reliably from St. Petersburg to Helsinki is not always a simple task. Fortunately, we had the opportunity to use an SSH tunnel to the Russian servers, and I used tools such as rsync to make our life easier and transfer the materials to us quickly and reliably.
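A transfer like this can be sketched in Python as follows. This is only an illustration of the idea, not our actual script; the host names, paths, and SSH port are hypothetical.

```python
import subprocess

def build_rsync_command(src, dest, ssh_port=2222):
    """Build an rsync command that runs over SSH.

    -a         archive mode: preserve permissions and timestamps
    -z         compress data in transit
    --partial  keep partially transferred files, so an
               interrupted transfer of a large TIFF can resume
    """
    return [
        "rsync", "-az", "--partial",
        "-e", f"ssh -p {ssh_port}",   # tunnel the transfer through SSH
        src, dest,
    ]

# Hypothetical source and destination paths:
cmd = build_rsync_command(
    "user@spb-server.example:/data/tiffs/",
    "/srv/fenno-ugrica/incoming/",
)
# subprocess.run(cmd, check=True)  # uncomment to actually transfer
```

Because `--partial` keeps incomplete files, a dropped connection only costs the unfinished part of the current file rather than the whole batch.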
Once we get the images, we use ABBYY recognition software (Recognition Server 3.5 and FineReader) to produce searchable PDFs and ALTO XML files. Usually I process the material language by language, in batches drawn from every work, so that OCRing is as quick as possible. One of the problems with the material itself is changing and evolving orthographies. Some languages, like Nenets, have had their orthography and even their alphabet changed (in the Nenets case from Latin to Cyrillic), and we have material in both, which poses big challenges for OCR. Our OCR software has wordlists for many of the minority languages, but unfortunately they do not take the changing alphabets and orthographies into account.
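For readers unfamiliar with ALTO: it is an XML format that records each recognized word together with its position and confidence. Here is a minimal sketch of pulling the text out of an ALTO file with Python's standard library; the element names follow the ALTO v2 schema, but the sample fragment itself is invented.

```python
import xml.etree.ElementTree as ET

# ALTO v2 namespace, used to qualify element names when searching.
ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"

# A tiny invented ALTO fragment: one text line with two words.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Tundra" WC="0.95"/>
      <SP/>
      <String CONTENT="Nenets" WC="0.88"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def extract_lines(alto_xml):
    """Return the recognized text, one string per TextLine element."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iter(ALTO_NS + "TextLine"):
        # Each word sits in the CONTENT attribute of a String element.
        words = [s.get("CONTENT") for s in line.iter(ALTO_NS + "String")]
        lines.append(" ".join(words))
    return lines

print(extract_lines(sample))
```

The `WC` (word confidence) attribute is what an editor like Revizor can use to highlight words the OCR engine was unsure about.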
There is little we can do about this, other than providing some self-generated wordlists to our Recognition Server and building our own custom languages in FineReader. Hopefully, by correcting the OCR mistakes in our editor, we can contribute to making OCR software better for these languages.
Some languages, like Komi, are not supported at all, so my approach is to recognize them with FineReader, which lets us define our own language and produce PDFs. What is frustrating about FineReader is its slowness, since recognition is done on a document-by-document (image/page) basis, and it does not support producing ALTO XML files at all. In order to make the data suitable for Revizor, we have to either recognize the material with a similar language in Recognition Server, and sacrifice quality, or skip producing XMLs altogether.
After recognition, the resulting PDFs are delivered to a librarian, who catalogs them and uploads them to the Fenno-Ugrica repository for you to view and use. The obvious problem with these files is that they contain errors, and that is the reason for Revizor. As was mentioned in our previous blog entry, Revizor reads ALTO XML files, so I upload the XMLs and images to our server (again using an SSH tunnel), convert the images to JPG, and import them into their respective collections. Revizor helps us correct the errors from the automated OCRing. The corrected text can then be downloaded for reading, for research, or, as is our goal, for building wordlists and linking to corpus search interfaces such as Korp.
Information System Specialist