Brief technical overview of Revizor, the editor for correcting OCR text material

OCR’ed text often suffers from recognition errors, especially when the source material is old or otherwise in poor condition. These errors make it hard to rely on the text for building a corpus or word lists, and they make the source material less accessible for study or for use in other tooling for language researchers. Our OCR editor tries to solve this problem, or at least to contribute towards a solution.


Erzya Language Day, April 16th

Today we are celebrating Erzya Language Day. Erzya is one of the Uralic languages and is spoken in the Republic of Mordovia in Russia.

During the Digitization Project of Kindred Languages, we have paid special attention to materials published in the Mordvinic languages: Erzya, Moksha and Shoksha. Erzya was converted into a medium of popular education, enlightenment and dissemination of information pertinent to the developing political agenda of the Soviet state.
