Brief technical overview of Revizor, the editor for correcting OCR text material

OCR’ed text is often of poor quality because of errors introduced during optical recognition, especially when the source material is old or otherwise in bad condition. These errors make it hard to rely on the text for building a corpus or word lists, and they make the source material less accessible for study or for use in other tooling for language researchers. This is the problem that our OCR editor tries to solve, or at least to contribute a possible solution towards.

This post gives a brief but slightly more detailed overview of the technical architecture of Revizor, the OCR editor used in the Digitization Project of Kindred Languages. It also provides an update on the editor’s current status and shares some challenges that we could use help with in the future.

Architecture

The editor takes in ALTO XML files provided by OCR software, as well as high quality pictures of the source material as an aid for the user during the manual correction process. The Python back-end of the software serves up the data to anybody authorised to make edits. It is also responsible for exporting the end result, which might be the corrected ALTO XML, plain text versions or word lists built from selected works. The exported product might be further processed by linguistic tools or imported into “corpus” sites specifically meant for facilitating searching and dissemination.
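
As a rough illustration of the import and plain-text export steps described above, the sketch below reads the recognised words out of an ALTO document and joins them into lines of text. It is not Revizor's actual code: the function names and the sample document are made up for this example, and it relies only on the ALTO convention that each recognised word is a `String` element with a `CONTENT` attribute inside a `TextLine`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO sample (real files carry coordinates,
# styles and metadata as well).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Hyvää" WC="0.91"/>
            <SP/>
            <String CONTENT="päivää" WC="0.42"/>
          </TextLine>
          <TextLine>
            <String CONTENT="maailma"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_text):
    """Return one plain-text string per TextLine in the ALTO document."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only String children hold recognised words; SP elements are spacing.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    print("\n".join(alto_to_lines(SAMPLE_ALTO)))
```

Matching on the local tag name rather than a fixed namespace is a deliberate simplification here, since different OCR packages emit different ALTO schema versions.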

The front-end of the editor is where manual changes are made by real human beings, as opposed to the automatic bulk processing in the back-end. In a two-pane window, the user can check the original work and make corrections to the text, as well as mark the language or relevance of words; this helps the back-end build accurate word lists per language. This process is especially important for languages with a small corpus, as researchers and OCR software might not have any word lists or dictionaries to work from. The accuracy of the corrected text therefore directly influences the quality of future digitisation through this feedback loop.
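
To make the idea of per-language word lists more concrete, here is a minimal sketch of how words marked by the user could be grouped on export. The data model is an assumption for illustration only: each token is a hypothetical `(word, language, relevant)` tuple, and `build_word_lists` is not a function from the Revizor code base.

```python
from collections import defaultdict


def build_word_lists(tokens):
    """Group relevant words by the language the user marked them with.

    `tokens` is an iterable of (word, language, relevant) tuples; this
    shape is an assumption made for the example.
    """
    word_lists = defaultdict(set)
    for word, language, relevant in tokens:
        # Skip words marked as irrelevant (page numbers, noise) or
        # words without a language tag.
        if relevant and language:
            word_lists[language].add(word.lower())
    # Sort each list so the export is stable between runs.
    return {lang: sorted(words) for lang, words in word_lists.items()}


if __name__ == "__main__":
    corrected = [
        ("Kala", "fin", True),
        ("kala", "fin", True),
        ("vesi", "fin", True),
        ("1923", "", False),
    ]
    print(build_word_lists(corrected))
```

Lower-casing before deduplication is a simplification; a real export might need to preserve case for proper nouns.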

Finally, as the time and funds allotted dwindle and the project draws to a close, we realise that although the editor works well, the code did not receive all the features and polish that would be nice to have. We hope that by making the source of Revizor available online, other interested parties can help improve the editor and adapt it to their own specific needs. Apart from general code and layout polish, the features that are still being implemented or might not make the deadline are automatic correction of common OCR mistakes, as well as stemming and lemmatisation, or at least smarter ways of narrowing down a large number of inflected word forms to a more concise word list without necessarily knowing the intricate details of the source document’s language. Most of these improvements would likely require the help of computational linguists, detailed knowledge of language-specific tools and stemmers, and ideas on how to plug these tools efficiently into the editor’s correction and export process.
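
One possible direction for the automatic correction of common OCR mistakes, sketched here purely as an illustration and not as a description of anything Revizor implements, is to try a small table of typical character confusions and accept a candidate only if it appears in an already corrected word list. The confusion table and the function name below are assumptions.

```python
# Hypothetical table of (misread, intended) character sequences.
CONFUSIONS = [("rn", "m"), ("1", "l"), ("0", "o"), ("vv", "w")]


def suggest_correction(word, known_words):
    """Return a known word reachable by one confusion rule, or None.

    Simplification: each rule replaces every occurrence of the misread
    sequence at once, and rules are not combined.
    """
    if word in known_words:
        return word
    for wrong, right in CONFUSIONS:
        if wrong in word:
            candidate = word.replace(wrong, right)
            if candidate in known_words:
                return candidate
    return None


if __name__ == "__main__":
    known = {"milk", "hello"}
    for raw in ["rnilk", "he11o", "xyz"]:
        print(raw, "->", suggest_correction(raw, known))
```

Because suggestions are checked against words that users have already verified, this kind of approach would improve as more text is corrected, but it still needs human confirmation for anything it proposes.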

Here’s a link to the National Library of Finland’s GitHub page, where the code of Revizor (OCRUI) and many other projects is or will be made available.

Wouter van Hemel, Information System Specialist
