Subject: Adobe Connect - Chat Transcript from Finna-kokoushuone (OCR Webinar)
Tekniikka: (3/5/2014 12:46) Good afternoon and welcome to the webinar!
Tekniikka: (12:50) If you encounter any technical difficulties, please run the audio setup wizard via the meeting menu (and install the Adobe Connect Add-in) first. If the difficulties persist, please inform me about it in this chat.
Tekniikka: (12:50) Let's test the voice soon as well!
mpiwg_kthoden: (12:50) nice to be here, hyvää päivää [good day]
Tekniikka: (12:52) If there are any questions or comments, I'll write them down from this chat window and read them aloud here (after the presentations).
mpiwg_kthoden: (12:56) should I hear voice already?
Tech. hosting: (12:56) Not quite yet, just a moment..
mpiwg_kthoden: (12:56) ok
Esa-Pekka Keskitalo: (12:56) loud and clear
mpiwg_kthoden: (12:56) very good
Tech. hosting: (12:58) Thanks! The mics are quiet for a while until we begin!
Michael Rießler: (13:00) is there a problem with the audio?
Michael Rießler: (13:00) now it works
Jeremy Bradley: (13:00) OK, audio here now
Tech. hosting: (13:01) Yep, the mics were still off.
Anis Moubarik: (13:01) Good afternoon, I'm Anis, one of the developers of the ocrui editor. I'll answer questions here in the chat room.
Julia: (13:12) Is the conference being recorded? Can we get the recording after the event?
Tech. hosting: (13:12) Yes, we'll publish the link afterwards!
Jeremy Bradley: (13:19) Not sure how to do this!
Tech. hosting: (13:19) You can write your questions directly in this chat
Tech. hosting: (13:19) And we'll read them aloud here.
Jeremy Bradley: (13:19) is there a way to use my microphone?
Tech. hosting: (13:20) Sorry, that's not possible.
Jeremy Bradley: (13:20) OK: Do you know if there are any plans to digitize the dialect collections from the turn of the 19th/20th century?
Jeremy Bradley: (13:22) Thanks!
andrewboltachev: (13:24) I have the question to Jussi-Pekka Hakkarainen
Tech. hosting: (13:24) You can write your question here, I'll pass it on to Jussi-Pekka!
JPH: (13:25) Andrew - just type it over here, or on to the KIWI page, ok?
andrewboltachev: (13:28) Do you have an opportunity to digitize the national text in the Roman language?
andrewboltachev: (13:29) sorry. I meant latin script
andrewboltachev: (13:29) i.e. one that was used by langs of Russia
andrewboltachev: (13:29) in 1920-s-1930s
andrewboltachev: (13:29) e.g. for Komi
JPH: (13:29) Do you mean that we could make a transliteration out of Cyrillic text?
Walter: (13:29) Python 2.x or Python 3.x version?
TuulaP: (13:30) Is the python code openly available e.g. in github etc?
Michael Rießler: (13:30) a…boltchaev probably means the so-called unified northern alphabet
Michael Rießler: (13:31) which was used in the early SU for writing several of the indigenous languages
Tech. hosting: (13:31) Walter and Tuula, I saved your questions and will ask them after the presentation, thanks!
JPH: (13:31) To Walter and TuulaP: Check the comments of this blog entry: http://blogs.helsinki.fi/fennougrica/2014/02/21/ocr-text-editor/#more-223
Michael Rießler: (13:32) it is more or less one similar (latin-based) orthography used for several languages
andrewboltachev: (13:32) JPH: I mean is core of your system, that does actual recognition, able to work with such exotic scripts like Komi latin yes, that even doesn't exist in Unicode
JPH: (13:34) No, at the moment we are limited to ABBYY Recognition Server 3.5 and this is not an available option for us at the moment, but something that could be developed in the course of Google Tesseract - gotta ask that from Jack Reuter too
JPH: (13:34) > Rueter, not Reuter - sorry
Michael Rießler: (13:35) a question related to it: what about the Finnougric Phonetic Alphabet
niko: (13:35) or are you talking about Molodcov alphabet? http://kv.wikipedia.org/wiki/%D0%9C%D0%BE%D0%BB%D0%BE%D0%B4%D1%86%D0%BE%D0%B2_%D0%B0%D0%BD%D0%B1%D1%83%D1%80
JPH: (13:36) Michael R: No, not at the moment, but definitely something that we should look into in future
Enye Lav: (13:37) The Komi Latin is the only question for us
JPH: (13:37) rule of thumb: if it is in Unicode - it is possible to add in OCR editor
Jeremy Bradley: (13:38) Finnougric Phonetic Alphabet is kind of a borderline case
JPH: (13:38) Önö: we need to dig into this in detail
Jeremy Bradley: (13:38) In its simplest form, it is fully supported by Unicode. But, the idiosyncratic variants used by individual researchers will go way beyond Unicode at times.
Tech. hosting: (13:39) I enlarged the chat window, so you can discuss better in here.
JPH: (13:40) to Jeremy + KomiKyv staff: Tesseract could be a solution for this, but we need to investigate that more in future - Jack will have workshop on transducers in June, so maybe we could find a solution during that session
Tech. hosting: (13:40) Please, if you have questions for the presenter, add something like "read aloud" in front of your comment/question, so I know which comments I should read aloud. Thank you!
Tech. hosting: (13:41) This chat log will also be present in the recording of this event, by the way!
JPH: (13:41) Thanks Mr. Tech
Michael Rießler: (13:48) "read aloud" I am missing some very general intro in this presentation (before going into technical details): What is the editor? Where is it and what does it look like? What can I as a user do with it?
Julia: (13:48) Where will I find the recording of the conference?
Tech. hosting: (13:49) Thanks Michael, your question has been stored!
JPH: (13:49) Julia: We will link the recording to our blog site, but you'll be informed by e-mail too
Julia: (13:51) Thanks
Anis Moubarik: (13:53) Michael, the editor is available at http://ocrui-kk.lib.helsinki.fi/ you need login credentials to actually use the editor, we'll get to that later.
Michael Rießler: (13:54) Anis: thanks!
andrewboltachev: (13:54) What is most intellectual part of your system? Are you using machine learning or some suchlike stuff?
Anis Moubarik: (13:55) We use ABBYY Recognition Server 3.5, so our editor itself doesn't use machine learning. I think the diffing algorithm is the most "intellectual" part of the system.
Anis Moubarik: (13:57) and you can check the editor code at github, https://github.com/juhovuori/ocrui-client
niko: (13:58) so is the idea that I can upload my scanned page to this editor and it does an OCR for me, which I then go and correct in the editor? if you say it uses ABBYY Recognition Server 3.5, what does that mean? What is the advantage of using this in comparison to doing OCR in ABBYY FineReader which you can anyway teach quite well to recognise the particularities of specific script etc.?
niko: (13:59) You can ignore my question if it is coming later!
JPH: (14:00) NIKO: Sure, but have a look at the background article in our blog, it might give you some additional information
Anis Moubarik: (14:01) niko - the basic idea is to correct already OCR'd text, not to OCR new material.
niko: (14:01) ok
Anis Moubarik: (14:02) ABBYY is just the software we're using to OCR
Jeremy Bradley: (14:06) I have a fairly long and abstract question relating to Jussi-Pekka's talk. If it's too lengthy for the already tight schedule, it's no problem, it's probably too vague for a webinar with a very concrete purpose. Can get back to it later. But, just in case: https://dl.dropboxusercontent.com/u/40043114/temp/mari-rights.html
Tech. hosting: (14:07) OK, thanks! That is indeed a long question.
Jeremy Bradley: (14:07) I'm a fast typer
JPH: (14:09) Jeremy: Thanks for your question - I am trying to formulate my answer later today, ok? Will contact you by e-mail regarding your question, or alternatively we could discuss this in our blog.
Jeremy Bradley: (14:10) OK!
JPH: (14:10) Correction: not in our blog, but in KIWI page
JPH: (14:10) https://www.kiwi.fi/display/ocreditori/OCR+Webinar+2013-03-05
Anis Moubarik: (14:10) JPH is Jussi-Pekka if you're wondering
Michael Rießler: (14:10) discussing JB's question in public would be much appreciated
JPH: (14:11) Michael - agreed.
andrewboltachev: (14:13) Does "subversioned" mean you are using SVN?
Tech. hosting: (14:14) Thanks Andrew, your question has been saved
Natalia: (14:21) Thanks.
Stephanie Williams 2: (14:42) This is fantastic!
Walter: (14:43) Can you mark some text blocks as being *originally* in italics?
Anis Moubarik: (14:43) Not for now, italics are reserved as "modified" word
Anis Moubarik: (14:44) but we're talking about it
Show: (14:44) http://ocrui-kk.lib.helsinki.fi/editor/#04f3c3b56cbf94d634cc3aab5dbdfb4b/3/87x68x0.33071816535908266xH // line 4 // What's wrong with that? Why did the coordinates disappear?
Walter: (14:44) Can you mark certain blocks as being "added" in ... like the library call number that has been written on the original?
Jeremy Bradley 2: (14:46) Agreed! One thing I could think of to make the interface a bit quicker to use: Maybe there could be a key combination that replaces a basic Latin or Cyrillic character with a "modified" version of it? In many cases, there will be only one. Like Mari: н > ҥ, о > ӧ, ы > ӹ, а > ӓ, у > ӱ. If there was a quick way to access these special characters, it would be a lot quicker than needing to resort to the special keyboard
Jeremy Bradley 2: (14:46) Agreed!: that it is fantastic
Anis Moubarik: (14:46) Walter - not yet, we're working on the functionality
Sampsa Holopainen: (14:47) I think Jeremy's suggestion is very important
mpiwg_kthoden: (14:48) is there an undo button?
Michael Rießler: (14:49) currently your project is working only with a closed set of languages. question 1) will there be added more languages later? question 2) if I already have scanned and text-recognized texts available for other languages (e.g. kildin saami), could i add these into your platform (although you have no OCR for them)?
mpiwg_kthoden: (14:49) http://ocrui-kk.lib.helsinki.fi/editor/#04f3c3a1c24a0e85832e6a830ab93523/16/1490.5395199999994x1170.1607999999992x1.2636006611570245xH
mpiwg_kthoden: (14:50) added a hyphen there. Should the box on the image reflect this? Can I enlarge the box there?
mpiwg_kthoden: (14:51) Can I see the XML source?
Anis Moubarik: (14:52) We have talked about an advanced-mode, where you can see the ALTO XML and edit it, but for now, no.
Stephanie Williams 2: (14:52) can you talk about the process for loading materials into the editor? are materials loaded local to the editor software (can it work with remote sources?)
andrewboltachev: (14:53) If we see e.g. page in Russian (cyrillic) in Veps book (latin) can we switch to version, OCRed as cyrillic?
Michael Rießler: (14:53) Anis: but can I export the ALTO XML?
Jack: (14:53) Jeremy: if you are asking about feeding unicode characters, that is not a problem when I access the editor on mac. My unicode input is automatically accepted by the editor.
Anis Moubarik: (14:54) Michael - we have the files on the server so exporting is trivial, but we need to work on importing the edited XML files back to the server
Anis Moubarik: (14:55) So you can't yet export it unfortunately
Jeremy Bradley 2: (14:56) Hm, no, not quite. I was more thinking that ... if the interface is designed to be maximally accessible with a Latin-only keyboard, it would be nice to have a way to access them with a key combination, rather than having to find them on the screen keyboard. Of course, it seems better to just have the right keyboard layout on the machine you use.
Tech. hosting: (14:57) I apologize in advance that some questions might get overlooked in this 'live' session because of all the activity, but every comment here and on the Kiwi page will be saved!
andrewboltachev: (14:57) 2. Are you using support dictionary for any languages?
Sampsa Holopainen: (14:58) I am editing the Moksha text now and there is a character ç the program can't recognize - but I cannot find this particular character in the inventory of Moksha special characters
mpiwg_kthoden: (14:58) yes, thanks. Question answered
Anis Moubarik: (14:58) Stephanie - we have a really rudimentary importing system, and it's being worked on too, in the future it should be possible to import from remote sources
Jeremy Bradley 2: (14:58) Some way to get the interface to accept that if you are using a Mari layout, that Alt-Gr + н = ҥ. Language specific shortcuts?
Michael Rießler: (14:58) Anis: So there is no export function built into the editor? If you want the texts to be useful for corpus linguistics your potential users need to bring them into their own database structure
Show: (14:59) http://blogs.helsinki.fi/fennougrica/
Anis Moubarik: (15:00) Michael - exactly, but it's basically as easy as just putting a link in the editor
Jeremy Bradley 2: (15:01) (to my question above: Latin-only, or core Cyrillic-only)
Michael Rießler: (15:03) Anis: and are you planning to put in the link?
Anis Moubarik: (15:04) Yes
Michael Rießler: (15:05) Anis: (connected to my other questions above) and would you be interested to receive more data from languages you are not (yet) working on?
Michael Rießler: (15:06) … like Kildin Saami, which I am working on right now
Anis Moubarik: (15:06) Maybe JP can answer that, but I'd say yes
Jeremy Bradley 2: (15:06) I think I've more than exceeded my quota of vague questions .. but, are there any plans to have older orthographies machine-transcribed into new ones, OR, a unified transcription system? To allow the analysis of different materials from different eras with the same analysis tools..
Jack: (15:09) Just put in a filter
Jack: (15:10) The language code on the newspaper should be mrj for Hill Mari
Enye Lav: (15:11) What's about support dictionary
Walter: (15:13) General Feedback 1: Don't forget the value that corrected transcriptions have for visually impaired users that depend on text-to-speech General Feedback 2: The major OCR tools all struggle with major languages if the scan is from a poor or second-generation (e.g. microform) version. This tool would still bring significant value for improving discovery.
Jack: (15:15) The front end is relatively open. The Giellatekno webdictionaries have been given an enhancement bug to allow NDS use on the texts in the browser. At present that solution has not been actualized. (cf. muter.oahpa.no; valks.oahpa.no; sanat.oahpa.no; vada.oahpa.no; kyv.oahpa.no, sanit.oahpa.no, saan.oahpa.no etc.)
Anis Moubarik: (15:17) If you have any comments or feedback, please send them to kk-fennougrica@helsinki.fi
Jack: (15:17) Enye: we are already looking into the use of fst:s in dictionary=wordform lists.
Stephanie Williams 2: (15:19) Great--thank you!
joshua wilbur: (15:20) thanks everyone for the interesting webinar!
Reima Välimäki: (15:21) Thank you very much, this has been very inspiring. I'm going to stay in contact
mpiwg_kthoden: (15:21) yes, thanks for showing this stuff. Kiitos hyvä! [Thank you!]
mpiwg_kthoden: (15:21) I'll discuss that.
Stephanie Williams 2: (15:22) Thanks everyone!
Anna Biström: (15:22) Thank you, great to get a hands-on experience on how it works.
Tech. hosting: (15:23) Thank you for participating! The recording will most likely be available tomorrow!
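
The shortcut Jeremy Bradley proposed in the chat (14:46 and 14:58), a key combination that swaps a base character for its single "modified" variant, could look roughly like the sketch below. It only uses the Mari pairs he listed. The names MARI_SHORTCUTS and apply_shortcut are hypothetical and are not taken from the ocrui code, and how the editor would bind this to a key combination is left open.

```python
# Hypothetical shortcut table: base Cyrillic letter -> Mari variant.
# The pairs are the ones listed in the chat; the names are illustrative only.
MARI_SHORTCUTS = {
    "н": "ҥ",
    "о": "ӧ",
    "ы": "ӹ",
    "а": "ӓ",
    "у": "ӱ",
}


def apply_shortcut(text, cursor, table=MARI_SHORTCUTS):
    """Replace the character just before the cursor with its modified variant.

    The text is returned unchanged when there is no character before the
    cursor or when that character has no entry in the table.
    """
    if cursor <= 0 or cursor > len(text):
        return text
    base = text[cursor - 1]
    # Look up the lowercase form so capital letters are covered as well.
    variant = table.get(base.lower())
    if variant is None:
        return text
    if base.isupper():
        variant = variant.upper()
    return text[:cursor - 1] + variant + text[cursor:]
```

Calling apply_shortcut("ан", 2) would turn the final н into ҥ. A per-language table like this would also cover the "language specific shortcuts" idea, since each language could ship its own mapping.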