DATeCH2014 & Digitisation Days, 19-20 May, Madrid: Day 2

I headed to the second day of Digitisation Days full of energy. As the father of a six-month-old child, a long sleep in the hotel bed was a blessing indeed. A morning coffee with some nice bread, without any interruptions, felt like eating a forbidden fruit.

Family affairs aside, Tuesday morning started with the round-table workshop on a roadmap to shape future research in digitization, so I did not attend the first session of the day. At the beginning of our round-table, we listened to two brief project presentations: Rosi Anatassova of the British Library introduced the recent projects at British Library Labs, and Elsbeth Kwant of the KB in the Netherlands shared the future approaches in the Dutch context.

After the introductions, we were split into two groups to discuss the major challenges and prospects in future research. We were asked to answer the following questions:

  • what stops you from making progress?
  • what would help you to make progress?

Not surprisingly, both groups ended up with similar proposals, which could have an effect on EU policies on research support. The underlying issue was the diversity of the infrastructure: not all libraries (or other institutions) can provide a “full-scale service” for all. Especially among the university libraries, the tasks seem to be too demanding. The small-scale players are important locally, but cannot play the big boys’ game, so they need to think more than twice about how to use their resources. The available resources and competence cannot cover everything the researchers require, especially when you should appease the wider audiences as well. Although I love the diversity of Europe in many ways, some harmonisation of infrastructures on the European level would make sense. These discussions drew most of my attention, even though this was just one of the aspects discussed. Naturally, the issues related to funding and compulsory tasks at the institutions were discussed too.

The triple-c approach of the KB concluded the discussion – future research in digitization is about the purposeful combination of competence (skills, tools, communication etc.), content (our cultural heritage, and its digital quality and quantity) and connectivity (linking, sharing, utilizing). Access to the content and data must be provided for all users, and the barriers between the providers and users (scholars?) must be torn down. I think it is all about three c’s as well: communication, co-operation and cohesion.

After the coffee break we had a chance to follow a very prestigious panel, which was led by Milagros del Corral of UNESCO. The panel had a title as vague as “The digitisation of cultural heritage: modern utopia?”, maybe too broad a topic to tackle in this sort of conference, but I was glad that the hosting institution organized it, because it revealed the approaches that are under discussion among the top-level decision makers in Europe. In addition to Mme Milagros del Corral, the panel had four other speakers (from the ivory tower?): Jill Cousins of Europeana, Michael Keller of the Stanford University Library, Steven Krauwer of Utrecht University and Giuseppe Abbamonte of the European Commission. There were a couple of comments worth noting and replying to here:

Krauwer: “there’s a need from the EU side to support more the development of soft sciences too.” Being a humanist myself, I don’t always agree with the whining in my field of research: the hard sciences get more and we don’t get anything. In my opinion, such a dichotomy is a waste of time and doesn’t lead anywhere at the end of the day. But from the infrastructure’s point of view, this demand is exactly what we pointed to during the round-table when we were speaking about a more flexible infrastructure. We (the soft sciences) probably don’t need fully equipped laboratories to do our work, but some sort of support in the form of an all-European infrastructure would be nice. And I am not referring to Europeana now.

This leads us to the brief presentation by Jill Cousins of Europeana. She underlined that new business models are needed in order to get money back to the institutions that produce the digitized content and publish it online for free. The cash back, according to Cousins, should come from those commercial initiatives that benefit from free content. I understand this view perfectly and have no problem with it, since it feels a bit odd when companies make a profit out of free content that is ultimately paid for by European taxpayers. Should the GLAM sector become more aggressive when it comes to building commercial services on top of its core tasks (collections, preservation, digitisation etc.)? If yes, where are the resources to do so?

The media and data director, Giuseppe Abbamonte of the European Commission, also gave a brief presentation, and I was frightened when he indicated that there are no excuses for low-quality materials in digitized collections, since “36 Mpx cameras available in the market for 2500 euros.” – Well, digitizing any given material is a rather different matter from taking your holiday photos with a high-resolution camera. I think this was the wrong message for the audience. And I am sure that the Rijksmuseum did not take its high-quality pictures with a multi-megapixel camera that any consumer can buy at MediaMarkt. On the contrary, I am sure that if the high-resolution pictures of van Gogh’s paintings had been taken with those “36 Mpx” devices, the increase in the number of visitors would have been lower. But I do agree with him that, in general, as he also indicated, a good online collection lures and seduces a larger audience, as in the case of the Rijksmuseum.

The Awards and Mario Vargas Llosa! Great!

Session number 5 began with a presentation of Wittgenstein’s archive and the tools to explore it. Florian Fink gave a presentation under the title “Wittgenstein’s Nachlass: WITTfind and Wittgenstein Advanced Search (WAST)”. Like many other experts at this conference, he is from CIS in Munich. I liked the idea that you can search Wittgenstein’s archive based on normalised words and phrases, and that you can retrieve both the facsimile and the annotated text. This is something that needs to be understood from a researcher’s point of view – you have to see the context of your enquiry, otherwise there will be eternal doubt about the authenticity of the source. For many of us who are trying to do research in the humanities, the context, albeit in digital form, is the key to validating the source. And in this case, there’s no doubt about it. Also, it is always worth listening to what the last avant-gardist in Finnish art says about Wittgenstein. You may look at the WittFind here.

Beatrice Alex: “Estimating and Rating the Quality of Optically Character Recognised Text”. This was a really refreshing paper, since it dealt with really big data in the humanities. The Trading Consequences project is part of the Digging Into Data programme, and it goes deep into combining text mining, data extraction and information visualisation to explore big historical datasets. The core question was to discover how historians deal with big datasets. It was good to hear from a project where the researchers (i.e. historians) were heavily involved in developing the tools they needed. Data for the research was processed from many different sources, like House of Commons Parliamentary Papers and Early Canadiana Online – the datasets consisted of 10 million document pages and 7 billion word tokens – big data for computational linguists, but good enough for historians 😉 The OCRed text created the problem (old data, poor-quality OCR, even documents that were scanned upside down), so it was wise to ask: to what extent do OCR errors affect text mining? What text is of sufficiently high quality to be understood? How bad is too bad? They simply offered the researchers options to exclude the “bad enough” data from the scope. Their work goes on, since there’s a new big data project about to kick off: Palimpsest, which is supported by the Arts & Humanities Research Council as well. Looking forward to the results.
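
To make the “how bad is too bad?” question a bit more concrete, here is a minimal sketch of one simple way to rate OCR quality: the share of word tokens found in a word list, with a cut-off below which a page is excluded. This is only an illustration and not the method presented in the paper; the tiny word list, the function names and the 0.7 threshold are all assumptions made up for the example.

```python
import re

# Hypothetical mini word list for the example; a real setup would load
# a full dictionary for the language and period of the documents.
KNOWN_WORDS = {"the", "trade", "in", "of", "timber", "from",
               "canada", "was", "growing"}


def ocr_quality(text, known_words=KNOWN_WORDS):
    """Return the share of alphabetic tokens found in the word list (0.0 to 1.0)."""
    # Runs of letters only; digits and punctuation split tokens.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


def keep_page(text, threshold=0.7, known_words=KNOWN_WORDS):
    """Decide whether a page is good enough to stay in scope (assumed cut-off)."""
    return ocr_quality(text, known_words) >= threshold


good = "The trade in timber from Canada was growing"
bad = "Tbe tr4de iu tirnber frorn Cauada vvas grovving"

for page in (good, bad):
    print(round(ocr_quality(page), 2), keep_page(page))
```

A researcher could then move the threshold up or down to decide how much noisy text to leave out of the text mining.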

The last paper of the conference was given by Alicia Fornés, who presented a “Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts”. The paper was linked with the European Research Council funded project Five Centuries of Marriages (5CofM). With crowdsourcing and purpose-built tools, they had managed to convert images into text and text into contextual space. See more about the project here. I am sure that many genealogists, for instance, would benefit from this project in Finland too. The good thing is that the tool is scalable to other projects. It has been used for the analysis of census documents as well.

The poster session was the cherry on the cake. Most of my attention went to OCR enhancement tools and the problems of column detection in different cases. That’s the field where we are struggling in our own project. I think there were a couple of good examples that could be beneficial for us too, so our fellows from Poland and Germany, check your e-mails soon.


I have to give credit to the Czech team. Their poster went astray during the flight from Prague to Madrid, and they had a chance to showcase their project only verbally. I had a long chat with pan Kučera about their project, and I can assure you that the textual corpus of 19th century Czech printings is a smashing case!!! At least I was eager to use it.


Digitisation Days and DATeCH2014 are over for me now, and it is high time to draw some sort of conclusion and give some feedback to the organising committee. Day 1 was probably rather technical for those attendees who weren’t that familiar with the technical issues. On the second day, the practical side was more to the fore, and it was nice to hear about the cases where the technical tools are actually used for real. This afternoon we had a couple of good examples of it. On the other hand, it was also good to hear about the solutions rather than the research itself. For those who are longing for case studies and success stories from the researcher’s point of view, there are a number of seminars dedicated to digital humanities in Europe this summer. I am not going to dump you there, but if you are in need of collegial support in your research, it is worth participating in any given event in this field. There was also some criticism in the air about the absence of libraries at this conference. I think that is a matter of scope and interest groups. Some libraries are always present at these conferences, but what about those that lack digitisation plans and large-scale projects in their home countries? Maybe Succeed and IMPACT could do something for them too and get them involved in these conferences? I reckon that a discussion of the Romanian or Slovak practices, for instance, with digitized content is just as important as the experiences of the BL and the KB, isn’t it?

PS. Sorry for spoiling the digidays hashtag with too many uninformative tweets 😉

Yours &c.,

