DH 2019 – notes

Thanks to support from the Mikkeli University Consortium, the National Library of Finland was present at the DH 2019 conference in Utrecht. The conference is organized annually by the Alliance of Digital Humanities Organizations (ADHO), usually every other year in Europe or in the Americas. The Netherlands was a good choice of venue, since the National Library of the Netherlands is doing excellent work both in developing its library lab and in its researcher-in-residence program, and consequently receives many requests for visits, as we heard in the library labs session during one lunch break. At the pre-conference workshop on building library labs we presented a paper outlining the work we currently do with researchers (morning coffees, data clinics, etc.) and briefly sketching possible future directions for improving tools and services.

DH in European research libraries

A poster on a recent survey of DH in European research libraries

About the conference

The conference was large and very popular: at the last minute the number of participants grew to a bit over 1,000, so the venue was packed. This was despite four parallel tracks, which nicely spread the participants across different parts of the venue. The venue was a music hall, with halls and rooms for 30 to 1,000 people in its different wings. It was usually simplest to stay in one session for its full length to secure a seat where you wanted one.

Researchers, scholars, academics, go!

I tended to pick mostly long paper sessions, with academics from universities and research institutes all over the world. The fields of study also varied: quantitative and qualitative data, machine learning, and collections in many different languages were covered. The material was mostly digital, but microfilm was mentioned too.

The multitude of tracks meant a cornucopia of good sessions to choose from. Whatever you picked, it still felt as if you were missing out on something, but that is how it usually goes at these events.

One side discussion on Twitter concerned the woes of lacking digitisation. The situation naturally varies between countries: some have digitised millions of pages of their collections, while others are only just starting.

As expected, there were a few mentions of OCR quality, and at least one quick remark about OCR post-correction. Newspapers, books, libraries and archives of many kinds were used (even if not all of them were named). One interesting case was the Impresso project, which talks about “newspapers as an Eldorado”; with the addition of the NewsEye and READ (Transkribus) developments, newspapers could become even more of one in the future.

Many libraries were developing platforms in which machine learning capabilities are part of the solution offered to researchers. The British Library had ‘Curatr’, where word embeddings work in the background and let researchers build their own small subcorpora around their research question, as the tools are integrated into the presentation system. One main feature there was the ability to switch between distant reading and close reading: you could go back to the original page, which could explain why a certain item was included in the search results. There was also work on authorship verification, dictionaries and DH, maps, image capture of various material types, segmentation, fake news, and multi-language woes (i.e. why is so much research done on English corpora?).

There were more concrete questions too: how to get the new generation to read, and how should collections be made available online and combine different media? The question of what kinds of features are needed from newspaper collections initiated an ad-hoc meetup on Friday, to collect researcher needs and keep working on them up to and at the coming DH2020 conference.

Finland

Finnish universities were well represented at the conference. The Comhis group had several papers and posters that talked about utilizing the national bibliography or newspaper collections. They tweeted the slides and abstracts for anyone to use:

and the interesting, novel ways in which digitisation process metadata can be used to show and tell the changes in the newspaper landscape.

The University of Turku also presented papers related to Comhis and Oceanic Exchanges.

 

Transkribus seminar 2019

The weather was cloudy, but as the Transkribus project reached its final days, a seminar was held at the National Archives of Finland (NAF) to go through the outcomes of the project. After welcoming words from Maria Kallio and Vili Haukkovaara we moved on to presentations from the developers and users.

Panel participants of the seminar.

Günter Mühlberger from the University of Innsbruck talked about READ-COOP and how it all began. The consortium included archives doing digitisation, but also universities, some doing manual transcription and others with expertise in transcription and pattern matching.

Facts of Transkribus

  • EU Horizon 2020 Project
  • EUR 8.2M funding
  • 1.1.2016 – 30.6.2019
  • 14 partners coordinated by the University of Innsbruck

Focus:

  • research: 60% – Pattern recognition, machine learning, computer vision,…
  • network building: 20% – scientific computations, workshops, support,…
  • service:  20% – set up of a service platform.

 

Service platform:

  • digitisation, transcription, recognition and searching of historical documents
  • research infrastructure

The number of users has doubled every year.

In the week of 4–11 April 2019:

  • images uploaded by users: 98,166; new users: 344; active users: 890; created docs: 866; exported docs: 230; layout analysis jobs: 1,745; HTR jobs: 943
  • Model training is one major thing

Training data

  • Jan 2019: 228 HTR models trained by Transkribus users
  • Pages: 204,359; words: 21,200,035
  • Value: about 120 person-years, monetary value ca. EUR 2–3 million

 

Co-op society SCE presentation

  • enables collaboration between independent institutions
  • distributed ownership and sharing of data, resources and knowledge are in focus

Main features of an SCE

  • Mixture between an association and a company
  • Members acquire a minimum share of the SCE – 1000/250 Eur
  • Open for new members, members can leave anytime

Foundation of READ-COOP: 1.7.2019. The founding members are already listed on the webpage.

“Every institution or private person willing to collaborate with Transkribus is warmly invited to take part”.

For example, the NewsEye project was seen as one collaborative project contributing to Transkribus in the layout analysis part. With NAF there is also a named entity recognition effort where work continues.

During the presentations from the developers and users of the Transkribus platform it became evident that the software has indeed matured. Character error rates were around 10% and at best below 3%, so the system has evolved considerably. Another interesting topic is keyword spotting (KWS), where the search can still find a particular word of interest in the data even when the OCR is poor.
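
For readers unfamiliar with the figures: the character error rate (CER) is commonly computed as the edit distance between the recognised text and a ground-truth transcription, divided by the length of the ground truth. The sketch below is only an illustration of that definition; the function names and the sample strings are my own assumptions and not part of Transkribus.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one misread character ("l" read as "1").
ground_truth = "Helsingfors Dagblad"
recognised = "Helsingfors Dagb1ad"
print(f"CER: {character_error_rate(ground_truth, recognised):.1%}")
```

With this definition, a CER of 3% means roughly three character edits are needed per 100 characters of ground truth.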

Also interesting was ‘text-to-image’, where it might or should be possible to “align” a clean text with its corresponding page image. This could help because in some cases blogs or even Wikipedia have corrected text that is separated from the original collection.

The segmentation tool (dhSegment) also looked promising, as segmentation is sometimes where the problem lies, and the tool supports diverse documents from different time periods. Using masks drawn on the page, the neural network can be taught to find anfangs, illustrations or whatever part of the page is the target. The speaker also said that homogeneous collections might not even need that much training data.

As of today, Transkribus becomes a co-operative society, so it remains to be seen how the project outcomes evolve now that the project itself has ended.

 

Arkistopäivät 2019

People from Digitalia were also present at Arkistopäivät 2019 (the Finnish archives days), together with the Finna services. Arkistopäivät is a joint seminar for actors in the archive, museum and public administration sectors, where participants update their skills and hear about development work going on in different places. The event is held every four years; this was the ninth time, and there were more than 300 participants.

Visitors at the exhibition stand

Digitalia was mentioned in Johanna Lilja’s presentation, as was the NewsEye project, which continues the work on improving the research use of newspapers in international collaboration.

Throughout the days the presentations were good and showed how the archival world is developing with regard to digitality. The talks covered information security from the viewpoints of both the individual and the organisation, as well as the point that local customers should not be forgotten. The value of archival work was also brought out through both user discoveries and researchers’ needs. Digitality was the word of the days, and there were already interesting lines of development in place; it will be interesting to see how end-user services are built out of all of it. The Information Management Act (tiedonhallintalaki) was also on many minds, and by the time of the next Arkistopäivät it will already be in implementation.

 

The OECD’s basic pillars of AI

At the end of May the OECD published five principles for AI development work, which should be taken as starting points when developing AI. A longer background document also explains the vocabulary and gives further explanations behind the principles.

AI system: An AI system is a machine-based system that can, for a given set of human-defined  objectives, make predictions, recommendations, or decisions influencing real or virtual  environments. AI systems are designed to operate with varying levels of autonomy.([2], p. 7)

 

 

The original text and a free-form Finnish translation are listed below.

  • EN: AI should benefit people and the planet by driving inclusive growth, sustainable development and well-being.
    • FI: Tekoälyn tulisi tuoda ihmisille ja planeetalle hyötyä, kannustamalla mukaanottavaan kasvuun, kestävään kehitykseen ja hyvinvointiin.
  • EN: AI systems should be designed in a way that respects the rule of law, human rights, democratic values and diversity, and they should include appropriate safeguards – for example, enabling human intervention where necessary – to ensure a fair and just society.
    • FI: Tekoälyjärjestelmät tulisi suunnitella tavalla, joka kunnioittaa lain sääntöjä, ihmisoikeuksia, demokraattisia arvoja ja monimuotoisuutta, ja niiden pitäisi sisällyttää sopivat turvat – esimerkiksi ihmisen väliintulon tarvittaessa – mahdollistamaan rehellisen ja oikeudenmukaisen yhteiskunnan.
  • EN: There should be transparency and responsible disclosure around AI systems to ensure that people understand AI-based outcomes and can challenge them.
    • FI: Tekoälyjärjestelmien tulisi olla läpinäkyviä ja vastuullinen ilmoittaminen niistä, jotta ihmiset ymmärtävät tekoälypohjaiset lopputulokset ja voivat haastaa ne.
  • EN: AI systems must function in a robust, secure and safe way throughout their life cycles and potential risks should be continually assessed and managed.
    • FI: Tekoälyjärjestelmien täytyy toimia vakaalla, tietoturvallisella ja turvallisella tavalla koko elämänkaarensa ajan ja mahdollisia riskejä tulisi koko ajan arvioida ja hallita.
  • EN: Organisations and individuals developing, deploying or operating AI systems should be held accountable for their proper functioning in line with the above principles.
    • FI: Organisaatioiden ja yksilöiden jotka kehittävät, julkaisevat tai käyttävät tekoälyjärjestelmiä tulisi pitää vastuullisena siitä, että ne toimivat kunnollisesti ylläolevien periaatteiden mukaan.

It will be interesting to see whether these principles start to be taken into use. In any case, 42 countries have already adopted the principles, and the expert group that created the definitions was assembled from participants from many different quarters.

The 12 fields of action of Germany’s AI strategy

Germany’s AI strategy

Germany’s AI strategy moves in the middle ground between measures and goals. The strategy, drawn up at the end of 2018, lists 12 goals through which the use of AI in the state and elsewhere in society is to be increased.

In Europe

In Europe, the ‘Ethics guidelines for trustworthy AI’ were published in April 2019. They created a framework within which AI should be developed in different parts of society, as depicted in the accompanying figure [4].

In Finland

In Finland the national AI programme is known as AuroraAI. Its implementation plan is based on the ‘Tekoälyaika Suomessa’ report and on working group work in which the different parts have been covered more extensively.

Antti Rinne’s recent government programme mentions AI eight times.

“We will ensure that AI systems do not make use of operating models that are directly or indirectly discriminatory.”

This would seem to be well in line with the AI principles presented earlier.

 

References

[1] OECD. (2019b). OECD Principles on Artificial Intelligence. Available at http://www.oecd.org/going-digital/ai/principles/

[2] OECD. (2019a). OECD Legal Instruments. Available at https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
[3] European Commission. (2019, April 8). Ethics guidelines for trustworthy AI [Text]. Retrieved 30 May 2019, from Digital Single Market – European Commission website: https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai
[4] The Federal Government. (n.d.). Home – KI Strategie. Retrieved 30 May 2019, from https://www.ki-strategie-deutschland.de/home.html?file=files/downloads/Nationale_KI-Strategie_engl.pdf

DHH19 – remote live notes

As in earlier years, it was again time for the DHH hackathon, DHH19. This time there were:

  • 38 participants
  • 10 days
  • 4 teams and projects
  • Computer science + social science + humanities

This time DHH19 went international, and the group was diverse.

 

Each team had a communications role, which explains at least the teams that were active on Twitter, and participants were also interviewed.

13:20-14:30 Group presentations

DHH19-Brexit group

RQ: What is the relationship between social media and politics?

  • Types of tweets: retweets (a massive amount), originals, replies, quotes
  • Checked regional tweets and who the most dominant users are.
  • A single user can be responsible for a certain combination of hashtags (an account that has in any case been suspended)
  • Real-life events vs. events on Twitter
  • Several hashtag groups can be identified; these were visualized and put on a timeline (on the assumption that a hashtag correlates with a point in time)

  • Research on subgroups: all users, verified users, members of parliament
  • Frequency analysis: most shared Twitter content vs. external content.
  • Blame: groups mobilise different content to further their agenda
  • Who are people angry at? “The anger scale” (person IDs in angry tweets)
  • Hashtags of MPs; hashtag avoidance by politicians? (removing themselves from the discussion?)
  • Problems: how well does the dataset match the population? E.g. the PM is missing from the data?
    • Experimented with NER: ‘May’ ends up as a date 🙂
    • The tweet JSON contains so many fields, and retweets and hashtags did not end up in the correct fields because of the way the API was used
    • https://blogs.helsinki.fi/digital-humanities-hackathon/brexit
  • Could not dig deeper into the data, as historical data would have cost several thousand euros.
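
As an aside, the four tweet types listed above can be told apart from the tweet JSON itself. The sketch below assumes Twitter API v1.1-style tweet objects, where a retweet carries a `retweeted_status` object, a quote has `is_quote_status` set, and a reply has a non-null `in_reply_to_status_id`. I do not know how the team's data was actually structured, so the sample tweets and function names here are purely hypothetical.

```python
from collections import Counter


def tweet_type(tweet: dict) -> str:
    """Classify a v1.1-style tweet object into one of four types."""
    # Check retweets first: a retweet of a quote tweet also has
    # is_quote_status set, but should still count as a retweet.
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("is_quote_status"):
        return "quote"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"


def count_tweet_types(tweets) -> Counter:
    """Count how many tweets fall into each type."""
    return Counter(tweet_type(tweet) for tweet in tweets)


# Hypothetical, heavily shortened tweet objects for illustration only.
sample_tweets = [
    {"text": "RT @a: ...", "retweeted_status": {"text": "..."}},
    {"text": "RT @b: ...", "retweeted_status": {"text": "..."}},
    {"text": "@a I disagree", "in_reply_to_status_id": 1},
    {"text": "Look at this", "is_quote_status": True, "quoted_status": {"text": "..."}},
    {"text": "My own thoughts on #brexit"},
]
print(count_tweet_types(sample_tweets))
```

A quote that is also a reply is counted as a quote here; that ordering is a design choice of this sketch, not something the group reported.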

 

 

DHH19-EU Parliament group

  • Update on communications
  • “We communicated”
    • the future, << political discourse & debate >>,  history, the past
  • What is EU?
  • Whole corpus
    • 249,955 speeches in English, 50M words, 2M sentences
    • metadata: speaker (name, country, political functions, gender, date, topic)
    • change in protocol in 2013: after 2013, 99.7% of the English speeches are from the UK and IRL
  • Created subcorpora:
    • picked keywords to find suitable parts of the data: brainstormed ideas for subsets, observed the data, tried topic modelling and word embeddings, and then selected the keywords
    • the word ‘past’ might work if they talk about the last session
    • Past: 13,230 speeches; future: 5,430 speeches; totalitarianism: 4,517; climate change: 12k
  • Key results
    • Past and future are very related
    • Construction of Future (network with ‘key hubs’)
    • Thinking conceptually what is the future
    • Topic modelling: which topics are connected with the past or with the future
    • Sentiment analysis: speakers seem to be less emotional and more concrete when talking about the future (existing tools were used for the analysis)
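
The keyword-based subcorpus creation mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming each speech is a dictionary with a `text` field; the field names, the sample speeches and the function name are hypothetical, and the group's actual pipeline may well have worked differently.

```python
import re


def build_subcorpus(speeches, keywords):
    """Return the speeches whose text contains any keyword as a whole word."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # \b word boundaries keep 'past' from matching inside 'pasta'.
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return [speech for speech in speeches if pattern.search(speech["text"])]


# Hypothetical speeches for illustration only.
speeches = [
    {"speaker": "A", "text": "In the past we failed to act on climate change."},
    {"speaker": "B", "text": "Our future depends on this vote."},
    {"speaker": "C", "text": "Pasta is not on the agenda."},
]

past_subcorpus = build_subcorpus(speeches, ["past"])
print([speech["speaker"] for speech in past_subcorpus])
```

Plain keyword matching like this says nothing about what sense a word is used in, which is presumably why the group combined it with observing the data and topic modelling.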

  • variance of sentiment analysis
  • sentiment analysis results compared by gender and country -> this is how the 2013 issue was noticed
  • Qualitative analysis: close reading to extract ideas linking the past and the future. MPs use elements of the ‘past’ as something to avoid or sometimes as something to preserve, together with generic human values.

‘Future’ is something to plan, fear or hope for

* RQ: Case study “20thcenttot”: How is the horseshoe theory present in discourse on the topic, and how does it differ by region (E/W)?

  • strongly connected in discourse, with no significant differences between East and West
  • frequency analysis of specific terms: a spike is visible in 2005.

* RQ: Case study: the climate crisis. How is climate change talked about in the EP? Are there differences in how different countries talk about it?

  • Proportion of speeches on climate change: spike from 2007 to 2010 (a combination of current events; 2008–2011 was the first implementation of the Kyoto protocol). Might be affected by the number of speeches
  • Sentiment analysis by country: a problem was found, e.g. Lux vs. IRL, something in the data (number of speeches vs. climate speeches)
  • False positives and negatives in understanding the sentences.
  • Compared Finland and Spain: the word clouds differ. In Finland ‘emission’ and ‘carbon’ are at the top, in Spain ‘water’ and ‘mediterranean’

Challenges

  • defining the past and the future: how to detect it in the corpus?
  • Missing English translated speeches and missing metadata
  • English corpus included other languages

More work to do: look at differences between political parties; issues with the dataset should be solved.

A break: coffee (&tea) consumption in DHH2019

15-16 Groups

DHH19-Genre and Style group

“Name-dropping in 18th century public discourse”

RQ: Which personal names are frequently mentioned in 18th century British publications? On the basis of the frequency and co-occurrence of individual names, what kinds of patterns can we detect that are characteristic of genres and time periods?

Data:

  • Eighteenth Century Collections Online (ECCO)
  • high representativeness: ca 50% of all 18th century British printed texts (180k titles, 32M pages)
  • OCR issues, unreliable metadata

Analysis

  • Keyword analysis
  • Three subsets over time: history, religion, social affairs
  • Methods

Results:

  • “Even when data was good it was not standardized”
  • Name-dropping was used in the texts (Queen Elizabeth, Cicero,…)
  • Identified the most commonly mentioned people
  • Looked at this over time (20-year periods)
  • The people mentioned in books share some contextual similarity (network of books; clusters of religious texts, historical texts and social affairs, with subclusters)
  • Origins of mentioned individuals (ancient, classical, early medieval, late medieval, renaissance, contemporary, NA)
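
To make the frequency and co-occurrence idea concrete, here is a minimal sketch of how such counts could be produced once names have been extracted per book. It assumes the NER step has already been done and simply counts, for each name, in how many documents it occurs and which pairs of names share a document. The input format, the function name and the sample data are hypothetical and not taken from the group's work.

```python
from collections import Counter
from itertools import combinations


def name_statistics(documents):
    """Count document frequency and pairwise co-occurrence of names.

    `documents` is an iterable where each item is the collection of
    personal names found in one document.
    """
    frequency = Counter()
    cooccurrence = Counter()
    for names in documents:
        # Count each name once per document, in a stable order so that
        # a pair is always stored as the same (first, second) tuple.
        unique_names = sorted(set(names))
        frequency.update(unique_names)
        cooccurrence.update(combinations(unique_names, 2))
    return frequency, cooccurrence


# Hypothetical extracted names for three books.
documents = [
    ["Cicero", "Queen Elizabeth", "Cicero"],
    ["Cicero", "Virgil"],
    ["Queen Elizabeth", "Cicero"],
]

frequency, cooccurrence = name_statistics(documents)
print(frequency.most_common(2))
print(cooccurrence.most_common(1))
```

The co-occurrence counts are effectively the weighted edges of a network, which could then feed the kind of clustering of books and names described in the results.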

Future research:

  • Automatic genre classification based on named entity networks
  • Improve NER and subsets by using domain-specific resources
  • examine less popular entities and their role
  • focus on 1st editions, inspecting specific sections of books
  • creating interactive visualizations

Communication: blogs (Medium & blogs.helsinki.fi), Twitter (1 post from everyone), Instagram.

DHH19 – Newspapers

Why newspapers? “Most important public record from the past”

Research themes:

  • presence of ads
  • language of persuasion

Approach

  • working in smaller groups, 2 meetings every day. Interlinking blogs

Results

  • Presence of ads in the Morning Post 1800-1900
  • everything compared to the quantity of ads
  • Åbo Underrättelser 1824-1890
    • Only a handful of Finnish newspapers have ad segmentation, none in 1892-…
  • The development of ads in British newspapers 1800-1900: country, city and paper level
  • The presence of ads in the Morning Post 1800-1900

The language of persuasion in advertisements

  • What linguistic strategies were used for persuading people into buying or using?
  • close reading of 1 newspaper issue per decade
  • Testimony: authority & ordinary people
  • Advertisements used adjectives describing the high quality of the product – compared the ratio in ads vs. articles -> no differences
  • Interesting was the ratio of words (ads vs. articles): ads used short sentences that are easy to remember
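
The observation about short sentences in ads could be checked with something as simple as average sentence length. The sketch below is only an assumption about how such a comparison might be done: it splits naively on sentence-ending punctuation, so abbreviations and OCR noise would distort the result, and the two sample texts are invented.

```python
import re


def average_sentence_length(text: str) -> float:
    """Average number of words per sentence, using a naive sentence split."""
    # Split on runs of '.', '!' or '?' and drop empty fragments.
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    if not sentences:
        return 0.0
    return sum(len(sentence.split()) for sentence in sentences) / len(sentences)


# Invented examples, not taken from the newspaper data.
ad_text = "Cures all ills. Try it today! Sold everywhere."
article_text = (
    "The committee met on Tuesday to discuss the proposed changes "
    "to the harbour regulations."
)

print(f"ad: {average_sentence_length(ad_text):.1f} words per sentence")
print(f"article: {average_sentence_length(article_text):.1f} words per sentence")
```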

3. “The pills that cure everything – drug advertisements” (a subcorpus)

  • compared by gender
  • found out which illnesses the ads targeted
  • compared descriptive language between different ads, e.g. ‘nervous’ became less important in ads for women, but not in ads for men; maybe the treatment of the word changed.
  • the illustrations in the ads were interesting across different genres (very different contexts, emotions visible, e.g. for nervousness)

4. Job advertisements, class and societal change?

  • most sought occupations in newspaper ads /HISCO categories
  • requirements connected to different occupations
  • social and cultural expectations regarding these jobs
  • differences between countries in terminology, and even industrial changes
  • top 15 occupations in Åbo Underrättelser : e.g. teachers !

Problems

  • OCR…
  • article separation (British newspapers)

Plan for future research

  • 10 days, couldn’t do everything they wanted
  • typology, trajectories, rhetorical tropes, medical/drugs & hysteria, marketing strategies

Dashboards available online.

Summa summarum

It sounds like the hackathon was very interesting and the teams got far. Tools were used in interesting ways. We need to consider how the ideas here could also be used in the research and development work done in the Digitalia project. This gave good food for thought 🙂