The Permissive Digital Archive

Samuli Kaislaniemi (University of Helsinki)

[This is the paper I gave at The Permissive Archive conference at UCL in London on 9 November 2012. This versions includes sections that I skipped when giving the talk – these are indented in the text below. My apologies to those whose images I cribbed: I have linked to my sources, but will remove any and all borrowed images if asked.]

Let me start by saying how happy I am to be here. I don’t think I am the only one at this conference whose life has been positively changed by CELL. And I can’t think of any other academic institution that manages to host conferences that feel like parties!

0. Introduction

The digitisation revolution – for it is a revolution – has changed the way we do historical research. This applies equally to archaeologists and historical linguists, literary scholars and historians: anyone working on the past cannot but be affected by new digital tools and resources. They bring their own share of new challenges – many of which turn out to be old challenges. And they also promise – or seem to promise – to deliver new and exciting results.

I. Terminology: What is a digital archive?

What is a digital archive? The previous two presentations both talked about digital archives, but the term was not defined – so there seems to be a general understanding of what we mean by this term. Kenneth Price[1] has tried to tease out the nuances between different terms used for essentially similar digital resources, but discovered that distinctions are blurred. An Electronic Edition, according to Price, can mean almost anything. They certainly are not restricted to being digital versions of print editions. A digital project, on the other hand, is even more amorphous – but the word “project” has a sense of time, in that projects have a beginning and an end. Projects are either unfinished, or finished. In comparison, a database is usable from the moment it is set up. The term “database”, however, carries connotations of a technical nature – we think of relational databases – but when it is used as a word to describe a digital historical resource, it should be taken metaphorically. “In a digital environment”, says Price, “archive has gradually come to mean a purposeful collection of surrogates.” This is exactly what is more adequately implied by his last term, thematic research collection – and arguably, most digital resources are exactly this. But it doesn’t exactly roll off the tip of your tongue..

I’m afraid a discussion of what is an archive did not fit into this paper in the end, but to give you an idea, here is what archivist Kate Theimer[2] had to say about digital “archives”..

In other words, a digital “archive” is not an archive, but a collection. In contrast, here is Price’s comment again:

I think the use of the word archive is justifiable, sincefor the scholar, a repository is a repository: the details may differ from place to place, but any place you go to for access to original sources is, in essence, an archive.

Given this loose definition, “digital archives” include not only large-scale resources such as EEBO and State Papers Online, but also smaller resources such as the digital editions made here at CELL. And more importantly, I think one’s own personal research collection can be viewed as an archive. I work on archival materials, and my primary tool – after this laptop – is a digital camera. I have compiled a fairly large digital collection, having photographed almost a thousand manuscripts. These will never get published as a collection, of course, but they do form, in essence, my primary archive, which contains in essence surrogates of all the archival materials that I (think I) need.

What can be found in a digital archive? Digitised versions of original sources, of course, as well as metadata and all the other things Jenny Bann mentioned in her paper.

II. Digital dualism

We do not need to be constantly reminded that digitised books and manuscripts are not the same thing as looking at the original, material sources. However, this division into physical and electronic is not always useful, or even accurate.

Nathan Jurgenson[3] has coined the term digital dualism to refer to the false dichotomy between digital and physical worlds. (He actually differentiates between four “ideal” types of digital dualism, which you can see on the slide here – but which I don’t have time to go into.) Digital dualists are those who “believe that the digital world is ‘virtual’ and the physical world is ‘real’”. This is of course a familiar refrain to all of us, included in comments that disparage online communities in general, and the social web in particular. Facebook “is not real”, they say. But Jurgenson criticises the idea that time and energy spent in the digital world subtracts from the physical – he quotes Luciano Floridi: “we are probably the last generation to experience a clear difference between offline and online”. The digital and physical worlds may be ontologically separate, but they are both “real” in the sense of being authentic. That they have very different properties is of course true; but we live in both, and the two worlds interact. Reality, writes Jurgenson, “is always some simultaneous combination of materiality and the many different types of information, digital included.”

Jurgenson notes that “for the vast majority of writers, the relationship between the physical and digital looks like a big conceptual mess”. To remedy the situation, he provides a model of four ideal types of dualism, with “Strong Digital Dualism” at one end – which states that the physical and digital are different realities and do not interact – and Strong Augmented Reality at the other, which states that the realms are party of one single reality and have the same properties. Jurgenson himself takes a milder view, that of “Mild Augmented Reality” – same reality, different properties, interaction.

Lorna Hughes[4] has noted that digital tools and methodologies can well reveal more than traditional approaches: working “with a digital object (a surrogate created from a primary source that has been subject to a process of digitization, or data that were born digital) enables us to recover and challenge the ways in which our senses of time and place are historically and archaeologically understood, something that cannot be effectively communicated through traditional media.”

The usual “argument [is] that digital surrogates distance the scholar from the original sources. They do not. They give the scholar far greater control over the primary evidence, and therefore allow a previously unimaginable empowerment and democratization of source materials”. One great example of studying materiality with digital tools is Kathryn Rudy’s study of “dirty books” – using a densitometer to measure finger grease on pages of late medieval books of hours, revealing the reading habits of their readers, each unique and different from the others. And then there is multi-spectral analysis of palimpsests in order to read the erased text.

Archimedes Codex palimpsest, digitised by the Walters Art Museum

In the future, should we strive for haptic digital representations of manuscripts? Do we want to be able to feel the paper or parchment of a manuscript when viewing it on an iPad? I believe Alison Wiggins made a comment at the recent Digital Humanities Congress at Sheffield to the effect of, it is more useful for the scholar to know what kind of paper is used in a manuscript, than to have the feel of the paper recreated digitally. So perhaps haptic encoding would be more of a Turning-the-Pages –type show-off feature, than something that scholars would find useful. But I digress.

Arguably, then, the materiality of our sources does not get lost in the remediation from physical to digital format. But in any case, we are far more familiar with the visual and textual aspects of digital resources.

III. How using digital archives has changed the way we work and think

The first thing to note about digital archives is that they can be huge. SPOL contains digital images of some 2.2 million manuscripts. As they span 200 years, this comes to, on average, just over 100,000 manuscripts per year. EEBO, while significantly smaller, now has 15 or 20 thousand books available as full text. And the thing about full text is that you can conduct word-searches on it.

Tim Hitchcock[5] has noted that EEBO, ECCO, and other similar resources “have in ten years essentially made redundant 300 years of carefully structured and controlled systems for the categorization and retrieval of information. In the process these developments have also had a profound impact on the way … scholars go about doing research. … it is now possible to perform keyword searches on billions of words of printed text – both literary and historical.”

But what is more, scholars “are expected to search across a large number of electronic sources” – but the process strips them of the opportunity to get to understand the context from which individual elements of information come. (The problem may be also seen to be imposed upon them: scholars – especially students – need to look at “everything” in order not to be considered lazy or neglectful).

And keyword searches make new findings very easy indeed.

Here’s one I did earlier: I looked up the word archive in the Oxford English Dictionary. Then I did a simple keyword search in EEBO, and managed to find an instance of usage of the word 70 years before the first instance recorded by the OED.

(..This is not as amazing as it may seem: in fact, antedating the OED is very easy! But that is what I just showed you.)

But less superficially – to quote Tim Hitchcock[6] again: Keyword searching of printed text “radically transforms the nature of what historians do … in two ways. First, it fundamentally undermines several versions of our claim to social authority and authenticity as interpreters of the past. … If historians speak for the archives, their role is largely finished, as the material they contain is newly liberated and endlessly replicated.” … “Second, the development of searchable electronic archives challenges historians to re-examine the broad meta-narratives which have developed to explain social change. If historians no longer ‘ventriloquize’ on behalf of the archival clerk, then they are free to rethink the nature of social change.” That is to say, if publishing archival findings becomes unneccessary since “everything is accessible online”, then we are free to try to say something bigger.

That, in any case, is the theory: but in practice we are burdened by the curse of Convenience.

Peter Shillingsburg[7] recently wrote: “I was once told that the likelihood that a scholar or student will check the accuracy of a supposed fact is in inverse proportion to the distance that has to be travelled to do the checking. If it can be checked without getting up, high likelihood; across the room, probably but maybe not; out the door across the campus to the library, only if highly motivated. Why? Convenience.”

We are all guilty of this convenience. We say that physical books are better than digital, but we are increasingly likely to prefer online sources.

The constant refrain is that “it’s so much easier to work with whatever is online, and it means you don’t have to travel to see things”.[8] This is particularly true of younger generations, who may only have ever encountered early modern books in EEBO. So we should not be surprised when “[t]hey stay at home and expect archives to work like Google”.[9] And we are also biased towards convenience in using these online sources – if something doesn’t work, we will not do it. We can’t be bothered to learn to use features we don’t know exist. So quite often we end up using EEBO as an online repository of books, without even making full use of its search capabilities.

However, convenience means that we are limited by these convenient sources: our research questions end up being constrained by the digital sources – and by what you can search for in them! Keyword searching, however, falls on its face in front of Early Modern English spelling variation. And don’t get me started on the reliability and accuracy of the transcriptions in EEBO!

But there is a more serious problem with our convenient sources. Last week, at the meeting of the Consortium of European Research Libraries at the British Library, Tim Hitchcock[10] gave what he described on his blog as “a five minute rant”, in which he noted that most digitisation projects – such as EEBO, ECCO, Old Bailey Online, but also the papers of Darwin, Newton, and others – these projects are certainly transformative, but ironically they consist of the Western canon: texts written by the dead, white, male, elite. So, while digitisation projects have produced masses of data – well enough for sophisticated data-mining experiments – the problem is that this data is skewed.

Of course, the counter-argument is that in the humanities we are trained to be aware of the limitations of our sources. But we are also pressed for time and money, and going for the low-hanging fruit is only natural: we are designed for convenience. And in the process we often “forget” to approach our digital sources critically.

And when scholars and others from outside the humanities start to mine this data, for instance by using tools such as Google Ngrams, the results they produce are doubly skewed: first by a poor understanding of the data, and secondly by the limitations of the data itself. (This results in cases like ‘mining’ Google Ngrams for evidence of the history and development of English[11] – but in fact GBooks metadata (that the Ngrams tool uses) is atrocious, with modern editions are frequently mis-tagged as historical texts, and thus the results presented in the Ngram viewer in fact contain, for instance, 3-grams (frequently occuring strings of 3 words) from the “1540s” including 3-grams such as “an edition of” and “in the Bodleian” – which most certainly do not occur in texts from the 1540s).

This is familiar to us from the reporting of experiments in newspapers – all too often in the case of a social psychology experiment, where what has happened is that the researchers have only taken what is known as a “convenience sample” – ie. asked their students. This is not necessarily good or representative, but it sure is convenient! All too often the subjects of study in psychological tests are WEIRD –Western, Educated, Industrial, Rich and Democratic.[12] In biology, the same phenomenon is known as “taxonomic bias” – it is easier to decide to do research on big, cuddly mammals that are easy to find, than small beetles in the rainforest canopy. And in the case of biology, it is also, unfortunately, easier to get funding to do research on animals that seem more “important” to the layman.

(Another probelmatic issue relating to digital resources is that while they are used increasingly by scholars, they do not receive anything like the number of citations they should. Scholars will use EEBO to conduct their study, but then cite the original books – showing a preference for “the real thing” (in spite of their behaviour!).)

IV. The promissory nature of digital humanities and the permissive digital archive

I will wrap up my huge topic with a comment on the promissory nature of digital humanities, and the permissive nature of the digital archive.

Digital humanities is not a new discipline, but there remains a sense of newness and urgency. You might even call it millennialism – the revolution or paradigm shift is said to be “just around the corner”! But I would like to argue that in fact, we are there already. It is just a slow revolution, a revolution in small steps. When I started my studies, early modern English books could only be consulted in specialist collections, or as printed facsimiles. Students today have probably never even seen a printed facsimile – for them, the digital versions on EEBO are “Early Modern English books”.

Digital resources like EEBO are promissive in the sense that their scale and nature theoretically allow for entirely new research questions to be asked, thus paving the way for the promise of new and exciting results. The proliferation of digital resources and tools reflects this – there is a sense that if only we build enough of these things, we will figure out the meaning of it all.

This view has its critics. But as Steven Ramsay has pointed out, “I can now search for the word “house” (maybe “domus”) in every work ever produced in Europe during the entire period in question (in seconds). To suggest that this is just the same old thing with new tools, or that scholarship based on corpora of a size unimaginable to any previous generation in history is just “a fascination with gadgets,” is to miss both the epochal nature of what’s afoot, and the ways in which technology and discourse are intertwined”.[13]

The most striking feature of the digital archive in terms of how it can be permissive, is the way in which these archives can be connected to each other, using and reusing data, adding user-created content, and functioning like a database as well as like an edition, thanks to sophisticated digital analytical tools. There are already projects that have some or all of these features – most of them are relatively small-scale, but that does not detract from their worth. I have to conclude by saying how sorry I am that I had not the time to show you some examples! Luckily the previous two papers gave you some excellent examples.

Thank you very much.

 

—————————

Postscript 13.11.2012

This paper was, in part, about the dangers of using digital resources uncritically. At the same time, I tried to look at some of the ways in which the existence of these resources has affected our research habits. But the following day, thinking over all the excellent papers presented at the conference, and conversations with people during the day, I realized that in fact, I was not convinced that digital resources presented a serious problem, at least to this community of scholars. To be sure, almost everyone uses resources like EEBO – and many participate in the creation of other digitised or digital archives – but everyone makes use of them while being very conscious of their failings in comparison to the physical sources. Everyone is also aware of why we use them: because they greatly facilitate research (making it easier to do some ‘old’ kinds of research, and making it possible to look at new things); and because they are convenient. But convenience is not a bad thing when one has a good understanding of the compromises involved in creating the convenience. As long as we teach this to our students – which we demonstrably are indeed doing – the existence of these resources and tools is nothing less than a blessing.

I think, however, that we could all be more diligent in citing the digital sources we use – not only for scholarly integrity, but also in order to help raise the standing of and appreciation for digital resources. Those of us who create such resources well know how little credit we receive for our tasks, a matter particularly painful considering our output is linked to funding.

 


[1] Kenneth Price, “Edition, Project, Database, Archive, Thematic Research Collection: What’s in a Name?”. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009. http://digitalhumanities.org/dhq/vol/3/3/000053/000053.html

[2] Kate Theimer, “Archives in Context and as Context”. Journal of Digital Humanities Vol. 1 no. 2, 2012. http://journalofdigitalhumanities.org/1-2/archives-in-context-and-as-context-by-kate-theimer/.

[3] Nathan Jurgenson coined the term in “Digital duality versus augmented reality”, 24 Feb. 2011, on the Cybergology blog on the Society Pages website. The above discussion is drawn from “How to kill digital dualism without erasing differences” of 16 Sep. 2012, and “Strong and mild digital dualism”, 29 Oct. 2012, on the same blog. http://thesocietypages.org/cyborgology/.

[4] Lorna Hughes, “Conclusion: Virtual Representation of the Past – New Research Methods, Tools and Communities of Practice”, p. 192. In The Virtual Representation of the Past, ed. by Mark Greengrass and Lorna Hughes. Ashgate, 2007.

[5] Tim Hitchcock, “Digital Searching and the Re-formulation of Historical Knowledge”, pp. 84-85. In Virtual Representation of the Past.

[6] Hitchcock, ibid. p. 89.

[7] Peter Shillingsburg, “How Literary Works Exist: Convenient Scholarly Editions”, paragraph 25. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009. http://digitalhumanities.org/dhq/vol/3/3/000054/000054.html.

[8] Emma Huber, “Using digitised text collections in research and learning”, talk given at the JISC-funded workshop “Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text”, Bath on 24 Sep. 2009. http://www.slideshare.net/ekhuber/using-digitised-text-collections-in-research-and-learning.

[9] Brooks, Stephen. (@Stephen_Brooks_). “@RuthNRoberts @UkNatArchives #digitaltrail they stay at home and expect archives to work like Google.” 30 Aug 2012, 2:21 PM. Tweet. Part of the #digitaltrail discussion hosted by TNA on 30 Aug. 2012, http://blog.nationalarchives.gov.uk/blog/beyond-paper-the-digital-trail. Twitter conversation archived at http://storify.com/LauraCowdrey/beyond-paper-the-digital-trail.

[10] Tim Hitchcock, “A Five Minute Rant for the Consortium of European Research Libraries” (given on 31.10.2012 at the British Library), 29 Oct. 2012, Histryonics blog. http://historyonics.blogspot.co.uk/2012/10/a-five-minute-rant-for-consortium-of.html.

[11] The following example is from John Lavagnino, “Scholarship in the EEBO-TCP Age”, talk by John Lavagnino at the conference Revolutionizing Early Modern Studies? The Early English Books Online Text Creation Partnership in 2012, Oxford, 17 September 201. http://www.slideshare.net/jlavagnino/scholarship-in-the-eebotcp-age.

[12] Samuel Arbesman, “Big data: Mind the gaps”. IDEAS column in The Boston Globe, 30 Sep. 2012. http://www.bostonglobe.com/ideas/2012/09/29/big-data-mind-gaps/QClupxdwdPWHtRrZO0259O/story.html.

[13] From Patrik Svensson, “Envisioning the Digital Humanities”, DHQ: Digital Humanities Quarterly 6.1, 2012. http://digitalhumanities.org/dhq/vol/6/1/000112/000112.html.

Published by

kaislani

No life, just PhD.

2 thoughts on “The Permissive Digital Archive”

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This blog is kept spam free by WP-SpamFree.