At the beginning of this week, I attended the two-day Big Data Approaches to Intellectual and Linguistic History symposium at the Helsinki Collegium for Advanced Studies, University of Helsinki. Since Tuesday, I’ve found myself pondering topics that came up at the symposium. So I thought I would write up my thoughts in order to unload them somewhere (and thus hopefully stop thinking about them: I have a chapter to finish, and not on digital humanities stuff), and also in order to try to articulate my reflections on what was discussed there more clearly than in the jumbled form inside my head. I.e. the usual refrain, ‘I need to hear what I say in order to find out what I think’.

So here goes.

NB this is not a conference report, in that I’m not going to talk about specific presentations given at the symposium. For that, check out the slides from the presentations linked from the conference website, and see also the Storify of the tweets from the event (both including those from the workshop that followed on Wednesday, Helsinki Digital Humanities Day).


I’ve been a part of the DH (Digital Humanities) community for about ten years now. I started off working on digital resources – linguistic corpora, digital scholarly editing; I’ve even fiddled with mapping things – but have in recent years not been actively engaged in resource- or tool-creation as such. Yet I use digital and digitised resources on a daily basis: EEBO frequently, the broad palette of resources available on British History Online all the time, and, when I have access to them, State Papers Online and Cecil Papers (Online). (I work on British state papers from around 1600, and am lucky in that much of the material I need has been digitised and put online in one form or another). I also keep an eye on what happens in the DH world: I attend DH-related conferences and seminars and whatnot when I can, subscribe to LLC (Literary & Linguistic Computing, about to be renamed DSH, Digital Scholarship in the Humanities), and hang out with DHers both online (Twitter, mostly) and in real life.

All this goes to say that I feel quite confident about my understanding of DH projects at the macro level. (Details, certainly not: implementation, encoding, programming, etc etc).

Thus, attending a DH symposium on ‘big data’, I expected to hear presentations about things I was already familiar with. And this turned out to be the case: there were descriptions of/results from projects, descriptions of methodologies (explaining to those from other disciplines ‘what is it we do’), and explorations of concepts that keep coming up in DH work.

Don’t get me wrong: I found all the presentations (that I saw) very good, and listening to talks by people in other disciplines does give you new perspectives. Maybe not profound ones, and often you end up thinking/feeling there’s little or no common ground so why do we even bother? But it’s not a completely useless exercise. Yet what I felt to be the take-away points from this symposium were ones I feel keep coming up at DH events that I have attended over the years, and ones that we – meaning the DH community – are well aware of. Such as (by no means a comprehensive list):

1. Issues with the data

  • “Big Data” in the humanities is not very big when compared to Big Data in some other fields
  • We know Big Data is good for quantity, but rubbish for quality
    • We are aware of the importance and value of the nitty-gritty details
  • We know that manual input is a required part both of processing/methodology – in order to fine-tune the automatic parts of the process – and, more importantly, of the analysis of the results (Matti Rissanen’s maxim: “research begins where counting ends”)
  • We know that Our data – however Big it is – is never All data (our results are not God’s Truth)
    • We are aware of the limits of the historical record (“known unknowns, unknown unknowns”)

2. Sharing tools and resources

  • We need to develop better tools, cross-disciplinary ones
    • Our research questions may be different, but we are all accessing and querying text
  • We need to develop our tools as modular “building blocks”; ‘good enough’ is good enough
  • We need to share data/sources/databases/corpora/materials – open access; copyright is an issue, but we’re all (painfully) aware of this

Clearly, these are important points that we need to keep in mind, and challenges that we want to address. And repetitio mater studiorum est. So why do I feel that their reiteration on Monday and Tuesday only served to make me grumpier than usual?*

In the pub after Wednesday’s workshop, we talked a little bit about how pessimistically these points tend to be presented. “We can’t (yet) do XYZ”. “We need to understand that our tools and resources are terrible”. …which now reminds me of what I commented in a previous discussion, on Twitter, early this year:

One element in how I feel about the symposium could be the difficulty of cross-disciplinary communication. This, too, is familiar to me seeing as I straddle several disciplines, hanging out with historical linguists on the one hand, historians on th’other, and then DHers too. I once attended a three-day conference convened by linguists where the aim was to bring linguists and historians together. I think only one of the presentations was by a historian… So yeah, we don’t talk – as disciplines, that is: I know many individuals who talk across disciplinary borders. …and, come to think of it, I know a number of scholars who straddle such borders. But perhaps it’s just that at interdisciplinary events there’s a required level of dumbing-down on the part of the presenters on the one hand, and inevitable incomprehension on the part of the audience on the other. Admittedly, it is incredibly difficult to give an interdisciplinary paper.

A final point, perhaps, in these meandering reflections, is of course the wee fact that I don’t, in fact, work on research questions that require Big Data.† (At the moment, anyway). So I’m just not particularly interested in learning how to use computers to tell me something interesting about large amounts of text – something that it would be impossible to see without using computational power. It’s not that the methodologies, or indeed the results produced, are not fascinating. It’s just that I guess I lack a personal connection to applying them. ..but then, I suppose this can be filed under the difficulty of interdisciplinary communication! ‘I see what you’re doing but I fail to see how it can help me in what I do’.


So how to conclude? I guess, first of all, kudos to HCAS for putting the symposium together – and, judging from upcoming events, for playing an important part in setting DH in Finland in motion. It’s not as if there’s been nothing previously, and HCAS definitely cannot be credited for ‘starting’ DH activities in Finland in any way – some of us have been doing this for 10 years, some for 30 years or more, and along the way, there have been events which fall under the DH umbrella. But only in the past year or so has DH become established institutionally at the University of Helsinki: we have a professor of DH now, and 4 fully-funded DH-related PhD positions. Perhaps it was the lack of institutional recognition that made previous efforts at organizing DH-related activities here for the most part intermittent and disconnected. But we’ll see how things proceed: certainly many of us are glad to see DH becoming established in Finnish academia as an entity. And judging by the full house at the symposium and the workshop that followed, it would appear that there are many of us in the local scholarly community interested in these topics. The future looks promising.

It should also be said that DH has come a long way from what it was ten years ago. The resources and tools we have today allow us to do amazing things. Just about all of the presentations at the symposium described and discussed projects that use complicated tools to do complex things. I am seriously impressed by what is being done in various fields – and simply, by what can be done today. And there is no denying that there is a Lot of work being done across and between disciplines: DH projects are often multidisciplinary by design, and many are working on and indeed producing tools and resources that can be useful to different disciplines.

Maybe it’s just the season making me cranky. You’ll certainly see me at the next local DH event. Watch this space..



* ..Maybe it’s just conference fatigue that I’m struggling with? There are only so many conference papers one can listen to attentively, and almost without exception there is never time to do much but scratch the surface and provide but a thin sketch of the material/problem/results/etc. (It’s rather like watching popular history/science documentaries/programs on tv: oh look, here’s Galileo and the heliocentric model again, ooh with pictures of the Vatican and dramatic Hollywood-movie music, for chrissakes). (I mean, yes it’s interesting and cool and all that but oh we’re out of time and have to go to commercials/questions). (So in order to retain my interest there needs to be some seriously new and exciting material/results to show, like those baby snow geese jumping off a 400ft cliff (!!!!) in David Attenborough’s fantastic new documentary Life Story, or all the fantastic multilingual multiscriptal stuff in historical manuscripts that we have only just started to look at in more detail. If it’s yet another documentary about Serengeti lions / paper about epistolary formulae in Early Modern English letters, I’m bound to skip it. I’m willing to be surprised, but these are well-trodden ground).   /rant

† Incidentally, I disagree with the notion that in the Humanities we don’t have Big Data – I would say that this depends on your definition of “big”. While historical text corpora may at best only be some hundreds of millions of words, and this pales in comparison to the petabytes (or whatever) produced by, say, CERN, or Amazon, every minute, I see (historical) textual data as fractal: the closer you look at it, the more detail emerges. Admittedly, a lot of the detail does not usually get encoded in the digitised corpora (say, material and visual aspects of manuscript texts), but there’s more there per byte than in recordings of the flight paths of electrons or customer transactions. Having said this, I’m sure someone can point out how wrong I am! But really, “my data > your data”? I don’t find spitting contests particularly useful in scholarship, any more than in real life.

Did English spelling variation end in the 1630s?

1. Early Modern English spelling variation

Yesterday, rather late in the evening, I followed a link on Twitter:

This led to the great Early Modern Print : Text Mining Early Printed English website where there was an interface like the Google Books Ngram Viewer but for the EEBO-TCP corpus, called EEBO Spelling Browser (or more technically, EEBO-TCP Ngram Browser). With the delight of a researcher falling upon a new toy I started to play with it – but hadn’t even started when I was struck by the figure that is displayed when you navigate to the EEBO-TCP Ngram Browser page. It looks like this:

[Figure: 1630s spelling change EEBO 1]

The idea of an ngram viewer – as per Google – is to look at the frequency of occurrences over time, of a word (a 1-gram) or a phrase (2-, 3-, 4- … N-gram). Frequency here means the proportion of the search phrase to all the words in the corpus, plotted over time. So for instance, the frequency of the word “war” rises during wartime, and falls in peacetime. But things get much more interesting when you look at less obvious things.
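To make that definition concrete, here is a minimal sketch of how such a relative frequency could be computed. It is an illustration only, and not how Google or the EEBO-TCP Ngram Browser actually do it: the function name `relative_frequency`, the `(year, text)` input format and the toy sentences are all hypothetical, and the tokenization is nothing more than lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def relative_frequency(docs, term):
    """Return {year: occurrences of `term` / all tokens from that year}.

    `docs` is an iterable of (year, text) pairs (a hypothetical input format).
    """
    hits = defaultdict(int)    # year -> number of tokens equal to `term`
    totals = defaultdict(int)  # year -> number of all tokens
    for year, text in docs:
        tokens = text.lower().split()  # naive tokenization, for illustration
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Toy data, invented for this example
docs = [
    (1620, "war and peace and war"),
    (1621, "peace at last"),
]
print(relative_frequency(docs, "war"))  # {1620: 0.4, 1621: 0.0}
```

Dividing by the total number of tokens per year is what makes years with very different amounts of surviving text comparable.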

Anyway. The point of the EEBO-TCP spelling variant ngram viewer is to compare the change and development of spelling variants over time: for instance, plotting “spell” against “spelle”, “spel”, etc:


English spelling only became standardized in the 18th century, and anyone who wants to read earlier texts has to learn to deal with the fact that apparently all spellings were equally acceptable, and that writers haphazardly used the first spelling that came to their mind* – one of the most famous (or notorious) examples being how William Shakespeare signed his name in six different ways. Despite eventual standardization, spelling variation in English has not completely disappeared today, for although varying how you spell your name today sounds outrageous and unthinkable, all students of English as a foreign language have to learn that there are British and American spellings for many familiar words: colour and color, standardize and standardise, etc.


2. What the hell happened in 1625?

But to return to the EEBO-TCP Spelling Browser, what struck me was the dramatic change in the 1630s. If you look back to the first figure above, you can see that of two spellings of the word above, the spelling “aboue” is essentially the given form until 1625, when it rapidly loses to the alternative spelling “above”, which is firmly established by about 1640.

..Hang on, what? The centuries-old practice of not differentiating between the graphemes <u> and <v> according to the phonemes they indicate – /u/ and /v/ – is replaced, over the stunningly short period of 15 years and across the board (!?) in printed texts, by a consistent mapping of <u> to /u/ and <v> to /v/..!?

Sooo many questions.

My very first thought was that it must be an artefact of the dataset. One word, of course, hardly tells the whole story. Did this change hold for other words that show u/v variation? What about i/j variation? Or perhaps the EEBO-TCP material was somehow skewed?

But however much I fiddled with the browser, the period between 1620 and 1640 remained the significant factor. And the same applied to i/j-words:

[Figure: 1630s spelling change EEBO 2]

But I did also check EEBO proper – knowing that the results may well be different from those of the EEBO-TCP Ngram Browser. However, it turned out once again that the Browser had been right:

[Figure: joy / ioy in EEBO proper]

So what on earth happened in the 1620s and 1630s to explain this dramatic shift into standardized spelling?

…actually, I don’t know. Googling revealed that this is a known phenomenon – although so far I have not found a definitive study of it, nor a good explanation. (Clearly it has something to do with what’s going on in printing houses.) For instance, in her article in the Cambridge History of the English Language vol 3 (2000), Vivian Salmon discusses historical variation in using <u> and <v> to indicate both /u/ and /v/, and then quite casually mentions how “the distinction was made in the 1630s” (p. 39). I think that a corpus-based study of this change remains to be done – although I could be wrong.

Yet rather than starting to look at this point in more detail, I pursued another question that had come to mind: how did this shift in orthographical practices manifest in non-printed texts, such as letters?


3. Non-printed texts and manuscripts

Happily, I am in a perfect position to ask this question, being part of the team who have compiled the Corpus of Early English Correspondence (CEEC). The CEEC is a corpus of English personal letters, spanning 1400-1800 and presently containing about 12,000 letters (5.2m words). It was designed for historical sociolinguistics – to apply modern sociolinguistic methods to historical texts.

Of course, there’s a caveat: CEEC is based on printed editions of letters. “Hang on”, you might say, “is a corpus built from such sources linguistically reliable? Shouldn’t the corpus have been compiled from manuscript texts?” Well, yes – but we have been careful not to use editions that modernize the letter texts, nor editions that normalize the texts extensively. For the kinds of linguistic queries that the corpus was designed for, the normalization of features such as u/v variation was deemed acceptable. And we have always been careful to stress that the CEEC is not suitable for studying English orthography.

Anyway, I nonetheless rushed right in to see what the CEEC threw up. Not having fancy tools (like DICER) to reveal the proper extent of variant spellings in the corpus, I used a short list of sample words (euer/ever, ouer/over, aboue/above, vp/up). But the results were underwhelming:

[Figure: CEEC u.v. 17C]

In this figure, the ratio of the old form of u/v-spelling variants was far too low through the whole period – it should have been at least around 80%, if the EEBO data was indicative of English spelling practices overall, rather than just those restricted to printed texts.
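For the curious, the ratio plotted in such a figure is simply the share of old forms among all tokens of the sample word pairs, per decade. Below is a rough sketch of that calculation, under stated assumptions: the function `old_form_share_by_decade`, the `(year, text)` input format and the three toy "letters" are hypothetical and are not the actual CEEC tooling or data. Only the four word pairs come from my sample list.

```python
import re
from collections import defaultdict

# (old form, new form) pairs from the sample word list in the text
PAIRS = [("euer", "ever"), ("ouer", "over"), ("aboue", "above"), ("vp", "up")]


def old_form_share_by_decade(letters, pairs=PAIRS):
    """Return {decade: old-form tokens / all tokens of the sample pairs}."""
    old_forms = {old for old, _ in pairs}
    new_forms = {new for _, new in pairs}
    counts = defaultdict(lambda: [0, 0])  # decade -> [old tokens, all pair tokens]
    for year, text in letters:
        decade = year // 10 * 10
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in old_forms:
                counts[decade][0] += 1
                counts[decade][1] += 1
            elif token in new_forms:
                counts[decade][1] += 1
    return {d: old / total for d, (old, total) in sorted(counts.items()) if total}


# Toy letters, invented for this example
letters = [
    (1605, "I haue euer loued you aboue all"),
    (1608, "ever yours"),
    (1652, "above all, ever yours; take it up"),
]
result = old_form_share_by_decade(letters)
print(result)
```

Note that a fixed word list like this only samples the variation: it says nothing about words that are not on the list, which is one reason why a tool that finds variant spellings automatically would give a fuller picture.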

In order to have better data, I spent some time extracting a subcorpus from the CEEC† consisting of texts only from editions of 17th-century letters in which I could find u/v and i/j-variation. This time, the results were more interesting:

[Figure: CEEC_part u.v. 17C]

Although the ratio of old spelling variants is still much lower than I had expected, in this figure there is a sharp decline from the 1640s on – which would be in accordance with a prescribed change. (For example, if all schoolchildren are taught to spell according to certain rules, it takes a while for the older generations of writers to die out (or change their spelling habits). Similarly, it makes sense that the influence of a standard orthography in printed texts would be reflected in manuscript texts with a slight time lag.)

Yet I remained unhappy with this data. In EEBO, the shift is from nearly 100% old form to 100% new form. Clearly the texts of the editions used for CEEC were normalized more than I had thought. Even given that this was a quick pilot study, the discrepancy was simply too large to accept as a difference between orthographical practices of manuscript and print.

I had one last trick up my sleeve: I did have a fairly good-sized corpus of letters from the first decade of the 1600s transcribed from manuscript, which retained original spellings and other orthographical features. It wouldn’t show me change over time, but it would give me a control figure for how much, exactly, letter-writers were using the old forms in their letters.‡

The result can be seen in the figure above – it is the red X, marking a whopping 87.9% old forms. Finally, something resembling the situation in EEBO.

There was a fair bit of variation between different words in the manuscript sources, and in some cases the new form was dominant:

               euer/ever   adu*/adv*   haue/have
old spelling          35          90        1017
new spelling          28         191          20
% old                56%         32%         98%

The greatest discrepancy between the manuscript sources and CEEC (namely the second extracted subcorpus) could be seen in the fact that in the manuscripts, words beginning /un-/ were spelled with a <v> 99.6% of the time (of 987 tokens), whereas in CEEC, the <v>-form occurred only 31% of the time (of 295 tokens). Even editions which claim to retain original spellings clearly cannot be taken at face value.
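As a quick check of the arithmetic, the percentages in the table and the overall 87.9% figure follow directly from the token counts. The sketch below only redoes the division: the counts are the ones given in the table and in footnote ‡, while the helper name `percent_old` is my own invention for the example.

```python
def percent_old(old, new):
    """Share of old-form tokens among all tokens of a variant pair, in percent."""
    return 100 * old / (old + new)


# (old, new) token counts from the table above
table = {
    "euer/ever": (35, 28),
    "adu*/adv*": (90, 191),
    "haue/have": (1017, 20),
}
for pair, (old, new) in table.items():
    print(f"{pair}: {percent_old(old, new):.0f}% old")

# Overall manuscript figure: 2,385 old forms out of 2,712 tokens (footnote ‡)
overall = percent_old(2385, 2712 - 2385)
print(f"overall: {overall:.1f}% old")  # overall: 87.9% old
```

The raw counts behind the 99.6% and 31% figures for un-words are not given here, so those two cannot be recomputed in the same way.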

4. Summing up

So what can we say about that dramatic end to spelling variation in the 1630s seen in the figures from the EEBO-TCP Ngram Browser? Actually, not much.

1. It would appear that in the EEBO corpus, u/v variation became standardized between 1620 and 1640. However, without a comprehensive survey even this conclusion may be wrong – cf. for instance i/j variation in the proper name James, where it takes longer for the <i>-form to start declining, nor is it gone by the end of the century (this might have to do with capitalisation):

[Figure: iames / james]

2. In manuscript texts, it looks like the spelling standardization process occurred 20 or more years after it took place in print. But without a broader survey, even this estimate may well be wrong.

3. The EEBO Spelling Browser is awesome!

I do remain curious about what happened in the 1620s & 30s, particularly about whether the standardization of spelling was something more than a development in printing house practices. But I think I’ve done my share of midnight rabbit chasing for the moment.


* Students of Early Modern English beware: this is not true! There are methods in the apparent madness, although the rules may be subtle, and they do vary between writers.

† My first search of CEEC material was of c. 5,000 letters (2.2m words), finding 6,657 tokens of which 700 were old spellings. My second dataset consisted of c. 1,900 letters (just under 800k words), and 3,613 tokens of which 887 were old forms (types: euer/ever, ouer/over, aboue/above, vp/up, vs/us).

‡ This manuscript-based corpus contained about 200 letters (130k words). I expanded my sample word list (types: euer/ever, ouer/over, aboue/above, haue/have, giue/give, vp/up, vs/us, adu*/adv*, vn*/un*), extracting 2,712 tokens – of which 2,385 were old forms.

How should you cite a book viewed in EEBO?

Earlier today, there was a discussion on Twitter on citing Early Modern English books seen on EEBO. But 140 characters is not enough to get my view across, so here ’tis instead.

The question: how should you cite a book viewed on EEBO in your bibliography?

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

(What’s more, digital scholarship is not yet getting the credit it deserves – and as a creator of digital resources myself, I feel quite strongly that this needs to change.)

Anyway; so how should you cite a work you’ve read in EEBO, then?

This is what the EEBO FAQ says (edited slightly; bold emphasis mine):

When citing material from EEBO, it is helpful to give the publication details of the original print source as well as those of the electronic version. You can view the original publication details of works in EEBO by clicking on the Full Record icon that appears on the Search Results, Document Image and Full Text page views, as well as on the list of Author’s Works.

Joseph Gibaldi’s MLA Handbook for Writers of Research Papers, 7th ed. (New York: The Modern Language Association of America, 2009), deals with citations of online sources in section 5.6, pp.181-93. For works on the web with print publication data, the MLA Handbook suggests that details of the print publication should be followed by (i) the title of the database or web site, (ii) the medium of publication consulted (i.e. ‘Web’), and (iii) the date of access (see 5.6.2.c, pp. 187-8).

… When including URLs in EEBO citations, use the blue Durable URL button that appears on each Document Image and Full Record display to generate a persistent URL for the particular page or record that you are referencing. It is not advisable to copy and paste URLs from the address bar of your browser as these will not be persistent.

Here is an example based on these guidelines:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Early English Books Online. Web. 13 May 2003. <>.

If you are citing one of the keyed texts produced by the Text Creation Partnership (TCP), the following format is recommended:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Text Creation Partnership digital edition. Early English Books Online. Web. 13 October 2010. <>.

Here’s why I think this is a ridiculous way to cite a book viewed on EEBO:

  1. Outrageous URL. Bibliographies should be readable by humans: the above URL is illegible. Further, while the URL may indeed be persistent, no-one outside the University of Helsinki network can check the validity of this particular URL. And to quote Peter Shillingsburg on giving web addresses in your references, “All these sites are more reliably found by a web search engine than by URLs mouldering in a footnote”. If you wanted to find this resource, you’d use a web search engine and look for “Spenser Faerie Queen EEBO”. Or go directly to EEBO and search there – in any case, you wouldn’t ever use this URL.
  2. Redundant information. Both “Early English Books Online” and “Web”? Don’t be silly.
  3. Access date. If the digital resource you are accessing is stable, there’s no need for this. If it’s a newspaper or a blog, dating is necessary (especially if the contents of the target are likely to change). In the case of resources such as the Oxford English Dictionary – which, though largely stable, undergoes constant updates – each article (headword entry) is marked with which edition of the dictionary it belongs to, which information is enough (and which explains notations like OED2 and OED3, for 2nd and 3rd ed. entries, respectively).

Instead, I suggest and recommend a citation format something like the following:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. EEBO. Huntington Library.

With a separate entry in your bibliography for EEBO:

  • EEBO = Early English Books Online. Chadwyck-Healey. <>.

And if you’ve used the TCP version, add “-TCP” to the book reference, and include a separate entry for the Text Creation Partnership (EEBO-TCP).

This makes for much shorter entries in your bibliography, and clears away pages of redundant clutter which doesn’t tell the reader anything.
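If you keep your references in some structured form, the suggested format is easy to generate. Here is a small sketch of how that might look: the function `cite_eebo` and its parameters are hypothetical names made up for this example, not part of any reference manager or of EEBO itself.

```python
def cite_eebo(author, title, place, year, library, tcp=False):
    """Format a book viewed on EEBO in the short style suggested above.

    Set `tcp=True` if the TCP keyed text was used rather than the page images.
    """
    source = "EEBO-TCP" if tcp else "EEBO"
    return f"{author}. {title}. {place}, {year}. {source}. {library}."


entry = cite_eebo(
    author="Spenser, Edmund",
    title="The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues",
    place="London",
    year=1590,
    library="Huntington Library",
)
print(entry)
```

The separate bibliography entries for EEBO and EEBO-TCP themselves would still be written out once, by hand.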

Why cite the source library?

Book historians will tell you – at some length – that there is no such thing as an edition of a hand-printed book. No two books printed by hand are exactly identical (in the way that modern printed books are identical) – due to misprints and the like, but also because for instance the paper they are printed on will be different from one codex to another (since a printer’s paper stock came from many different paper mills). So two copies of an Early Modern book (the same work, the same ‘edition’) will always differ from each other – sometimes in significant ways.

For this reason, really we should cite books-as-artefacts rather than books-as-works. Happily, EEBO gives the source library of each book, and including that information is straightforward and simple enough.

Problems and questions – can you not cite EEBO?

Some of the books on EEBO are available as images digitized from different microfilm surrogates of the source book. That is, there is more than one microfilm of the same book. Technically, these surrogate images are different artefacts and we should really reference the microfilm too…  I see that this could be a problem, but have not come across an issue where citing the microfilm would have been relevant to the work I was doing.

Q: Which brings us to another important point: if you are only interested in the work, is it really necessary to cite the format, never mind the artefact?

A: Well, yes, for the reasons outlined above – and simply because it is good scholarly practice.

Q: What if you only use EEBO to double-check a page reference or the correct quotation of something you’d made a note of when you viewed (a different copy of) the work in a library?

A: Ah. Well, if you are feeling conscientious, maybe make a note that you’ve viewed the work in EEBO as well as a physical copy – say, use parentheses: “(EEBO. Huntington Library.)”.

Incidentally, since Early Modern books-as-artefacts differ from each other, technically we should always state in the bibliography which copy of the work we have seen. But I’m not sure anyone is quite that diligent – book historians perhaps excepted – and I can’t be bothered to check right now.

Q: Argh. Look, can’t we just go back to citing just the work and not bother with all this?

A: No. Sorry.

However, I think we’ve drifted a bit far from our departure point.

All this serves to illustrate how citing Early Modern books – be it as physical copies, printed editions or facsimiles, or digital surrogates – is no simple matter. (And we haven’t discussed whether good practice should also include giving the ESTC number in order to identify the work…)  So no wonder no standard practice has emerged on how to cite a work seen on EEBO.

Yet in sum, if you consult books on EEBO, I strongly urge you to give credit to EEBO in your bibliography. 


ETA 27.2.2014 8am:

Another argument for making sure to cite EEBO is the rather huge matter of what, exactly, EEBO is, and how that affects scholarship. In the words of others:

Daniel Powell notes that:

[I]t seems important to realize that EEBO is quite prone to error, loss, and confusion–especially since it’s based on microfilm photographed in the 1930s-40s based on lists compiled in some cases the 1880s.

And Jacqueline Wernimont adds:

EEBO isn’t a catalogue of early modern books – it’s a catalogue of copies. More precisely, it is a repository of digital images of microfilms of single copies of books, and, if your institution subscribes to the Text Creation Partnership (TCP) phases one and/or two, text files that are outsourced transcriptions of microfilm images of single texts.

These points are particularly relevant if you treat EEBO as a library of early modern English works, but they apply equally when you access one or two books to check a reference. As Sarah Werner (among others) has shown us, digital facsimiles of (old) microfilms of early books can miss a lot of details that are clearly visible when viewing the physical books (like coloured ink). While in many cases the scans in EEBO are perfectly serviceable surrogates of the original printed book – black text on white paper tends to capture well in facsimile – the exceptions drive home the point that accessing a book as microfilm images is not the same as looking at a physical copy of the book.

This is not to say that all surrogates, and especially microfilms, are bad as such. In many cases it is the copy that survives whereas the original has been lost. And I have come across cases where the microfilm retains information that has been lost when the manuscripts have been cleaned by conservators and archivists some time after being microfilmed. (Pro tip for meticulous scholars: have a look at all the surrogates, even if you don’t need to!) Also, modern digital imaging is enabling us to read palimpsests and other messy texts with greater ease than before (or indeed at all).

In essence, then, you should make sure to cite EEBO when you use it – not only because of things you may miss due to problems with the images in EEBO, but also because digital resources enable us to do things which are simply impossible or would take forever when using physical copies.


Ok this was a long rant. But I hope this might be of use to someone!

The Permissive Digital Archive

Samuli Kaislaniemi (University of Helsinki)

[This is the paper I gave at The Permissive Archive conference at UCL in London on 9 November 2012. This version includes sections that I skipped when giving the talk – these are indented in the text below. My apologies to those whose images I cribbed: I have linked to my sources, but will remove any and all borrowed images if asked.]

Let me start by saying how happy I am to be here. I don’t think I am the only one at this conference whose life has been positively changed by CELL. And I can’t think of any other academic institution that manages to host conferences that feel like parties!

0. Introduction

The digitisation revolution – for it is a revolution – has changed the way we do historical research. This applies equally to archaeologists and historical linguists, literary scholars and historians: anyone working on the past cannot but be affected by new digital tools and resources. They bring their own share of new challenges – many of which turn out to be old challenges. And they also promise – or seem to promise – to deliver new and exciting results.

I. Terminology: What is a digital archive?

What is a digital archive? The previous two presentations both talked about digital archives, but the term was not defined – so there seems to be a general understanding of what we mean by this term. Kenneth Price[1] has tried to tease out the nuances between different terms used for essentially similar digital resources, but discovered that distinctions are blurred. An Electronic Edition, according to Price, can mean almost anything. They certainly are not restricted to being digital versions of print editions. A digital project, on the other hand, is even more amorphous – but the word “project” has a sense of time, in that projects have a beginning and an end. Projects are either unfinished, or finished. In comparison, a database is usable from the moment it is set up. The term “database”, however, carries connotations of a technical nature – we think of relational databases – but when it is used as a word to describe a digital historical resource, it should be taken metaphorically. “In a digital environment”, says Price, “archive has gradually come to mean a purposeful collection of surrogates.” This is exactly what is more adequately implied by his last term, thematic research collection – and arguably, most digital resources are exactly this. But it doesn’t exactly roll off the tip of your tongue..

I’m afraid a discussion of what is an archive did not fit into this paper in the end, but to give you an idea, here is what archivist Kate Theimer[2] had to say about digital “archives”..

In other words, a digital “archive” is not an archive, but a collection. In contrast, here is Price’s comment again:

I think the use of the word archive is justifiable, since for the scholar, a repository is a repository: the details may differ from place to place, but any place you go to for access to original sources is, in essence, an archive.

Given this loose definition, “digital archives” include not only large-scale resources such as EEBO and State Papers Online, but also smaller resources such as the digital editions made here at CELL. And more importantly, I think one’s own personal research collection can be viewed as an archive. I work on archival materials, and my primary tool – after this laptop – is a digital camera. I have compiled a fairly large digital collection, having photographed almost a thousand manuscripts. These will never get published as a collection, of course, but they do form, in essence, my primary archive, which contains surrogates of all the archival materials that I (think I) need.

What can be found in a digital archive? Digitised versions of original sources, of course, as well as metadata and all the other things Jenny Bann mentioned in her paper.

II. Digital dualism

We do not need to be constantly reminded that digitised books and manuscripts are not the same thing as looking at the original, material sources. However, this division into physical and electronic is not always useful, or even accurate.

Nathan Jurgenson[3] has coined the term digital dualism to refer to the false dichotomy between digital and physical worlds. (He actually differentiates between four “ideal” types of digital dualism, which you can see on the slide here – but which I don’t have time to go into.) Digital dualists are those who “believe that the digital world is ‘virtual’ and the physical world is ‘real’”. This is of course a familiar refrain to all of us, included in comments that disparage online communities in general, and the social web in particular. Facebook “is not real”, they say. But Jurgenson criticises the idea that time and energy spent in the digital world subtracts from the physical – he quotes Luciano Floridi: “we are probably the last generation to experience a clear difference between offline and online”. The digital and physical worlds may be ontologically separate, but they are both “real” in the sense of being authentic. That they have very different properties is of course true; but we live in both, and the two worlds interact. Reality, writes Jurgenson, “is always some simultaneous combination of materiality and the many different types of information, digital included.”

Jurgenson notes that “for the vast majority of writers, the relationship between the physical and digital looks like a big conceptual mess”. To remedy the situation, he provides a model of four ideal types of dualism, with “Strong Digital Dualism” at one end – which states that the physical and digital are different realities and do not interact – and “Strong Augmented Reality” at the other, which states that the realms are part of one single reality and have the same properties. Jurgenson himself takes a milder view, that of “Mild Augmented Reality” – same reality, different properties, interaction.

Lorna Hughes[4] has noted that digital tools and methodologies can well reveal more than traditional approaches: working “with a digital object (a surrogate created from a primary source that has been subject to a process of digitization, or data that were born digital) enables us to recover and challenge the ways in which our senses of time and place are historically and archaeologically understood, something that cannot be effectively communicated through traditional media.”

The usual “argument [is] that digital surrogates distance the scholar from the original sources. They do not. They give the scholar far greater control over the primary evidence, and therefore allow a previously unimaginable empowerment and democratization of source materials”. One great example of studying materiality with digital tools is Kathryn Rudy’s study of “dirty books” – using a densitometer to measure finger grease on pages of late medieval books of hours, revealing the reading habits of their readers, each unique and different from the others. And then there is multi-spectral analysis of palimpsests in order to read the erased text.

Archimedes Codex palimpsest, digitised by the Walters Art Museum

In the future, should we strive for haptic digital representations of manuscripts? Do we want to be able to feel the paper or parchment of a manuscript when viewing it on an iPad? I believe Alison Wiggins made a comment at the recent Digital Humanities Congress at Sheffield to the effect of, it is more useful for the scholar to know what kind of paper is used in a manuscript, than to have the feel of the paper recreated digitally. So perhaps haptic encoding would be more of a Turning-the-Pages –type show-off feature, than something that scholars would find useful. But I digress.

Arguably, then, the materiality of our sources does not get lost in the remediation from physical to digital format. But in any case, we are far more familiar with the visual and textual aspects of digital resources.

III. How using digital archives has changed the way we work and think

The first thing to note about digital archives is that they can be huge. SPOL contains digital images of some 2.2 million manuscripts. As they span 200 years, this comes to, on average, some 11,000 manuscripts per year. EEBO, while significantly smaller, now has 15 or 20 thousand books available as full text. And the thing about full text is that you can conduct word-searches on it.

Tim Hitchcock[5] has noted that EEBO, ECCO, and other similar resources “have in ten years essentially made redundant 300 years of carefully structured and controlled systems for the categorization and retrieval of information. In the process these developments have also had a profound impact on the way … scholars go about doing research. … it is now possible to perform keyword searches on billions of words of printed text – both literary and historical.”

But what is more, scholars “are expected to search across a large number of electronic sources” – but the process strips them of the opportunity to get to understand the context from which individual elements of information come. (The problem may be also seen to be imposed upon them: scholars – especially students – need to look at “everything” in order not to be considered lazy or neglectful).

And keyword searches make new findings very easy indeed.

Here’s one I did earlier: I looked up the word archive in the Oxford English Dictionary. Then I did a simple keyword search in EEBO, and managed to find an instance of usage of the word 70 years before the first instance recorded by the OED.

(..This is not as amazing as it may seem: in fact, antedating the OED is very easy! But that is what I just showed you.)

But less superficially – to quote Tim Hitchcock[6] again: Keyword searching of printed text “radically transforms the nature of what historians do … in two ways. First, it fundamentally undermines several versions of our claim to social authority and authenticity as interpreters of the past. … If historians speak for the archives, their role is largely finished, as the material they contain is newly liberated and endlessly replicated.” … “Second, the development of searchable electronic archives challenges historians to re-examine the broad meta-narratives which have developed to explain social change. If historians no longer ‘ventriloquize’ on behalf of the archival clerk, then they are free to rethink the nature of social change.” That is to say, if publishing archival findings becomes unnecessary since “everything is accessible online”, then we are free to try to say something bigger.

That, in any case, is the theory: but in practice we are burdened by the curse of Convenience.

Peter Shillingsburg[7] recently wrote: “I was once told that the likelihood that a scholar or student will check the accuracy of a supposed fact is in inverse proportion to the distance that has to be travelled to do the checking. If it can be checked without getting up, high likelihood; across the room, probably but maybe not; out the door across the campus to the library, only if highly motivated. Why? Convenience.”

We are all guilty of this convenience. We say that physical books are better than digital, but we are increasingly likely to prefer online sources.

The constant refrain is that “it’s so much easier to work with whatever is online, and it means you don’t have to travel to see things”.[8] This is particularly true of younger generations, who may only have ever encountered early modern books in EEBO. So we should not be surprised when “[t]hey stay at home and expect archives to work like Google”.[9] And we are also biased towards convenience in using these online sources – if something doesn’t work, we will not do it. We can’t be bothered to learn to use features we don’t know exist. So quite often we end up using EEBO as an online repository of books, without even making full use of its search capabilities.

However, convenience means that we are limited by these convenient sources: our research questions end up being constrained by the digital sources – and by what you can search for in them! Keyword searching, however, falls on its face in front of Early Modern English spelling variation. And don’t get me started on the reliability and accuracy of the transcriptions in EEBO!
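To make the spelling problem concrete, here is a minimal sketch in Python of how a keyword search might be widened to catch a few period spellings. The substitution table and the function names are my own illustrative assumptions, not anything EEBO provides, and real Early Modern variation is far messier than this.

```python
import re

# Hypothetical, deliberately tiny table of interchangeable letters
# (u/v and i/j/y); real spelling variation needs far more than this.
INTERCHANGEABLE = {
    "u": "[uv]", "v": "[uv]",
    "i": "[ijy]", "j": "[ij]", "y": "[iy]",
}

def variant_pattern(word):
    """Build a regex that matches some period spellings of a modern word."""
    body = "".join(INTERCHANGEABLE.get(ch, re.escape(ch)) for ch in word.lower())
    # Treat a final -e as optional, whether or not the modern word has one.
    if body.endswith("e"):
        body = body[:-1] + "e?"
    else:
        body += "e?"
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def find_variants(word, text):
    """Return every spelling in text that the widened pattern matches."""
    return variant_pattern(word).findall(text)

# Invented sample sentence, for illustration only.
sample = "in the archiue of the citie, an archyve; the loue of an archive"
matches = find_variants("archive", sample)
```

A plain search for “archive” would find only the last of the three spellings in the sample; the widened pattern finds all three, at the cost of more false positives on a real corpus.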

But there is a more serious problem with our convenient sources. Last week, at the meeting of the Consortium of European Research Libraries at the British Library, Tim Hitchcock[10] gave what he described on his blog as “a five minute rant”, in which he noted that most digitisation projects – such as EEBO, ECCO, Old Bailey Online, but also the papers of Darwin, Newton, and others – are certainly transformative, but ironically consist of the Western canon: texts written by the dead, white, male elite. So, while digitisation projects have produced masses of data – quite enough for sophisticated data-mining experiments – the problem is that this data is skewed.

Of course, the counter-argument is that in the humanities we are trained to be aware of the limitations of our sources. But we are also pressed for time and money, and going for the low-hanging fruit is only natural: we are designed for convenience. And in the process we often “forget” to approach our digital sources critically.

And when scholars and others from outside the humanities start to mine this data, for instance by using tools such as Google Ngrams, the results they produce are doubly skewed: first by a poor understanding of the data, and secondly by the limitations of the data itself. (This results in cases like ‘mining’ Google Ngrams for evidence of the history and development of English[11] – but in fact the GBooks metadata that the Ngrams tool uses is atrocious, with modern editions frequently mis-tagged as historical texts. Thus the results presented in the Ngram viewer for the “1540s” contain, for instance, 3-grams (frequently occurring strings of 3 words) such as “an edition of” and “in the Bodleian” – which most certainly do not occur in texts from the 1540s.)
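For readers unfamiliar with the term, a 3-gram count is simple to produce. The following is a minimal Python sketch of counting word 3-grams in a text; the tokenisation rule and the sample sentence are my own assumptions for illustration, and this is not how the Ngram viewer itself is implemented.

```python
from collections import Counter
import re

def trigram_counts(text):
    """Count every run of three consecutive words in text."""
    # Crude tokenisation: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:], words[2:]))

# Invented sample of the kind of editorial prose found in modern editions.
sample = "An edition of the text; an edition of the letters in the Bodleian."
counts = trigram_counts(sample)
top = counts.most_common(1)[0][1]  # frequency of the most common 3-gram
```

The counting is trivial; the point is that the counts are only as good as the dates attached to the texts they come from.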

This is familiar to us from the reporting of experiments in newspapers – all too often in the case of a social psychology experiment, where what has happened is that the researchers have only taken what is known as a “convenience sample” – ie. asked their students. This is not necessarily good or representative, but it sure is convenient! All too often the subjects of study in psychological tests are WEIRD – Western, Educated, Industrialized, Rich and Democratic.[12] In biology, the same phenomenon is known as “taxonomic bias” – it is easier to decide to do research on big, cuddly mammals that are easy to find, than small beetles in the rainforest canopy. And in the case of biology, it is also, unfortunately, easier to get funding to do research on animals that seem more “important” to the layman.

(Another problematic issue relating to digital resources is that while they are used increasingly by scholars, they do not receive anything like the number of citations they should. Scholars will use EEBO to conduct their study, but then cite the original books – showing a preference for “the real thing” (in spite of their behaviour!).)

IV. The promissory nature of digital humanities and the permissive digital archive

I will wrap up my huge topic with a comment on the promissory nature of digital humanities, and the permissive nature of the digital archive.

Digital humanities is not a new discipline, but there remains a sense of newness and urgency. You might even call it millennialism – the revolution or paradigm shift is said to be “just around the corner”! But I would like to argue that in fact, we are there already. It is just a slow revolution, a revolution in small steps. When I started my studies, early modern English books could only be consulted in specialist collections, or as printed facsimiles. Students today have probably never even seen a printed facsimile – for them, the digital versions on EEBO are “Early Modern English books”.

Digital resources like EEBO are promissive in the sense that their scale and nature theoretically allow for entirely new research questions to be asked, thus paving the way for the promise of new and exciting results. The proliferation of digital resources and tools reflects this – there is a sense that if only we build enough of these things, we will figure out the meaning of it all.

This view has its critics. But as Stephen Ramsay has pointed out, “I can now search for the word “house” (maybe “domus”) in every work ever produced in Europe during the entire period in question (in seconds). To suggest that this is just the same old thing with new tools, or that scholarship based on corpora of a size unimaginable to any previous generation in history is just “a fascination with gadgets,” is to miss both the epochal nature of what’s afoot, and the ways in which technology and discourse are intertwined”.[13]

The most striking feature of the digital archive in terms of how it can be permissive, is the way in which these archives can be connected to each other, using and reusing data, adding user-created content, and functioning like a database as well as like an edition, thanks to sophisticated digital analytical tools. There are already projects that have some or all of these features – most of them are relatively small-scale, but that does not detract from their worth. I have to conclude by saying how sorry I am that I had not the time to show you some examples! Luckily the previous two papers gave you some excellent examples.

Thank you very much.



Postscript 13.11.2012

This paper was, in part, about the dangers of using digital resources uncritically. At the same time, I tried to look at some of the ways in which the existence of these resources has affected our research habits. But the following day, thinking over all the excellent papers presented at the conference, and conversations with people during the day, I realized that in fact, I was not convinced that digital resources presented a serious problem, at least to this community of scholars. To be sure, almost everyone uses resources like EEBO – and many participate in the creation of other digitised or digital archives – but everyone makes use of them while being very conscious of their failings in comparison to the physical sources. Everyone is also aware of why we use them: because they greatly facilitate research (making it easier to do some ‘old’ kinds of research, and making it possible to look at new things); and because they are convenient. But convenience is not a bad thing when one has a good understanding of the compromises involved in creating the convenience. As long as we teach this to our students – which we demonstrably are indeed doing – the existence of these resources and tools is nothing less than a blessing.

I think, however, that we could all be more diligent in citing the digital sources we use – not only for scholarly integrity, but also in order to help raise the standing of and appreciation for digital resources. Those of us who create such resources well know how little credit we receive for our tasks, a matter particularly painful considering our output is linked to funding.


[1] Kenneth Price, “Edition, Project, Database, Archive, Thematic Research Collection: What’s in a Name?”. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009.

[2] Kate Theimer, “Archives in Context and as Context”. Journal of Digital Humanities Vol. 1 no. 2, 2012.

[3] Nathan Jurgenson coined the term in “Digital dualism versus augmented reality”, 24 Feb. 2011, on the Cyborgology blog on the Society Pages website. The above discussion is drawn from “How to kill digital dualism without erasing differences” of 16 Sep. 2012, and “Strong and mild digital dualism”, 29 Oct. 2012, on the same blog.

[4] Lorna Hughes, “Conclusion: Virtual Representation of the Past – New Research Methods, Tools and Communities of Practice”, p. 192. In The Virtual Representation of the Past, ed. by Mark Greengrass and Lorna Hughes. Ashgate, 2007.

[5] Tim Hitchcock, “Digital Searching and the Re-formulation of Historical Knowledge”, pp. 84-85. In Virtual Representation of the Past.

[6] Hitchcock, ibid. p. 89.

[7] Peter Shillingsburg, “How Literary Works Exist: Convenient Scholarly Editions”, paragraph 25. DHQ: Digital Humanities Quarterly Vol. 3 no. 3, 2009.

[8] Emma Huber, “Using digitised text collections in research and learning”, talk given at the JISC-funded workshop “Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text”, Bath on 24 Sep. 2009.

[9] Brooks, Stephen. (@Stephen_Brooks_). “@RuthNRoberts @UkNatArchives #digitaltrail they stay at home and expect archives to work like Google.” 30 Aug 2012, 2:21 PM. Tweet. Part of the #digitaltrail discussion hosted by TNA on 30 Aug. 2012, Twitter conversation archived at

[10] Tim Hitchcock, “A Five Minute Rant for the Consortium of European Research Libraries” (given on 31.10.2012 at the British Library), 29 Oct. 2012, Historyonics blog.

[11] The following example is from John Lavagnino, “Scholarship in the EEBO-TCP Age”, talk given at the conference Revolutionizing Early Modern Studies? The Early English Books Online Text Creation Partnership in 2012, Oxford, 17 September 2012.

[12] Samuel Arbesman, “Big data: Mind the gaps”. IDEAS column in The Boston Globe, 30 Sep. 2012.

[13] From Patrik Svensson, “Envisioning the Digital Humanities”, DHQ: Digital Humanities Quarterly 6.1, 2012.

Rant about code (“MS Office uses XML”)

The new .docx etc formats of the newer versions of Microsoft Office are done in XML. Hence the -x in the extension. The problem with this, however, is something we all know: all MS programs are bloated pieces of shit. Those of you who occasionally fiddle with HTML will probably have experimented with the oh-this-is-convenient “save as HTML” function in Word, only to look at the resulting code in stunned admiration of the amount of resulting crap MS has managed to program Office to include.

This is a different version of the same story (ie. a rant):

When I transcribe documents, I do this using plain text editors, and then save the resulting transcriptions as rtfs (ie. “rich text format”, plain text + moderate formatting). Mostly, I use the TextEdit program that comes with Macs; occasionally I have used MS Word and saved as rtfs. The results look alike, but are different underneath the hood – and the fact that gives this away is the size of the text file, as the same document can be either 4KB or 49KB, depending on whether I’ve done it in TextEdit or MS Word.

Right, so here’s a clip of the text document in question – as you can see, there is very little formatting to deal with:

Looking at the code of this rtf file (I use a nifty little code editor called Smultron :D ), you can see that there’s not much code in there – as it should be:

But when I save it as rtf in MS Word, the difference is obvious and pronounced. This is the beginning of the resulting document viewed in Smultron:

..and this is the section of the text corresponding to those in images 1 and 2 above.

Check out and compare the stats at the bottom of images 2 and 4. Clearly MS Word is insane. The rtf saved from Word is twelve times the size it need be, and fifteen times the length in characters. The amount of code in the sane version is about 1,000 characters: in the MS Word version, it’s about 44,000 characters. Forty-four thousand characters!!
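If you want to run this kind of comparison without eyeballing an editor's status bar, here is a rough Python sketch that estimates how much of an rtf source is markup rather than text. The regular expression is a crude assumption of mine, not a proper RTF parser: it strips control words, control symbols and braces, but it will count the contents of font and style tables as “text”, so treat the numbers as a ballpark.

```python
import re

# Crude assumption: markup = control words (\par, \fs24, ...),
# control symbols (backslash + one character) and group braces.
RTF_MARKUP = re.compile(r"\\[a-zA-Z]+-?\d* ?|\\.|[{}]")

def rtf_overhead(rtf_source):
    """Return rough character counts for an rtf source string."""
    text = RTF_MARKUP.sub("", rtf_source)
    total = len(rtf_source)
    return {"total": total, "text": len(text), "markup": total - len(text)}

# Tiny invented rtf snippet, for illustration only.
sample = r"{\rtf1\ansi{\b Hello} world\par}"
stats = rtf_overhead(sample)
```

To try it on a real file, read the file into a string and pass that string to rtf_overhead; running it on the TextEdit and the Word version of the same transcription should show the same kind of gap as the editor stats above.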

So what’s my point? I guess this: Know Your Tools. At the very least, learn their failings, weaknesses and limitations.

This is what digital humanities can be

Ok, so this isn’t part of my NaTheWriMo, just something I’d forgotten to post earlier:

One of the coolest – if not the coolest – presentations I saw in England last spring was given by an art historian and medievalist named Kathryn M. Rudy, and entitled “Dirty Books”. Not dirty as in naughty, but dirty as in soiled – it was about the grime left by readers’ hands in the margins of 16th-century prayer books, and what that can tell us. Basically, she used a densitometer to measure how thick the grime was, and the measurements revealed the reading habits of the owner of the prayer book. I thought it was great stuff!

Well, now it’s been published and is available freely on the Journal of Historians of Netherlandish Art website. Well worth a perusal.

There’s something in the hands-on approach in her study that I really like, and the way it brings the past closer – not only through said hand grime, but in revealing that person x only read the optimistic bits, whereas person y was a true zealot and fundamentalist (to use modern parlance). I’m sure her method can be used to good effect on other sources too – alas, however, not on mine. But that’s not really the point either. Rather I’m enthusing about the way her study combines tools from one discipline with the sources of another – modern technology and medieval manuscripts – in order to bring out something which neither discipline could reveal on its own. That’s really something I’m trying to do with my work, too.

NaTheWriMo November 2010
  Day:      2
  Word count:   0
  Transcription word count:   2,400
  Blog entries:   2

Digital Humanities* and “Digital Humanities 2.0”

Back in June, I attended the Digital Humanities 2008 conference. Digital humanities, for those not in the know (although I’m sure the term is hardly opaque), is the ridiculously wide field covering all humanities disciplines which use computers. So it includes everyone from corpus linguists to software engineers interested in solutions for humanists, and from librarians to educators who create computer games in order to teach kids history. And more.

There are a few reasons why I attended DH 2008. The main reason is that my PhD thesis will be a digital edition. To that end I’ve started a project with two fellow PhD students for developing linguistically oriented digital editions; the paper I gave at DH 2008 was for the project, part of our Spring 2008 promo tour. Our project falls into the digital humanities sphere, rather than the field of corpus linguistics where we come from. Next, I like things digitally humanistic – particularly digital resources such as corpora, editions, databases, textbases,(1) and to a lesser degree portals and their kin. Also software and websites and gadgetry (ok so that’s software, but if you’ll allow the distinctions between Word and Tofu, I’ll call the latter gadgets. This term is liable to change.). Finally, this year the DH conference was in Oulu, so I could hardly not attend (for our non-academic readers: yes, since travel money is hard to find, location can be decisive in terms of which conferences to attend!).

It was a great conference, overall. Good papers, nice people, exciting projects, excellent contacts, and everything went smoothly and I may have given my best presentation yet (I really should make notes for my next one, though – although I suspect there is always room for improvement). Those interested can go read the abstracts of the papers and posters given at the conference on the DH 2008 website.

However – and this is why I’m writing this post – I must say I am rather dissatisfied with all too many digital resources and tools. Since I’m working on a digital edition, this is what I am perhaps most familiar with (that, and corpora of course), and so I’ll voice my disapproval at digital editions in particular. But more generally, too; here’s the thing:

digital resources are not digitised print resources (2)

Nonetheless, people keep creating these things as if they were.

I think this is mostly a generational thing. That is, older people have more difficulties in envisioning born-digital tools and resources. The prime examples come from outside academia – kids today have grown up with exponentially more powerful digital tools than previous generations. This has given them infinitely better intuition in working these tools and machines – mainly computers of course, but this includes dvd players, mp3 players, mobile phones, game consoles, gps devices, and so on and so forth: all electronic devices, in other words. And the same holds for interfaces: kids are much more adept than adults in using the software in all of these devices.

None of this is new, none of this is surprising: younger generations are more at home with technological advances.

But my main gripe concerns digital editions, which I call (for now, at least) re-born digital resources. Reborn because the digital editions I am talking about are resources created out of extant documents – manuscripts or printed works. What I am concerned with is the amount of wasted resources – time and money – put into developing obsolete solutions in digital humanities.

What do I mean, specifically?

Well. I think digital resources should take full advantage of the medium. That is, one should not try to re-create an object from another medium as if it was still in that medium. You can come up with your own examples, such as films based on books that got it wrong, or vice versa. Here, I am talking about so-called “digital editions” which, first of all, are nothing more than scanned images of the original document,(3) and second, don’t even try to make the edition user-friendly – i.e., provide it with an intuitive and simple interface which allows easy and fun (yes, fun!) browsing of the images.

As a concrete example, take PicLens/cooliris, which transforms browsing images into an awesome experience. It’s freely available online, and very easy to install on your website. I’m not saying cooliris is the solution, I’m just saying that people have created and released great tools which facilitate working with computers. How about using them? Guys? Anyone? Bueller? Anyone?

Ok yes, so there’s Turning the Pages at BL etc, but come on – while it’s cool, there’s a limit to its usability – I mean for research purposes in particular, but also just general stuff too! ..not that TtP wouldn’t be a cool feature to have, provided you could do other things with it than just, um, turn the pages…

Anyway, my primary point was user interfaces, specifically image browsing. However, this really can be extended to a whole slew of solutions created by all and sundry, basically all of which can be classified under what is commonly known as “Web 2.0”.

See? Old news, really.

This is something I mean to push in academia, in my small way. And when I create something groovy, I shall post a link to it. (Like dipity – you guys seen dipity? Melikes it.)


* Does anyone know a good term for “digital humanities” in Finnish? “Humanistit ja tietokoneet” .. “humanistinen tietojenkäsittelytiede” (??).. nothing I can think of really cuts it.

(1) Text databases. Yes, really!

(2) Unless they are, of course. And in fairness, there are loads of these: e.g. all things available on EEBO; all the digitised books on Google Books and in the Internet Archive; also most things digitised by libraries etc (like those on the Finnish National Library website). But even then, I think just scanning the books and putting simple images online isn’t really enough.

(3) Yes, there are arguments for this kind of edition. I think the only one to carry any real weight is for projects aiming at quantity, rather than quality (of the edition’s interface, that is, not of the images). Even then, though, I find it difficult to allow for any excuses in, say, not ordering the images in the edition, or not renaming the damn things. From experience I can say that while this can be a time-consuming task, actually not doing so is a real pain in the arse later on. This means you, Anglo-American Legal Tradition project!