Datamoaning

At the beginning of this week, I attended the two-day Big Data Approaches to Intellectual and Linguistic History symposium at the Helsinki Collegium for Advanced Studies, University of Helsinki. Since Tuesday, I’ve found myself pondering topics that came up at the symposium. So I thought I would write up my thoughts in order to unload them somewhere and thus hopefully stop thinking about them (I have a chapter to finish, and not on digital humanities stuff), and also in order to articulate my reflections on what was discussed there more clearly than in the jumbled form inside my head. I.e. the usual refrain, ‘I need to hear what I say in order to find out what I think’.

So here goes.

NB this is not a conference report, in that I’m not going to talk about specific presentations given at the symposium. For that, check out the slides from the presentations linked from the conference website, and see also the Storify of the tweets from the event (both including those from the workshop that followed on Wednesday, Helsinki Digital Humanities Day).


 

I’ve been a part of the DH (Digital Humanities) community for about ten years now. I started off working on digital resources – linguistic corpora, digital scholarly editing; I’ve even fiddled with mapping things – but have in recent years not been actively engaged in resource- or tool-creation as such. Yet I use digital and digitised resources on a daily basis: EEBO frequently, the broad palette of resources available on British History Online all the time, and, when I have access to them, State Papers Online and Cecil Papers (Online). (I work on British state papers from around 1600, and am lucky in that much of the material I need has been digitised and put online in one form or another). I also keep an eye on what happens in the DH world: I attend DH-related conferences and seminars and whatnot when I can, subscribe to LLC (Literary & Linguistic Computing, about to be renamed DSH, Digital Scholarship in the Humanities), and hang out with DHers both online (Twitter, mostly) and in real life.

All this goes to say that I feel quite confident about my understanding of DH projects at the macro level. (Details, certainly not: implementation, encoding, programming, etc etc).

Thus, attending a DH symposium on ‘big data’, I expected to hear presentations about things I was already familiar with. And this turned out to be the case: there were descriptions of/results from projects, descriptions of methodologies (explaining to those from other disciplines ‘what is it we do’), and explorations of concepts that keep coming up in DH work.

Don’t get me wrong: I found all the presentations (that I saw) very good, and listening to talks by people in other disciplines does give you new perspectives. Maybe not profound ones, and often you end up thinking/feeling there’s little or no common ground so why do we even bother? But it’s not a completely useless exercise. Yet what I felt to be the take-away points from this symposium were ones I feel keep coming up at DH events that I have attended over the years, and ones that we – meaning the DH community – are well aware of. Such as (by no means a comprehensive list):

1. Issues with the data

  • “Big Data” in the humanities is not very big when compared to Big Data in some other fields
  • We know Big Data is good for quantity, but rubbish for quality
    • We are aware of the importance and value of the nitty-gritty details
  • We know that manual input is a required part of both processing/methodology – in order to fine-tune the automatic parts of the process – and more importantly, for the analysis of the results (Matti Rissanen’s maxim: “research begins where counting ends”)
  • We know that Our data – however Big it is – is never All data (our results are not God’s Truth)
    • We are aware of the limits of the historical record (“known unknowns, unknown unknowns”)

2. Sharing tools and resources

  • We need to develop better tools, cross-disciplinary ones
    • Our research questions may be different, but we are all accessing and querying text
  • We need to develop our tools as modular “building blocks”, ‘good enough’ is good enough
  • We need to share data/sources/databases/corpora/materials – open access; copyright is an issue, but we’re all (painfully) aware of this

Clearly, these are important points that we need to keep in mind, and challenges that we want to address. And repetitio mater studiorum est. So why do I feel that their reiteration on Monday and Tuesday only served to make me grumpier than usual?*

In the pub after Wednesday’s workshop, we talked a little bit about how pessimistically these points tend to be presented. “We can’t (yet) do XYZ”. “We need to understand that our tools and resources are terrible”. …which now reminds me of what I commented in a previous discussion, on Twitter, early this year:

One element in how I feel about the symposium could be the difficulty of cross-disciplinary communication. This, too, is familiar to me seeing as I straddle several disciplines, hanging out with historical linguists on the one hand, historians on th’other, and then DHers too. I once attended a three-day conference convened by linguists where the aim was to bring linguists and historians together. I think only one of the presentations was by a historian… So yeah, we don’t talk – as disciplines, that is: I know many individuals who talk across disciplinary borders. …and, come to think of it, I know a number of scholars who straddle such borders. But perhaps it’s just that at interdisciplinary events there’s a required level of dumbing-down on the part of the presenters on the one hand, and inevitable incomprehension on the part of the audience on the other. Admittedly, it is incredibly difficult to give an interdisciplinary paper.

A final point, perhaps, in these meandering reflections, is of course the wee fact that I don’t, in fact, work on research questions that require Big Data.† (At the moment, anyway). So I’m just not particularly interested in learning how to use computers to tell me something interesting about large amounts of texts – something that it would be impossible to see without using computational power. It’s not that the methodologies, or indeed the results produced, are not fascinating. It’s just that I guess I lack a personal connection to applying them.  ..but then, I suppose this can be filed under the difficulty of interdisciplinary communication! ‘I see what you’re doing but I fail to see how it can help me in what I do’.

Hmm.

So how to conclude? I guess, first of all, kudos to HCAS for putting the symposium together – and, judging from upcoming events, for playing an important part in getting DH in Finland into motion. It’s not as if there’s been nothing previously, and HCAS definitely cannot be credited with ‘starting’ DH activities in Finland in any way – some of us have been doing this for 10 years, some for 30 years or more, and along the way, there have been events which fall under the DH umbrella. But only in the past year or so has DH become established institutionally at the University of Helsinki: we have a professor of DH now, and 4 fully-funded DH-related PhD positions. Perhaps it was the lack of institutional recognition that made previous efforts at organizing DH-related activities here for the most part intermittent and disconnected. But we’ll see how things proceed: certainly many of us are glad to see DH becoming established in Finnish academia as an entity. And judging by the full house at the symposium and the workshop that followed, it would appear that there are many of us in the local scholarly community interested in these topics. The future looks promising.

It should also be said that DH has come a long way from what it was ten years ago. The resources and tools we have today allow us to do amazing things. Just about all of the presentations at the symposium described and discussed projects that use complicated tools to do complex things. I am seriously impressed by what is being done in various fields – and simply, by what can be done today. And there is no denying that there is a Lot of work being done across and between disciplines: DH projects are often multidisciplinary by design, and many are working on and indeed producing tools and resources that can be useful to different disciplines.

Maybe it’s just the season making me cranky. You’ll certainly see me at the next local DH event. Watch this space..

 


 

* ..Maybe it’s just conference fatigue that I’m struggling with? There are only so many conference papers one can listen to attentively, and almost without exception there is never time to do much but scratch the surface and provide but a thin sketch of the material/problem/results/etc. (It’s rather like watching popular history/science documentaries/programs on tv: oh look, here’s Galileo and the heliocentric model again, ooh with pictures of the Vatican and dramatic Hollywood-movie music, for chrissakes). (I mean, yes it’s interesting and cool and all that but oh we’re out of time and have to go to commercials/questions). (So in order to retain my interest there needs to be some seriously new and exciting material/results to show, like those baby snow geese jumping off a 400ft cliff (!!!!) in David Attenborough’s fantastic new documentary Life Story, or all the fantastic multilingual multiscriptal stuff in historical manuscripts that we have only just started to look at in more detail. If it’s yet another documentary about Serengeti lions / paper about epistolary formulae in Early Modern English letters, I’m bound to skip it. I’m willing to be surprised, but this is well-trodden ground).   /rant

† Incidentally, I disagree with the notion that in the Humanities we don’t have Big Data – I would say that this depends on your definition of “big”. While historical text corpora may at best only be some hundreds of millions of words, and this pales in comparison to the petabytes (or whatever) produced by, say, CERN, or Amazon, every minute, I see (historical) textual data as fractal: the closer you look at it, the more detail emerges. Admittedly, a lot of the detail does not usually get encoded in the digitised corpora (say, material and visual aspects of manuscript texts), but there’s more there per byte than in recordings of the flight paths of electrons or customer transactions. Having said this, I’m sure someone can point out how wrong I am! But really, “my data > your data”? I don’t find spitting contests particularly useful in scholarship, any more than in real life.

How should you cite a book viewed in EEBO?

Earlier today, there was a discussion on Twitter on citing Early Modern English books seen on EEBO. But 140 characters is not enough to get my view across, so here ’tis instead.

The question: how should you cite a book viewed on EEBO in your bibliography?

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

(What’s more, digital scholarship is not yet getting the credit it deserves – and as a creator of digital resources myself, I feel quite strongly that this needs to change.)

Anyway; so how should you cite a work you’ve read in EEBO, then?

This is what the EEBO FAQ says (edited slightly; bold emphasis mine):

When citing material from EEBO, it is helpful to give the publication details of the original print source as well as those of the electronic version. You can view the original publication details of works in EEBO by clicking on the Full Record icon that appears on the Search Results, Document Image and Full Text page views, as well as on the list of Author’s Works.

Joseph Gibaldi’s MLA Handbook for Writers of Research Papers, 7th ed. (New York: The Modern Language Association of America, 2009), deals with citations of online sources in section 5.6, pp.181-93. For works on the web with print publication data, the MLA Handbook suggests that details of the print publication should be followed by (i) the title of the database or web site, (ii) the medium of publication consulted (i.e. ‘Web’), and (iii) the date of access (see 5.6.2.c, pp. 187-8).

… When including URLs in EEBO citations, use the blue Durable URL button that appears on each Document Image and Full Record display to generate a persistent URL for the particular page or record that you are referencing. It is not advisable to copy and paste URLs from the address bar of your browser as these will not be persistent.

Here is an example based on these guidelines:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Early English Books Online. Web. 13 May 2003. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

If you are citing one of the keyed texts produced by the Text Creation Partnership (TCP), the following format is recommended:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Text Creation Partnership digital edition. Early English Books Online. Web. 13 October 2010. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

Here’s why I think this is a ridiculous way to cite a book viewed on EEBO:

  1. Outrageous URL. Bibliographies should be readable by humans: the above URL is illegible. Further, while the URL may indeed be persistent, no-one outside the University of Helsinki network can check the validity of this particular URL. And to quote Peter Shillingsburg on giving web addresses in your references, “All these sites are more reliably found by a web search engine than by URLs mouldering in a footnote”. If you wanted to find this resource, you’d use a web search engine and look for “Spenser Faerie Queen EEBO”. Or go directly to EEBO and search there – in any case, you wouldn’t ever use this URL.
  2. Redundant information. Both “Early English Books Online” and “Web”? Don’t be silly.
  3. Access date. If the digital resource you are accessing is stable, there’s no need for this. If it’s a newspaper or a blog, dating is necessary (especially if the contents of the target are likely to change). In the case of resources such as the Oxford English Dictionary – which, though largely stable, undergoes constant updates – each article (headword entry) is marked with which edition of the dictionary it belongs to, which information is enough (and which explains notations like OED2 and OED3, for 2nd and 3rd ed. entries, respectively).

Instead, I suggest and recommend a citation format something like the following:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. EEBO. Huntington Library.

With a separate entry in your bibliography for EEBO:

  • EEBO = Early English Books Online. Chadwyck-Healey. <http://eebo.chadwyck.com/home>.

And if you’ve used the TCP version, add “-TCP” to the book reference, and include a separate entry for the Text Creation Partnership (EEBO-TCP).

This makes for much shorter entries in your bibliography, and clears away pages of redundant clutter which doesn’t tell the reader anything.
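For anyone who keeps their bibliography in a script or a reference manager export, the format suggested above can be sketched in a few lines of Python. This is only a rough illustration of the pattern, not a standard: the names `EarlyBook` and `cite` are made up for the example, and the `tcp` flag simply switches the resource label from “EEBO” to “EEBO-TCP” as described above.

```python
from dataclasses import dataclass


@dataclass
class EarlyBook:
    """One consulted copy of a book (hypothetical structure for illustration)."""
    author: str
    title: str
    place: str
    year: int
    source_library: str   # the library holding the copy that EEBO reproduces
    tcp: bool = False     # True if the keyed TCP text was used


def cite(book: EarlyBook) -> str:
    """Build a short bibliography entry: work, then resource, then source library."""
    resource = "EEBO-TCP" if book.tcp else "EEBO"
    return (
        f"{book.author}. {book.title}. {book.place}, {book.year}. "
        f"{resource}. {book.source_library}."
    )


faerie = EarlyBook(
    author="Spenser, Edmund",
    title="The Faerie Qveene: Disposed into Twelue Books, "
          "Fashioning XII Morall Vertues",
    place="London",
    year=1590,
    source_library="Huntington Library",
)
print(cite(faerie))
```

The separate bibliography entries for EEBO and EEBO-TCP themselves are written once by hand, so the sketch leaves them out.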

Why cite the source library?

Book historians will tell you – at some length – that there is no such thing as an edition of a hand-printed book. No two books printed by hand are exactly identical (in the way that modern printed books are identical) – due to misprints and the like, but also because for instance the paper they are printed on will be different from one codex to another (since a printer’s paper stock came from many different paper mills). So two copies of an Early Modern book (the same work, the same ‘edition’) will always differ from each other – sometimes in significant ways.

For this reason, really we should cite books-as-artefacts rather than books-as-works. Happily, EEBO gives the source library of each book, and including that information is straightforward and simple enough.

Problems and questions – can you not cite EEBO?

Some of the books on EEBO are available as images digitized from different microfilm surrogates of the source book. That is, there is more than one microfilm of the same book. Technically, these surrogate images are different artefacts and we should really reference the microfilm too…  I see that this could be a problem, but have not come across an issue where citing the microfilm would have been relevant to the work I was doing.

Q: Which brings us to another important point: if you are only interested in the work, is it really necessary to cite the format, never mind the artefact?

A: Well, yes, for the reasons outlined above – and simply because it is good scholarly practice.

Q: What if you only use EEBO to double-check a page reference or the correct quotation of something you’d made a note of when you viewed (a different copy of) the work in a library?

A: Ah. Well, if you are feeling conscientious, maybe make a note that you’ve viewed the work in EEBO as well as a physical copy – say, use parentheses: “(EEBO. Huntington Library.)”.

Incidentally, since Early Modern books-as-artefacts differ from each other, technically we should always state in the bibliography which copy of the work we have seen. But I’m not sure anyone is quite that diligent – book historians perhaps excepted – and I can’t be bothered to check right now.

Q: Argh. Look, can’t we just go back to citing only the work and not bother with all this?

A: No. Sorry.

However, I think we’ve drifted a bit far from our departure point.

All this serves to illustrate how citing Early Modern books – be it as physical copies, printed editions or facsimiles, or digital surrogates – is no simple matter. (And we haven’t discussed whether good practice should also include giving the ESTC number in order to identify the work…)  So no wonder no standard practice has emerged on how to cite a work seen on EEBO.

Yet in sum, if you consult books on EEBO, I strongly urge you to give credit to EEBO in your bibliography. 

 

ETA 27.2.2014 8am:

Another argument for why to make sure to cite EEBO is the rather huge matter of what, exactly, is EEBO, and how what it is affects scholarship. In the words of others:

Daniel Powell notes that:

[I]t seems important to realize that EEBO is quite prone to error, loss, and confusion – especially since it’s based on microfilm photographed in the 1930s-40s based on lists compiled in some cases [in] the 1880s.

And Jacqueline Wernimont adds:

EEBO isn’t a catalogue of early modern books – it’s a catalogue of copies. More precisely, it is a repository of digital images of microfilms of single copies of books, and, if your institution subscribes to the Text Creation Partnership (TCP) phases one and/or two, text files that are outsourced transcriptions of microfilm images of single texts.

These points are particularly relevant if you treat EEBO as a library of early modern English works, but they apply equally when you access one or two books to check a reference. As Sarah Werner (among others) has shown us, digital facsimiles of (old) microfilms of early books can miss a lot of details that are clearly visible when viewing the physical books (like coloured ink). While in many cases the scans in EEBO are perfectly serviceable surrogates of the original printed book – black text on white paper tends to capture well in facsimile – the exceptions drive home the point that accessing a book as microfilm images is not the same as looking at a physical copy of the book.

This is not to say that all surrogates, and especially microfilms, are bad as such. In many cases it is the copy that survives whereas the original has been lost. And I have come across cases where the microfilm retains information that has been lost when the manuscripts have been cleaned by conservators and archivists some time after being microfilmed. (Pro tip for meticulous scholars: have a look at all the surrogates, even if you don’t need to!) Also, modern digital imaging is enabling us to read palimpsests and other messy texts with greater ease than before (or indeed at all).

In essence, then, you should make sure to cite EEBO when you use it – not only because of things you may miss due to problems with the images in EEBO, but also because digital resources enable us to do things which are simply impossible or would take forever when using physical copies.

 


Ok this was a long rant. But I hope this might be of use to someone!