Datamoaning

At the beginning of this week, I attended the two-day Big Data Approaches to Intellectual and Linguistic History symposium at the Helsinki Collegium for Advanced Studies, University of Helsinki. Since Tuesday, I’ve found myself pondering topics that came up at the symposium. So I thought I would write up my thoughts in order to unload them somewhere and thus hopefully stop thinking about them (I have a chapter to finish, and not on digital humanities stuff), and also in order to articulate my reflections on what was discussed there more clearly than the jumbled form inside my head. I.e. the usual refrain, ‘I need to hear what I say in order to find out what I think’.

So here goes.

NB this is not a conference report, in that I’m not going to talk about specific presentations given at the symposium. For that, check out the slides from the presentations linked from the conference website, and see also the Storify of the tweets from the event (both including those from the workshop that followed on Wednesday, Helsinki Digital Humanities Day).


 

I’ve been a part of the DH (Digital Humanities) community for about ten years now. I started off working on digital resources – linguistic corpora, digital scholarly editing; I’ve even fiddled with mapping things – but have in recent years not been actively engaged in resource- or tool-creation as such. Yet I use digital and digitised resources on a daily basis: EEBO frequently, the broad palette of resources available on British History Online all the time, and, when I have access to them, State Papers Online and Cecil Papers (Online). (I work on British state papers from around 1600, and am lucky in that much of the material I need has been digitised and put online in one form or another). I also keep an eye on what happens in the DH world: I attend DH-related conferences and seminars and whatnot when I can, subscribe to LLC (Literary & Linguistic Computing, about to be renamed DSH, Digital Scholarship in the Humanities), and hang out with DHers both online (Twitter, mostly) and in real life.

All this goes to say that I feel quite confident about my understanding of DH projects at the macro level. (Details, certainly not: implementation, encoding, programming, etc etc).

Thus, attending a DH symposium on ‘big data’, I expected to hear presentations about things I was already familiar with. And this turned out to be the case: there were descriptions of/results from projects, descriptions of methodologies (explaining to those from other disciplines ‘what is it we do’), and explorations of concepts that keep coming up in DH work.

Don’t get me wrong: I found all the presentations (that I saw) very good, and listening to talks by people in other disciplines does give you new perspectives. Maybe not profound ones, and often you end up thinking/feeling there’s little or no common ground so why do we even bother? But it’s not a completely useless exercise. Yet what I felt to be the take-away points from this symposium were ones I feel keep coming up at DH events that I have attended over the years, and ones that we – meaning the DH community – are well aware of. Such as (by no means a comprehensive list):

1. Issues with the data

  • “Big Data” in the humanities is not very big when compared to Big Data in some other fields
  • We know Big Data is good for quantity, but rubbish for quality
    • We are aware of the importance and value of the nitty-gritty details
  • We know that manual input is a required part of both processing/methodology – in order to fine-tune the automatic parts of the process – and more importantly, for the analysis of the results (Matti Rissanen’s maxim: “research begins where counting ends”)
  • We know that Our data – however Big it is – is never All data (our results are not God’s Truth)
    • We are aware of the limits of the historical record (“known unknowns, unknown unknowns”)

2. Sharing tools and resources

  • We need to develop better tools, cross-disciplinary ones
    • Our research questions may be different, but we are all accessing and querying text
  • We need to develop our tools as modular “building blocks”; ‘good enough’ is good enough
  • We need to share data/sources/databases/corpora/materials – open access; copyright is an issue, but we’re all (painfully) aware of this

Clearly, these are important points that we need to keep in mind, and challenges that we want to address. And repetitio mater studiorum est. So why do I feel that their reiteration on Monday and Tuesday only served to make me grumpier than usual?*

In the pub after Wednesday’s workshop, we talked a little bit about how pessimistically these points tend to be presented. “We can’t (yet) do XYZ”. “We need to understand that our tools and resources are terrible”. …which now reminds me of what I commented in a previous discussion, on Twitter, early this year:

One element in how I feel about the symposium could be the difficulty of cross-disciplinary communication. This, too, is familiar to me seeing as I straddle several disciplines, hanging out with historical linguists on the one hand, historians on th’other, and then DHers too. I once attended a three-day conference convened by linguists where the aim was to bring linguists and historians together. I think only one of the presentations was by a historian… So yeah, we don’t talk – as disciplines, that is: I know many individuals who talk across disciplinary borders. …and, come to think of it, I know a number of scholars who straddle such borders. But perhaps it’s just that at interdisciplinary events there’s a required level of dumbing-down on the part of the presenters on the one hand, and inevitable incomprehension on the part of the audience on the other. Admittedly, it is incredibly difficult to give an interdisciplinary paper.

A final point, perhaps, in these meandering reflections, is of course the wee fact that I don’t, in fact, work on research questions that require Big Data.† (At the moment, anyway). So I’m just not particularly interested in learning how to use computers to tell me something interesting about large amounts of texts – something that it would be impossible to see without using computational power. It’s not that the methodologies, or indeed the results produced, are not fascinating. It’s just that I guess I lack a personal connection to applying them.  ..but then, I suppose this can be filed under the difficulty of interdisciplinary communication! ‘I see what you’re doing but I fail to see how it can help me in what I do’.

Hmm.

So how to conclude? I guess, first of all, kudos to HCAS for putting the symposium together – and, judging from upcoming events, for playing an important part in getting DH in Finland into motion. It’s not as if there’s been nothing previously, and HCAS definitely cannot be credited for ‘starting’ DH activities in Finland in any way – some of us have been doing this for 10 years, some for 30 years or more, and along the way, there have been events which fall under the DH umbrella. But only in the past year or so has DH become established institutionally at the University of Helsinki: we have a professor of DH now, and 4 fully-funded DH-related PhD positions. Perhaps it was the lack of institutional recognition that made previous efforts at organizing DH-related activities here for the most part intermittent and disconnected. But we’ll see how things proceed: certainly many of us are glad to see DH becoming established in Finnish academia as an entity. And judging by the full house at the symposium and the workshop that followed, it would appear that there are many of us in the local scholarly community interested in these topics. The future looks promising.

It should also be said that DH has come a long way from what it was ten years ago. The resources and tools we have today allow us to do amazing things. Just about all of the presentations at the symposium described and discussed projects that use complicated tools to do complex things. I am seriously impressed by what is being done in various fields – and simply, by what can be done today. And there is no denying that there is a Lot of work being done across and between disciplines: DH projects are often multidisciplinary by design, and many are working on and indeed producing tools and resources that can be useful to different disciplines.

Maybe it’s just the season making me cranky. You’ll certainly see me at the next local DH event. Watch this space..

 


 

* ..Maybe it’s just conference fatigue that I’m struggling with? There are only so many conference papers one can listen to attentively, and almost without exception there is never time to do much but scratch the surface and provide but a thin sketch of the material/problem/results/etc. (It’s rather like watching popular history/science documentaries/programs on tv: oh look, here’s Galileo and the heliocentric model again, ooh with pictures of the Vatican and dramatic Hollywood-movie music, for chrissakes). (I mean, yes it’s interesting and cool and all that but oh we’re out of time and have to go to commercials/questions). (So in order to retain my interest there needs to be some seriously new and exciting material/results to show, like those baby snow geese jumping off a 400ft cliff (!!!!) in David Attenborough’s fantastic new documentary Life Story, or all the fantastic multilingual multiscriptal stuff in historical manuscripts that we have only just started to look at in more detail. If it’s yet another documentary about Serengeti lions / paper about epistolary formulae in Early Modern English letters, I’m bound to skip it. I’m willing to be surprised, but this is well-trodden ground).   /rant

† Incidentally, I disagree with the notion that in the Humanities we don’t have Big Data – I would say that this depends on your definition of “big”. While historical text corpora may at best only be some hundreds of millions of words, and this pales in comparison to the petabytes (or whatever) produced by, say, CERN, or Amazon, every minute, I see (historical) textual data as fractal: the closer you look at it, the more detail emerges. Admittedly, a lot of the detail does not usually get encoded in the digitised corpora (say, material and visual aspects of manuscript texts), but there’s more there per byte than in recordings of the flight paths of electrons or customer transactions. Having said this, I’m sure someone can point out how wrong I am! But really, “my data > your data”? I don’t find spitting contests particularly useful in scholarship, any more than in real life.

How should you cite a book viewed in EEBO?

Earlier today, there was a discussion on Twitter on citing Early Modern English books seen on EEBO. But 140 characters is not enough to get my view across, so here ’tis instead.

The question: how should you cite a book viewed on EEBO in your bibliography?

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

(What’s more, digital scholarship is not yet getting the credit it deserves – and as a creator of digital resources myself, I feel quite strongly that this needs to change.)

Anyway; so how should you cite a work you’ve read in EEBO, then?

This is what the EEBO FAQ says (edited slightly; bold emphasis mine):

When citing material from EEBO, it is helpful to give the publication details of the original print source as well as those of the electronic version. You can view the original publication details of works in EEBO by clicking on the Full Record icon that appears on the Search Results, Document Image and Full Text page views, as well as on the list of Author’s Works.

Joseph Gibaldi’s MLA Handbook for Writers of Research Papers, 7th ed. (New York: The Modern Language Association of America, 2009), deals with citations of online sources in section 5.6, pp.181-93. For works on the web with print publication data, the MLA Handbook suggests that details of the print publication should be followed by (i) the title of the database or web site, (ii) the medium of publication consulted (i.e. ‘Web’), and (iii) the date of access (see 5.6.2.c, pp. 187-8).

… When including URLs in EEBO citations, use the blue Durable URL button that appears on each Document Image and Full Record display to generate a persistent URL for the particular page or record that you are referencing. It is not advisable to copy and paste URLs from the address bar of your browser as these will not be persistent.

Here is an example based on these guidelines:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Early English Books Online. Web. 13 May 2003. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

If you are citing one of the keyed texts produced by the Text Creation Partnership (TCP), the following format is recommended:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. Text Creation Partnership digital edition. Early English Books Online. Web. 13 October 2010. <http://gateway.proquest.com.libproxy.helsinki.fi/openurl?ctx_ver=Z39.88-2003&res_id=xri:eebo&rft_val_fmt=&rft_id=xri:eebo:image:29269:1>.

Here’s why I think this is a ridiculous way to cite a book viewed on EEBO:

  1. Outrageous URL. Bibliographies should be readable by humans: the above URL is illegible. Further, while the URL may indeed be persistent, no-one outside the University of Helsinki network can check the validity of this particular URL. And to quote Peter Shillingsburg on giving web addresses in your references, “All these sites are more reliably found by a web search engine than by URLs mouldering in a footnote”. If you wanted to find this resource, you’d use a web search engine and look for “Spenser Faerie Queen EEBO”. Or go directly to EEBO and search there – in any case, you wouldn’t ever use this URL.
  2. Redundant information. Both “Early English Books Online” and “Web”? Don’t be silly.
  3. Access date. If the digital resource you are accessing is stable, there’s no need for this. If it’s a newspaper or a blog, dating is necessary (especially if the contents of the target are likely to change). In the case of resources such as the Oxford English Dictionary – which, though largely stable, undergoes constant updates – each article (headword entry) is marked with which edition of the dictionary it belongs to, which information is enough (and which explains notations like OED2 and OED3, for 2nd and 3rd ed. entries, respectively).

Instead, I suggest and recommend a citation format something like the following:

  • Spenser, Edmund. The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues. London, 1590. EEBO. Huntington Library.

With a separate entry in your bibliography for EEBO:

  • EEBO = Early English Books Online. Chadwyck-Healey. <http://eebo.chadwyck.com/home>.

And if you’ve used the TCP version, add “-TCP” to the book reference, and include a separate entry for the Text Creation Partnership (EEBO-TCP).

This makes for much shorter entries in your bibliography, and clears away pages of redundant clutter which doesn’t tell the reader anything.
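
For those who keep their references in a script or database, the short format suggested above can be sketched as a tiny formatter. This is only an illustration: the function name `cite_eebo` and its parameters are my own invention, not part of EEBO or of any citation tool, and italics for the title are left out.

```python
def cite_eebo(author, title, place, year, library, tcp=False):
    """Format a bibliography entry in the short style suggested above.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # "-TCP" is added when the keyed Text Creation Partnership text was used
    resource = "EEBO-TCP" if tcp else "EEBO"
    return f"{author}. {title}. {place}, {year}. {resource}. {library}."


print(cite_eebo(
    "Spenser, Edmund",
    "The Faerie Qveene: Disposed into Twelue Books, Fashioning XII Morall Vertues",
    "London",
    1590,
    "Huntington Library",
))
```

The separate bibliography entries for EEBO (and EEBO-TCP) would still be written out once by hand.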

Why cite the source library?

Book historians will tell you – at some length – that there is no such thing as an edition of a hand-printed book. No two books printed by hand are exactly identical (in the way that modern printed books are identical) – due to misprints and the like, but also because for instance the paper they are printed on will be different from one codex to another (since a printer’s paper stock came from many different paper mills). So two copies of an Early Modern book (the same work, the same ‘edition’) will always differ from each other – sometimes in significant ways.

For this reason, really we should cite books-as-artefacts rather than books-as-works. Happily, EEBO gives the source library of each book, and including that information is straightforward and simple enough.

Problems and questions – can you not cite EEBO?

Some of the books on EEBO are available as images digitized from different microfilm surrogates of the source book. That is, there is more than one microfilm of the same book. Technically, these surrogate images are different artefacts and we should really reference the microfilm too…  I see that this could be a problem, but have not come across an issue where citing the microfilm would have been relevant to the work I was doing.

Q: Which brings us to another important point: if you are only interested in the work, is it really necessary to cite the format, never mind the artefact?

A: Well, yes, for the reasons outlined above – and simply because it is good scholarly practice.

Q: What if you only use EEBO to double-check a page reference or the correct quotation of something you’d made a note of when you viewed (a different copy of) the work in a library?

A: Ah. Well, if you are feeling conscientious, maybe make a note that you’ve viewed the work in EEBO as well as a physical copy – say, use parentheses: “(EEBO. Huntington Library.)”.

Incidentally, since Early Modern books-as-artefacts differ from each other, technically we should always state in the bibliography which copy of the work we have seen. But I’m not sure anyone is quite that diligent – book historians perhaps excepted – and I can’t be bothered to check right now.

Q: Argh. Look, can’t we just go back to not quoting the work and not bother with all this?

A: No. Sorry.

However, I think we’ve drifted a bit far from our departure point.

All this serves to illustrate how citing Early Modern books – be it as physical copies, printed editions or facsimiles, or digital surrogates – is no simple matter. (And we haven’t discussed whether good practice should also include giving the ESTC number in order to identify the work…)  So no wonder no standard practice has emerged on how to cite a work seen on EEBO.

Yet in sum, if you consult books on EEBO, I strongly urge you to give credit to EEBO in your bibliography. 

 

ETA 27.2.2014 8am:

Another argument for making sure to cite EEBO is the rather huge matter of what, exactly, EEBO is, and how that affects scholarship. In the words of others:

Daniel Powell notes that:

[I]t seems important to realize that EEBO is quite prone to error, loss, and confusion–especially since it’s based on microfilm photographed in the 1930s-40s based on lists compiled in some cases the 1880s.

And Jacqueline Wernimont adds:

EEBO isn’t a catalogue of early modern books – it’s a catalogue of copies. More precisely, it is a repository of digital images of microfilms of single copies of books, and, if your institution subscribes to the Text Creation Partnership (TCP) phases one and/or two, text files that are outsourced transcriptions of microfilm images of single texts.

These points are particularly relevant if you treat EEBO as a library of early modern English works, but they apply equally when you access one or two books to check a reference. As Sarah Werner (among others) has shown us, digital facsimiles of (old) microfilms of early books can miss a lot of details that are clearly visible when viewing the physical books (like coloured ink). While in many cases the scans in EEBO are perfectly serviceable surrogates of the original printed book – black text on white paper tends to capture well in facsimile – the exceptions drive home the point that accessing a book as microfilm images is not the same as looking at a physical copy of the book.

This is not to say that all surrogates, and especially microfilms, are bad as such. In many cases it is the copy that survives whereas the original has been lost. And I have come across cases where the microfilm retains information that has been lost when the manuscripts have been cleaned by conservators and archivists some time after being microfilmed. (Pro tip for meticulous scholars: have a look at all the surrogates, even if you don’t need to!) Also, modern digital imaging is enabling us to read palimpsests and other messy texts with greater ease than before (or indeed at all).

In essence, then, you should make sure to cite EEBO when you use it – not only because of things you may miss due to problems with the images in EEBO, but also because digital resources enable us to do things which are simply impossible or would take forever when using physical copies.

 


Ok this was a long rant. But I hope this might be of use to someone!

The reliability of “Winwood’s Memorials”

The three-volume Memorials of affairs of state in the reigns of Q. Elizabeth and K. James I, collected (chiefly) from the original papers of … Sir Ralph Winwood, edited by Edmund Sawyer, published in 1725 (2nd ed. 1727), is a hugely convenient work for those working on late Elizabethan and early Stuart State Papers, since it prints hundreds of letters from the English diplomatic correspondence with Ambassadors on the continent. However, it is not a reliable source of information, and one should bear this in mind. Not that there is such a thing, of course! But my gripe to you today is a simple one. Observe:

Winwood, vol. 2, p. 357 (1727 ed.):

Winwood no. 541

TNA SP 94/14 f. 214, draft of the same letter:

Cecil to Cornwallis, SP 94.14.214r

Now, regular (?) readers will know that I have a Thing About Dates, and here, too, it is a date that is the matter at question. In Winwood, Sawyer has Cecil say his last letter to Cornwallis was of the “6th of September.” In the draft in the State Papers, however, the date for the same letter is given as “27. of Septembre”.

Okay, you say, so Sawyer printed the wrong date. So what? Well, if you are charting a correspondence – consisting primarily of when letters were sent and received, and to some extent of the paths they took and who they were carried by – and particularly if you are trying to establish the transmitting of information, three weeks makes a world of difference, even in the Early Modern period, even for distances such as that between London and Madrid. For instance, when there was a real emergency at either end, news of which would be transmitted immediately, reconstructing the sequence of events today becomes frustrating in the extreme if dates do not match. Particularly since the matter is compounded by the usage of Old Style and New Style dates, which add a ten-day bracket to every date anyway..

In the end, I suppose more important than one antiquarian-minded scholar’s griping at imperfect editions is the fact that it is cases like this – coming across and resolving conflicting sources – which help develop one’s understanding of the inherent unreliability of the surviving records of the past.

 

(By the way, ironically, the letter from Cecil to Cornwallis of 27.9.1607 does not survive as a draft in TNA SP 94/14, but it is printed in Winwood (1727 ed.), vol. 2, p. 340…)

Woo, palaeography! (argh)

So, I’m transcribing bits of documents I photographed at the Staffordshire Record Office and the William Salt Library in Stafford last September. Most of the docs are older than the ones I usually deal with, which means I have to struggle a bit to read the handwriting. Today’s post is a celebration of the idiocy of features of hands and scripts, and a puzzle for you, dear reader (dear spambot? dear me?):

What is the second word in the phrase in the image above?

Tip: the phrase reads “Stafford [something] et stafford Redd”


more on the vexing matter of assigning dates to documents

TNA SP 94/12 ff.144-147 is a 4-page document entitled:

“Note of my letters of aduertisments from spayne, Italy & other parts from Jan 1605 vntill [missing]”

In other words, it is a list of letters received during 1605. It was compiled by Thomas Wilson, secretary to Sir Robert Cecil, in charge of intelligencing relating to Spain.

On the surface, this document is unproblematic, as it is indeed a list of letters starting in January 1605, and continuing through December 1605. However, as my astute readers will remember from my previous entry on Early Modern dates, dates are never as simple as they seem. In 1605, dates on the continent were usually marked in New Style (NS), and dates in England in Old Style (OS). A careful investigation of this document led me to realize that the list basically follows OS dating, but with complications worthy of being labelled inanities:

– dates between 1 January and 25 March are given the year OS
– this means that those letters are actually from 1606, and that the list begins in 1606 and then jumps back a whole year when it passes 25 March
– that is to say, the list is not chronological

Let me illustrate:

– however, all letters in the list are in fact dated New Style (a comparison with extant documents in SP 94 confirms this), and it is the NS dates which are copied in the list, rather than OS conversions, but with OS years!
– this means that the dates before 25 March are bizarre hybrids of NS day and month but OS year

In this case, I was lucky enough for most of the letters in the list to have survived so that I could determine what was going on in this list. But really this leads me to conclude that much work remains to be done on Early Modern dating practices. For instance:

– why would Wilson compile a list of letters in this manner – surely a chronological (real-time) arrangement would be the obvious one? i.e.
    a) the OS year 1604-5: 1.1.-24.3.1604 OS (= 1605 NS*) + 25.3.-31.12.1605 OS (=NS), or
    b) the OS year 1605: 25.3.-31.12.1605 OS (=NS) + 1.1.-24.3.1605 OS (= 1606 NS), or
    c) the modern = NS calendar year 1605: 1.1.-31.12.1605 NS

– did Wilson compile the list later, being confused with OS/NS discrepancies? did he bundle up his letters by date and year, or were bundles/lists created from loose, unordered documents, but in any case in returning to them later the years became fuzzy? (Wilson’s endorsements usually mark the year, but this changes between marking it NS and OS, which would suggest this was a lapse in systematicity)

– what about the matter of when the year began? Judging by this list 1 January was the beginning of the year for practical purposes, but then the year count did not change until 25 March

– what does this document say about other Early Modern lists of dates or dated documents? Is this kind of hybrid dating more common than we realize? Is this a frequent pitfall for people working on historical material? – I mean, I was of course aware of the OS/NS schism, but I would never have thought that anyone would create a list like this except by mistake

– what does all this say about Early Modern conceptions of dates and years and time?

There’s probably a book out there that explains all this.

* These dates are simplified: in 1600 NS dates were 10 days ahead of OS, so in fact 1.1.-24.3.1604 OS = 11.1.-3.4.1605 NS, etc.
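
The arithmetic in the footnote, and the hybrid dates in Wilson’s list, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a general calendar converter: it assumes the fixed 10-day gap (valid only from 1582 to the end of February 1700 OS), it assumes the year number changes on 25 March, and the function names are my own.

```python
from datetime import date, timedelta

# Assumption: a fixed 10-day gap, which holds from 1582 until the end of
# February 1700 OS; later dates need a larger offset.
OS_NS_GAP = timedelta(days=10)


def historical_year(day: int, month: int, os_year: int) -> int:
    """Year counted from 1 January, for a date whose year number changes on 25 March."""
    return os_year + 1 if (month, day) < (3, 25) else os_year


def os_to_ns(day: int, month: int, os_year: int) -> date:
    """Convert a full Old Style date (OS day, month and year) to New Style."""
    # Python's date is Gregorian, but Julian and Gregorian leap years coincide
    # between 1582 and 1700, so plain day arithmetic is safe in this period.
    return date(historical_year(day, month, os_year), month, day) + OS_NS_GAP


def hybrid_to_ns(day: int, month: int, os_year: int) -> date:
    """Resolve a hybrid entry: NS day and month, but OS year number."""
    # Day and month are already New Style, so only the year needs correcting.
    return date(historical_year(day, month, os_year), month, day)


print(os_to_ns(1, 1, 1604))       # 1605-01-11
print(os_to_ns(24, 3, 1604))      # 1605-04-03
print(hybrid_to_ns(10, 2, 1605))  # 1606-02-10
```

The last line shows the trap in the list: an entry written as 10 February “1605” is, on this reading, a letter of 10 February 1606 NS.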

Rant about code (“MS Office uses XML”)

The new .docx etc formats of the newer versions of Microsoft Office are done in XML. Hence the -x in the extension. The problem with this, however, is something we all know: all MS programs are bloated pieces of shit. Those of you who occasionally fiddle with HTML will probably have experimented with the oh-this-is-convenient “save as HTML” function in Word, only to look at the resulting code in stunned admiration of the amount of resulting crap MS has managed to program Office to include.

This is a different version of the same story (ie. a rant):

When I transcribe documents, I do this using plain text editors, and then save the resulting transcriptions as rtfs (ie. “rich text format”, plain text + moderate formatting). Mostly, I use the TextEdit program that comes with Macs; occasionally I have used MS Word and saved as rtfs. The results look alike, but are different under the hood – and the fact that gives this away is the size of the text file, as the same document can be either 4KB or 49KB, depending on whether I’ve done it in TextEdit or MS Word.

Right, so here’s a clip of the text document in question – as you can see, there is very little formatting to deal with:

Looking at the code of this rtf file (I use a nifty little code editor called Smultron :D ), you can see that there’s not much code in there – as it should be:

But when I save it as rtf in MS Word, the difference is obvious and pronounced. This is the beginning of the resulting document viewed in Smultron:

..and this is the section of the text corresponding to those in images 1 and 2 above.

Check out and compare the stats at the bottom of images 2 and 4. Clearly MS Word is insane. The rtf saved from Word is twelve times the size it need be, and fifteen times the length in characters. The amount of code in the sane version is about 1,000 characters: in the MS Word version, it’s about 44,000 characters. Forty-four thousand characters!!

So what’s my point? I guess this: Know Your Tools. At the very least, learn their failings, weaknesses and limitations.

Digital Humanities* and “Digital Humanities 2.0”

Back in June, I attended the Digital Humanities 2008 conference. Digital humanities, for those not in the know (although I’m sure the term is hardly opaque), is the ridiculously wide field covering all humanities disciplines which use computers. So it includes everyone from corpus linguists to software engineers interested in solutions for humanists, and from librarians to educators who create computer games in order to teach kids history. And more.

There are a few reasons why I attended DH 2008. The main reason is that my PhD thesis will be a digital edition. To that end I’ve started a project with two fellow PhD students for developing linguistically oriented digital editions; the paper I gave at DH 2008 was for the project, part of our Spring 2008 promo tour. Our project falls into the digital humanities sphere, rather than the field of corpus linguistics where we come from. Next, I like things digitally humanistic – particularly digital resources such as corpora, editions, databases, textbases,(1) and to a lesser degree portals and their kin. Also software and websites and gadgetry (ok so that’s software, but if you’ll allow the distinctions between Word and Tofu, I’ll call the latter gadgets. This term is liable to change.). Finally, this year the DH conference was in Oulu, so I could hardly not attend (for our non-academic readers: yes, since travel money is hard to find, location can be decisive in terms of which conferences to attend!).

It was a great conference, overall. Good papers, nice people, exciting projects, excellent contacts, and everything went smoothly and I may have given my best presentation yet (I really should make notes for my next one, though – although I suspect there is always room for improvement). Those interested can go read the abstracts of the papers and posters given at the conference on the DH 2008 website.

However – and this is why I’m writing this post – I must say I am rather dissatisfied with all too many digital resources and tools. Since I’m working on a digital edition, this is what I am perhaps most familiar with (that, and corpora of course), and so I’ll voice my disapproval at digital editions in particular. But more generally, too; here’s the thing:

digital resources are not digitised print resources (2)

Nonetheless, people keep creating these things as if they were.

I think this is mostly a generational thing. That is, older people have more difficulty envisioning born-digital tools and resources. The prime examples come from outside academia – kids today have grown up with exponentially more powerful digital tools than previous generations. This has given them infinitely better intuition for working these tools and machines – mainly computers of course, but this includes DVD players, MP3 players, mobile phones, game consoles, GPS devices, and so on and so forth: all electronic devices, in other words. And the same holds for interfaces: kids are much more adept than adults at using the software in all of these devices.

None of this is new, none of this is surprising: younger generations are more at home with technological advances.

But my main gripe concerns digital editions, which I call (for now, at least) re-born digital resources. Reborn because the digital editions I am talking about are resources created out of extant documents – manuscripts or printed works. What I am concerned with is the amount of wasted resources – time and money – put into developing obsolete solutions in digital humanities.

What do I mean, specifically?

Well. I think digital resources should take full advantage of the medium. That is, one should not try to re-create an object from another medium as if it was still in that medium. You can come up with your own examples, such as films based on books that got it wrong, or vice versa. Here, I am talking about so-called “digital editions” which, first of all, are nothing more than scanned images of the original document,(3) and second, don’t even try to make the edition user-friendly – i.e., provide it with an intuitive and simple interface which allows easy and fun (yes, fun!) browsing of the images.

As a concrete example, take PicLens/cooliris, which transforms browsing images into an awesome experience. It’s freely available online, and very easy to install on your website. I’m not saying cooliris is the solution, I’m just saying that people have created and released great tools which facilitate working with computers. How about using them? Guys? Anyone? Bueller? Anyone?

Ok yes, so there’s Turning the Pages at BL etc, but come on – while it’s cool, there’s a limit to its usability – I mean for research purposes in particular, but also just general stuff too! ...not that TtP wouldn’t be a cool feature to have, provided you could do other things with it than just, um, turn the pages…

Anyway, my primary point was user interfaces, specifically image browsing. However, this really can be extended to a whole slew of solutions created by all and sundry, basically all of which can be classified under what is commonly known as “Web 2.0”.

See? Old news, really.

This is something I mean to push in academia, in my small way. And when I create something groovy, I shall post a link to it. (Like dipity – you guys seen dipity? Melikes it.)

Notes

* Does anyone know a good term for “digital humanities” in Finnish? “Humanistit ja tietokoneet” .. “humanistinen tietojenkäsittelytiede” (??).. nothing I can think of really cuts it.

(1) Text databases. Yes, really!

(2) Unless they are, of course. And in fairness, there are loads of these: e.g. all things available on EEBO; all the digitised books on Google Books and in the Internet Archive; also most things digitised by libraries etc (like those on the Finnish National Library website). But even then, I think just scanning the books and putting simple images online isn’t really enough.

(3) Yes, there are arguments for this kind of edition. I think the only one to carry any real weight is for projects aiming at quantity, rather than quality (of the edition’s interface, that is, not of the images). Even then, though, I find it difficult to allow for any excuses for not, say, ordering the images in the edition, or renaming the damn things. From experience I can say that while this can be a time-consuming task, actually not doing so is a real pain in the arse later on. This means you, Anglo-American Legal Tradition project!