ERRATAS: Investigating the orthographical reliability of historical corpora

Project members: Anni Sairio, Samuli Kaislaniemi, Anna Merikallio and Terttu Nevalainen (University of Helsinki)

This subproject charts the philological reliability of printed editions of English historical letters, in order to expand the scope of research possible on edition-based corpora.


The Corpora of Early English Correspondence (CEEC) were compiled to access Early Modern English manuscript texts. However, the cost of transcribing and editing manuscripts for a larger historical corpus is prohibitive, and therefore the CEEC, following the example of other historical corpora (e.g. the Helsinki Corpus), was based on printed editions. Such philological outsourcing comes at a cost: although all care was taken to only use editions that retain original spellings, all editions differ in the application of editorial principles. Thus, the compilers of CEEC have been careful to note that the corpora are not suited for studying spelling, since transcription conventions are not consistent across the corpus texts.

However, variation between editions does not in fact preclude using an edition-based corpus like CEEC for orthographical research – provided that this variation can be identified and avoided or compensated for. The challenge is that, until now, there has been no systematic way to determine the philological reliability of an edited text.

The ERRATAS subproject develops a methodology for the evaluation of the orthographical quality of edition-based corpora. The immediate goal is to evaluate the editions used in CEEC, in order to allow the use of the corpora for research in historical spelling. The aim in the long term is to create a methodology for assessing the philological reliability of editions, and to see if it is possible to rank editions by this reliability. Such an evaluation tool would allow including information on philological reliability as metadata in an edition-based historical corpus. This would increase the scope of research of edition-based corpora, and enable scholars to use such resources to study spelling, among other philological topics.

We are interested in spelling because our current understanding of the development of spelling standardization in English is based primarily on printed texts. While manuscript texts have not been neglected, a long-term survey of private (manuscript) spelling practices has hitherto been unfeasible due to the lack of a suitable corpus, and studies of manuscript spelling have been restricted to superficial overviews of long-term developments, or in-depth surveys of specific texts.

Making it possible to use CEEC to study spelling will reveal the history of English private spelling practices. CEEC also has the added benefit of being socially stratified, so that the results of corpus queries can be subjected to sociolinguistic analysis. Orthographical studies of the CEEC will thus make it possible to study, for the first time, the social history of English spelling.


The aim of the ERRATAS project is nothing less than to achieve the impossible:

Can we determine editorial changes made to a historical manuscript text from the edition – without recourse to the source manuscript?

This goal requires a lot of manual work: our primary challenge is the absence not only of any overview of printed editions of historical texts, but also of baseline data in the form of surveys or lists of textual features (not only spelling variation) that we can expect to find in manuscript texts from the time periods covered by CEEC (1400-1800).

The primary questions in establishing the philological quality of printed editions are:

  1. What do editors say they have done to the text?
  2. What have they actually done to the text?

Charting the answers to these questions began with writing a checklist of editorial practices: guidelines containing concrete examples of the kinds of textual features to be looked for in editions. This is accompanied by a database in which to record the results of surveys. To facilitate use, a data entry form was created as an interface to the database.

Very soon we discovered that the range of textual features which editors’ practices affect was wider than anticipated. More precisely, in order to capture editorial practices accurately, it was necessary to check for – and thus include in the checklist and database – features which were not present in all manuscript texts. The process has therefore been iterative: the contents of the checklist and database have been affected by what we have found in the editions.
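As a purely illustrative sketch, one survey entry could pair the two primary questions (what editors say they have done, and what they have actually done) for a single textual feature, while allowing for features that do not occur in a given text. This is not the ERRATAS database schema: the record layout, the feature names, the edition label and the category labels below are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeatureCheck:
    """One hypothetical checklist entry for one feature in one edition."""
    edition: str
    feature: str
    # What the editor says: True = retained, False = altered, None = not mentioned.
    stated: Optional[bool]
    # What the survey finds: True = retained, False = altered,
    # None = the feature does not occur in the text surveyed.
    observed: Optional[bool]


def classify(check: FeatureCheck) -> str:
    """Compare the editor's statement with the surveyed practice."""
    if check.observed is None:
        return "not attested"
    if check.stated is None:
        return "silent retention" if check.observed else "silent change"
    if check.stated == check.observed:
        return "as stated"
    return "contradicts statement"


def summarise(checks: list[FeatureCheck]) -> Counter:
    """Count how many checked features fall into each category."""
    return Counter(classify(c) for c in checks)


# Invented example data; "Edition A" and the features are placeholders.
checks = [
    FeatureCheck("Edition A", "abbreviations", stated=False, observed=False),
    FeatureCheck("Edition A", "u/v spelling", stated=True, observed=False),
    FeatureCheck("Edition A", "punctuation", stated=None, observed=False),
    FeatureCheck("Edition A", "superscripts", stated=True, observed=None),
]
summary = summarise(checks)
```

Keeping "not attested" separate from the other categories reflects the point above that some features are not present in all manuscript texts, so their absence from an edition says nothing about the editor's practice.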

ERRATAS first tackled editions of seventeenth-century letters. On moving on to look at editions of eighteenth-century texts, it was found necessary to add to and adjust the checklist and database further, in order to accommodate changes that took place in the English language over time.

Work-in-progress versions of the checklist and database data entry forms have been made available on Zenodo.