NATAS: Social embedding of neologisms in early English correspondence

Project members: Tanja Säily, Eetu Mäkelä and Mika Hämäläinen (University of Helsinki)

Collaborating scholar: Jukka Suomela (Aalto University)

This subproject focuses on sociolinguistic variation in the spread of new words, tracing lexical productivity and creativity over time and space.

Aims

Previous research into neologisms in the history of English has mostly relied on lexicographical data such as the Oxford English Dictionary, while corpus-based work has largely been restricted to recent developments or individual derivational affixes. Although the quotations in the OED have been extracted from a variety of sources, there is no sociolinguistic metadata attached to them beyond date and the name of the author, and the dictionary is admittedly biased towards certain well-known authors.

By contrast, our Corpora of Early English Correspondence represent a wide social spectrum, richly documented in the metadata associated with the corpus, including information on e.g. socioeconomic status, gender, age, domicile and the relationship between the writer and recipient. Coupled with the new interface to be developed in the project, this will enable us to ask new kinds of questions about neologisms in personal letters, taking into account sociolinguistic variation and change:

Who are the innovators? Which social groups do they represent?
How do the new words spread socially, geographically and diachronically?
Which semantic domains do the neologisms represent?
Why are the neologisms created and established? Can they be linked to specific historical events or changes in culture and society? What kinds of social meanings are associated with them?

We aim to make the history of English neologisms more balanced by taking into account a wider range of social groups, including women and the lower ranks. For the first time, the diachronic diffusion of lexical innovation is studied in a maximally representative social setting that resembles spoken conversation, the hotbed of language change; this may lead to surprising discoveries. The tools and methods developed in the subproject will enable others to conduct similar types of research.

Tools and methods

The identification of neologisms in a moderately sized historical corpus is a challenging task that places a number of requirements on the toolkit developed in the project, from detecting spelling variants to retrieving related lexicographical data from the OED and its Historical Thesaurus. Previous collaboration between some of the project participants has already produced tools for analysing lexical diversity (types2) as well as for filtering and categorizing corpus search results (FiCa), which may be used as starting points for further development.

Our pipeline for identifying neologisms begins by comparing the earliest year in which a given word appears in the corpus with the earliest attestations in the OED and in various databases of dated published texts (e.g. Eighteenth Century Collections Online, British Library Newspapers). Neither comparison is straightforward. The CEEC records words in their original spelling, which varies a great deal, and existing normalization tools such as VARD2 and MorphAdorner do not provide full coverage. We are therefore developing additional methods to normalize and lemmatize rare spelling variants, so that the words in the corpus can be mapped to the OED and we gain a wider understanding of the earliest attestations of the neologism candidates. As for the text databases, most of them have been automatically transcribed from microfilms, so in addition to spelling variation they contain a large number of OCR errors, which further complicate dating comparisons.

As detecting neologisms is here just a step in a larger process of historical-sociolinguistic research where accuracy is of paramount importance, we are not aiming for a fully automatic pipeline to process such noisy data. Instead, we will develop an open-source environment where information on neologism candidates is gathered from a variety of algorithms and sources, pooled, and presented to a human evaluator for verification. The interface will allow the user to quickly organize and group candidates by various criteria, look up additional information, and make mass filtering and categorization decisions on them. With the verified data, the user can then start exploring the social aspects of the neologisms. To do this, the researcher needs simultaneous access to text, visualization and sociolinguistic metadata, which is a fundamental aim of the STRATAS project as a whole.
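The pooling, grouping and mass-decision steps can be sketched along the following lines in Python. This is only an illustration under stated assumptions, not the project's environment: the functions pool_evidence, group_by, earlier_elsewhere and bulk_decide, the source names and the toy words and years are all hypothetical.

```python
from collections import defaultdict


def pool_evidence(sources):
    """Merge {source_name: {lemma: earliest_year}} into {lemma: {source_name: year}}."""
    pooled = defaultdict(dict)
    for name, dates in sources.items():
        for lemma, year in dates.items():
            pooled[lemma][name] = year
    return dict(pooled)


def group_by(pooled, key_func):
    """Group lemmas under the label returned by key_func(lemma, evidence)."""
    groups = defaultdict(list)
    for lemma in sorted(pooled):
        groups[key_func(lemma, pooled[lemma])].append(lemma)
    return dict(groups)


def earlier_elsewhere(lemma, evidence):
    """Example grouping criterion: does any other source predate the corpus?"""
    corpus_year = evidence.get("corpus")
    if corpus_year is None:
        return "not in corpus"
    others = [year for source, year in evidence.items() if source != "corpus"]
    if any(year < corpus_year for year in others):
        return "attested earlier elsewhere"
    return "no earlier attestation found"


def bulk_decide(decisions, lemmas, verdict):
    """Record one verdict, chosen by the human evaluator, for a whole group."""
    for lemma in lemmas:
        decisions[lemma] = verdict
    return decisions


# Invented toy data: source names, words and years are purely illustrative.
sources = {
    "corpus": {"snazzle": 1705, "plim": 1680},
    "dictionary": {"snazzle": 1703, "plim": 1690},
    "newspapers": {"plim": 1685},
}

pooled = pool_evidence(sources)
groups = group_by(pooled, earlier_elsewhere)
decisions = bulk_decide({}, groups["attested earlier elsewhere"], "rejected")
```

The code only organizes the evidence and records verdicts; the decision itself remains with the evaluator, in keeping with the semi-automatic approach described above.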