Hapax legomena in diachronic corpora, 12 May 2017

Here is the abstract of the talk given by Dr Mark Kaunisto at our Annual Meeting in 2017.

Hapax legomena in diachronic corpora: neologisms or just rare words?

Mark Kaunisto, University of Tampere

The aim of quantifying the productivity of derivational affixes, i.e. the potential that affixes have to produce new words, has attracted a great deal of attention from linguists in the last couple of decades. The same goes for tracking historical trends in the popularity of affixes: at some points, affixes may have been used more to produce new words than at other times. One method that has been used to this end is to examine the numbers of first citations of words with a given affix in the Oxford English Dictionary (OED). However, this method has been criticized because of the inaccuracies and gaps in the dictionary. Instead, Baayen (1991) and others have focused on the occurrences of so-called hapax legomena or hapaxes in corpora (i.e., items occurring only once), the basic rationale being that neologisms in corpora are most likely to be found among low-frequency items. Different indexes indicating the degree of productivity of affixes have been proposed, with the number of hapax legomena playing a major role.
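As a rough illustration of how hapaxes enter such an index, the sketch below counts the hapaxes carrying a given suffix in a tokenised text and divides that count by the number of all tokens carrying the suffix. This ratio is one commonly cited hapax-based measure; the abstract does not say which indexes were used in the talk, so the formula, the function name and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter


def hapaxes_with_suffix(tokens, suffix):
    """Return (sorted hapax types, total token count) for words ending in `suffix`.

    Matching is a naive string test, so it also catches non-derived words
    that merely end in the same letters; real studies filter these by hand.
    """
    counts = Counter(t.lower() for t in tokens if t.lower().endswith(suffix))
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return hapaxes, sum(counts.values())


# Toy input, invented for illustration only.
tokens = "the sadness and kindness of the darkness met kindness with brightness".split()

hapaxes, n_tokens = hapaxes_with_suffix(tokens, "ness")

# Assumed index: number of hapaxes with the suffix / all tokens with the suffix.
productivity = len(hapaxes) / n_tokens if n_tokens else 0.0

print(hapaxes, n_tokens, productivity)
```

On a real corpus the token list would come from the corpus files, and the suffix match would need manual checking before the counts could be interpreted.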

It has been observed that hapaxes in corpora are not necessarily new words, but that they may also include items which are just rare or even archaic. If we consider this from the point of view of diachronic studies, it is relevant to ask whether hapaxes in corpora representing different periods contain genuine neologisms in proportions which are similar from one period to another. The Early Modern English period (1500-1700), for example, is known for great experimentation in the application of different word-formational patterns, resulting in a high number of superfluous lexical items, many of which were later dropped from use. One could then assume that the ratio between recent coinages and other words among the hapaxes is not necessarily constant, and that trends of lexical expansion and obsolescence may be reflected among such words to varying degrees. To examine the issue more closely, subsets of hapax legomena were examined in the three subperiods of the Corpus of Late Modern English Texts 3.0 (1710-80, 1780-1850, and 1850-1920). From each section of the corpus, hapaxes beginning with the letter m were first examined, and the dates of publication of the texts in which the hapaxes were found were compared with the dates of first citations of the words in the OED. In addition, hapaxes ending in ‑ity and ‑ness as well as ‑ic and ‑ical were analysed in a similar way. The proportions of hapaxes occurring in texts within 50 years of the first OED citations of the words were observed in each subperiod of the corpus.
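The comparison described above can be sketched as follows: each hapax is paired with the publication year of the text it occurs in, that year is compared with the year of the word's first OED citation, and the share of hapaxes falling within the 50-year window is computed. This is a minimal sketch, not the procedure actually used in the study. The function name, the placeholder words and all years are invented, and two choices are assumptions: hapaxes without a dated OED entry are left out of the proportion, and hapaxes that antedate the OED citation are counted as recent.

```python
def recent_share(hapaxes, first_cited, window=50):
    """Share of hapaxes attested within `window` years of their first OED citation.

    hapaxes:     list of (word, publication year of the corpus text)
    first_cited: dict mapping word -> year of first OED citation
    Returns (proportion, list of words counted as recent).
    """
    # Assumption: words with no dated OED entry are excluded from the base.
    dated = [(word, year) for word, year in hapaxes if word in first_cited]
    if not dated:
        return 0.0, []
    # Assumption: a corpus text older than the OED citation (an antedating)
    # gives a negative difference and therefore counts as recent.
    recent = [word for word, year in dated if year - first_cited[word] <= window]
    return len(recent) / len(dated), recent


# Placeholder data, not real corpus hits or OED dates.
hapaxes = [("word_a", 1790), ("word_b", 1800), ("word_c", 1860), ("word_d", 1725)]
first_cited = {"word_a": 1760, "word_b": 1605, "word_c": 1871}

share, recent = recent_share(hapaxes, first_cited)
print(share, recent)
```

Running the same function separately on the hapaxes of each subperiod would give the per-period proportions that the comparison relies on.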

Based on the results of the study, it is argued that studying hapaxes in corpora in combination with their first citations in the OED could improve the accuracy of examining the productivity of affixes, as attention is primarily paid to those hapaxes which were verifiably recent in the corpus. Although the OED is used as the basis of this verification, the result of such a process does not simply mean a general step towards embracing the OED data as such. As corpus hapaxes include many antedatings to the OED entries, combining the observations from corpora and the OED can be regarded as bringing the benefits of both resources together.

