Big Data is a great new medium, but it is no silver bullet

For four exciting days in October the d2e: From Data to Evidence conference took place in the University Main Building. As conference assistants, we were able to take part in the adventure and absorb a multitude of information on the trends of current linguistic research.

Photo by Tanja Säily

Photo by Jukka Suomela.

The conference themes were Big Data, Rich Data and Uncharted Data. The aim was to stimulate discussion of the benefits and challenges of past, current and future research in corpus linguistics. The conference was hosted by the University of Helsinki and the Research Unit for Variation, Contacts and Change in English (VARIENG).

More is better – also when it comes to data?

In her opening words, the director of VARIENG, Terttu Nevalainen, quoted the now almost clichéd truism “there’s no data like more data”, while at the same time questioning the traditional meaning of “more data”.

In corpus linguistics, researchers seem to be continuously striving for ever vaster datasets and larger corpora. However, it is worth remembering that, in addition to big data and its huge volumes, rich data and uncharted data also meet the definition of “more data” in every sense of the expression.

Exploring big data, rich data and uncharted data

The numerous benefits of big data, and the way it opens up areas and phenomena that were previously inaccessible to research, were at the heart of many sessions. All of the plenary speakers (Mark Davies, Tony McEnery, Päivi Pahta and Jane Winters) as well as many others, such as Antoinette Renouf and Jack Grieve, touched on the topic. They also drew attention to the shortcomings of big data and the importance of close reading – even while working with large datasets.

An innovative example of utilizing rich data was Marie-Louise Brunner, Stefan Diemer and Selina Schmidt’s corpus of Skype conversations, where they had enriched their corpus of informal, academic dialogue by including the video component, orthographic transcription and pragmatic annotations. This kind of study and corpus can be very useful for studying, for example, communication in English as a Foreign Language and English as a Lingua Franca, as well as for diverse types of multimodal research.

Previously uncharted data, which has not yet been systematically mapped, was also used in inventive ways in several research projects. For example, no one has done as detailed a study as Lucia Siebers on African American letters from the 18th and 19th centuries. Susanna Mäkinen, on the other hand, examined how slaves were characterized in the advertising sections of newspapers in the same period. She presented her findings on how the terminology and characterization varied in Massachusetts, New York and South Carolina newspapers.

Close reading takes time, money and effort, but is indispensable

In many ways, the conference closed on much the same note as it opened.

In his demonstration of the process of compiling and editing The Historical Thesaurus of English, Marc Alexander touched upon all three themes of the d2e conference. He reiterated the strong claim heard in many of the sessions and plenaries that big data does not excuse researchers from close reading. His team led by example: despite the huge scale of the undertaking, they worked on the colossal thesaurus using a bottom-up approach, without a predetermined theory that might warp the results.

Alexander reminded the audience that it is our job as experts to convince financiers that spending time getting fully acquainted with your data is a worthwhile use of research hours. At the same time, researchers themselves must possess the energy and integrity required by such a labour-intensive task.

As Tony McEnery articulated, a corpus is not just a lot of data: the context, the meaning and the structure of the data are of crucial importance. However interesting, voluminous and rich the data, it remains meaningless unless it is contextualised and translated into evidence.

Uncharted data waiting to be discovered

The variety of research topics presented at the conference was remarkable, as were the prevalence of, and enthusiasm for, multidisciplinary collaborations, especially with historians and digital humanities developers.

As seen at this conference, today’s linguists can study a plethora of different topics, ranging from the use of the word “perhaps” between 1500 and 1850 to Zimbabwe’s current language situation, and from Middle English alchemist texts to gender differences in Twitter English in Finland or the use of “pliis” in Finnish discourse.

Due to the relative novelty of the field, there is a whole uncharted world to explore, and anyone interested in corpus linguistics will find as many opportunities for research as they could wish for.

Text: Johanna Hirvensalo & Sofia Bergman
