The difficult relationship between Brexit and Twitter data

During the Digital Humanities Hackathon our group is researching data collected from Twitter. Our data consists of all the tweets, retweets and mentions connected to the hashtag #Brexit from the period 26 February to 15 April 2019.

The timeline of our data is interesting because 29 March was the original date on which the United Kingdom was supposed to leave the European Union, but as we all know, that did not happen. This gives us the chance to look in detail at something that failed to happen and at how people reacted to it on this particular social media platform.

One important thing to remember when doing research with data like this is that you need to know your data: how it was collected and what its limitations are. Twitter is only one social media platform among many, and not everyone uses it.

We only see people from certain demographics of the general population, and among them only the Twitter users who take part in the political discussion. And who knows whether they share their true feelings on social media as openly as they would in “real life”?

The problem with using a single hashtag as the parameter for collecting data is that hashtags are always self-selected markers, and not all Twitter users add them to their tweets. In our case there could also be other hashtags connected to Brexit as a phenomenon that are used on their own, without #Brexit. This sets certain limits on how we can interpret our data.
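To make the hashtag limitation concrete, here is a minimal Python sketch. It is not how our data was actually collected; the sample tweets and the function names are made up for illustration. It only shows how a filter on one exact hashtag leaves out tweets that are clearly about the same topic.

```python
import re

# Matches a '#' followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtags(text):
    """Return the set of lowercased hashtags found in a tweet text."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def matches_query(text, query="brexit"):
    """True only if the tweet carries the exact hashtag (case-insensitive)."""
    return query in hashtags(text)


# Hypothetical example tweets, invented for this sketch.
sample = [
    "Another vote tonight #Brexit",
    "Will there be an extension? #Article50",
    "Brexit talks continue in Brussels",
    "#brexit #PeoplesVote march on Saturday",
]

collected = [t for t in sample if matches_query(t)]
missed = [t for t in sample if not matches_query(t)]

print("collected:", collected)
print("missed:", missed)
```

The second and third sample tweets are about Brexit but would never reach a dataset built on #Brexit alone, which is exactly the kind of gap we have to keep in mind when interpreting the results.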

Digital humanities methods have great advantages for doing fast research on a large scale, but there are still things a computer cannot do. For example, yesterday Heng and Faiz identified the gender of 577 MPs manually, because there was no good program that could detect gender accurately from a Twitter username.

Besides, the bigger the data is, the messier it can be. It is easy to obtain a large amount of data, but without close manual checking the data can also be unreliable. We, for example, have to be wary of bot activity in the political discussion. Bigger and faster is not always better, and we need to find a balance between quantity and quality when doing digital humanities.

With data coming from social media there are also ethical questions to think about and solve. Because Twitter data contains so much personal information about users, sharing the original datasets alongside your research would be highly unethical. What kind of information can we use, and what is too personal? This means that anonymisation of usernames, tweets and other critical information is extremely important, and a lot of time needs to be dedicated to these practices.
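One possible first step towards this is replacing usernames with stable pseudonyms. The Python sketch below is only an illustration under our own assumptions, not the procedure we used: the function names, the secret key and the example tweet are hypothetical. It uses a keyed hash (HMAC-SHA256) from the standard library, so the same user always gets the same pseudonym, but the pseudonym cannot be recomputed without the key.

```python
import hashlib
import hmac
import re

# Matches an '@' followed by letters, digits or underscores.
MENTION_RE = re.compile(r"@(\w+)")


def pseudonym(username, key):
    """Map a username to a stable pseudonym using a keyed hash.

    `key` is a secret byte string that must never be published with the data.
    """
    digest = hmac.new(key, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]


def mask_mentions(text, key):
    """Replace every @mention in a tweet text with its pseudonym."""
    return MENTION_RE.sub(lambda m: "@" + pseudonym(m.group(1), key), text)


# Hypothetical key and tweet, invented for this sketch.
secret_key = b"replace-with-a-long-random-secret"
tweet = "Thanks @SomeUser and @another_user for the thread on #Brexit"

print(pseudonym("SomeUser", secret_key))
print(mask_mentions(tweet, secret_key))
```

This is pseudonymisation rather than full anonymisation. The tweet text itself can often still be found with a search engine, so the wording of tweets and other identifying details need separate attention.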

Written by Minna Turunen, MA student in the Cultural Heritage program at the University of Helsinki, and Heng Gong, MA student of English studies at the University of Helsinki.