Preprocessing data for analysis

Acquiring data from social media isn’t necessarily straightforward, as previous blog posts have shown (!), but there’s also the challenge of starting to do something with the data once one has it. Data often need to be pre-processed first, and spreadsheets offer one way to proceed. While spreadsheets aren’t convenient in every endeavour, there’s a lot that can be done with them. They work well in partially automated, stagewise working processes, and their outcomes can easily be checked and re-checked, which is great.

We’ve used the freely available LibreOffice Calc (https://www.libreoffice.org/discover/calc/) to preprocess our data. As we wish to examine co-occurrences between hashtags in posts, we wanted to process the post-by-post data into a csv file, which we could later import into Gephi (https://gephi.org/). Gephi includes a very useful plugin for creating networks from co-occurring hashtags, so our job was simply to create a csv file with lines of hashtags. Another option would have been to establish pairwise co-occurrences in Calc.
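For readers who prefer scripting, here is a minimal Python sketch of the two options mentioned above: writing lines of hashtags to a csv file, and counting pairwise co-occurrences directly. This is not what we did in Calc. The sample posts and the function names are hypothetical, and the comma delimiter is an assumption that may need adjusting to whatever the Gephi import expects.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical sample: each post is already reduced to its list of hashtags.
posts = [
    ["#data", "#gephi", "#networks"],
    ["#data", "#gephi"],
    ["#calc"],
]

def write_hashtag_lines(posts, handle):
    """Write one csv row per post, one hashtag per cell."""
    writer = csv.writer(handle)
    for tags in posts:
        writer.writerow(tags)

def count_pairs(posts):
    """Count how often each unordered pair of hashtags shares a post."""
    pairs = Counter()
    for tags in posts:
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

# Write to an in-memory buffer here; a real file opened with
# open(path, "w", newline="") would work the same way.
buffer = io.StringIO()
write_hashtag_lines(posts, buffer)
print(buffer.getvalue())
print(count_pairs(posts))
```

Sorting each post's hashtags before pairing means a pair is counted the same way regardless of the order in which the hashtags appeared in the post.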

Creating such a csv file was simple with API-acquired data, as that was already very conveniently organized to our liking (i.e. the hashtags were already brought together in a dedicated spreadsheet column). Manually organized data were quite a different matter and required numerous steps, which we wouldn’t want to repeat with large data sets. The lesson learned is that it’s a good idea to work so that one doesn’t have to do overly labour-intensive preprocessing. With large data sets, this may even be considered an essential prerequisite.
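When the hashtags are not already gathered in their own column, one way to cut down on manual steps is to pull them out of the post text with a script. The sketch below is only an illustration under our own assumptions, not the procedure we followed: the function name is hypothetical, a hashtag is assumed to be a "#" followed by letters, digits or underscores, and hashtags are lowercased so that differently capitalized versions count as the same tag.

```python
import re

# Assumption: a hashtag is "#" followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return the lowercased hashtags of a post, in order, without duplicates."""
    seen = []
    for tag in HASHTAG.findall(text):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen

print(extract_hashtags("Loving #Gephi and #networks! #gephi rocks"))
```

Applied to each post in turn, a function like this would produce the per-post lists of hashtags that can then be written out line by line.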
