Preprocessing data for analysis

As discussed in previous blog posts, acquiring data from social media isn’t necessarily straightforward, and once the data is in hand there’s the further challenge of starting to do something with it. Spreadsheets offer one way to proceed, as data often needs to be pre-processed. While spreadsheets aren’t convenient for every endeavour, there’s a lot that can be done with them. They work well in partially automated, stagewise working processes, and their outcomes can easily be checked and re-checked, which is great.

We’ve used the freely available LibreOffice Calc to preprocess our data. As we wish to examine co-occurrences between hashtags in posts, we wanted to process the post-by-post data into a csv file, which we could later import into Gephi. Gephi includes a very useful plugin for creating networks from co-occurring hashtags, so our job was simply to create a csv file with lines of hashtags. Another option would have been to establish pairwise co-occurrences in Calc.
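The pairwise alternative is also easy to script. As an illustrative sketch (the sample posts and tag names below are invented, not our actual data), each post’s list of hashtags can be expanded into an edge list of co-occurring pairs:

```python
from itertools import combinations

def hashtag_pairs(posts):
    """Expand per-post hashtag lists into pairwise co-occurrence edges.

    Each element of `posts` is a list of hashtags from one post; the
    result lists every unordered pair that co-occurs within a post.
    """
    edges = []
    for tags in posts:
        unique = sorted(set(tags))  # ignore duplicate tags within a post
        edges.extend(combinations(unique, 2))
    return edges

# Invented example posts for illustration
posts = [["elevator", "elevatorlife", "city"],
         ["elevatorlife", "city"]]
print(hashtag_pairs(posts))
```

An edge list like this can be aggregated (counting repeated pairs as edge weights) before importing it into a network tool.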

Creating such a csv file was simple with API-acquired data, as it was already organized conveniently to our liking (i.e. the hashtags were already brought together in a dedicated spreadsheet column). Manually organized data were quite a different matter and required numerous steps, which we wouldn’t want to repeat with large data sets. The lesson learned is that it’s a good idea to organize one’s work so that overly labour-intensive preprocessing isn’t needed. With large data sets this may even be considered an essential prerequisite.

Acquiring data from Tumblr

We used Tumblr’s API to gather the data, which means that we technically have programmatic access to the platform’s content. A Python script was created to gather the data through the API. The script allows us to automate the process, combine separate chunks of data into a single .csv file, and continue the collection from the last checkpoint whenever the rate limit has been reached. Still, the work involved a lot more manual effort than expected. Sometimes the server would randomly jump between years, say from 2020 to 2018 with nothing in between; this required searching for a date in between so that the server would again return posts normally. For some tags, the server would only return old data (e.g., posts starting from 2015). Unfortunately nothing could be done about this, because we have no access to the server and can’t change the way it returns data. Sometimes the server would return malformed data, resulting in an invalid .csv file that required manual inspection. And although the API allows specifying how many records to return per request (up to 20), the number returned was often not constant. For these reasons, instead of leaving the script alone to gather the data, the process had to be constantly monitored.
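The checkpointing idea can be sketched separately from the network layer. In the outline below, `fetch_page` is a stand-in for the actual Tumblr API call (the real endpoint, parameters, and response format are not shown, and the field names are assumptions): the collector pages backwards by timestamp, appends rows to a single csv file, and persists the last timestamp so a run interrupted by the rate limit can resume where it left off.

```python
import csv
import json
import os

def collect(fetch_page, csv_path, checkpoint_path):
    """Page backwards through posts, appending rows to a CSV and
    persisting the last-seen timestamp so collection can resume.

    fetch_page(before) must return a list of post dicts with at least
    a 'timestamp' key, newest first; an empty list ends the run.
    """
    # Resume from the saved checkpoint, if one exists.
    before = None
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            before = json.load(f)["before"]

    with open(csv_path, "a", newline="") as out:
        writer = csv.writer(out)
        while True:
            posts = fetch_page(before)
            if not posts:
                break
            for post in posts:
                writer.writerow([post["timestamp"], post.get("tags", "")])
            # The next request asks for posts older than the oldest seen.
            before = posts[-1]["timestamp"]
            with open(checkpoint_path, "w") as f:
                json.dump({"before": before}, f)
    return before
```

Writing the checkpoint after every page means that a crash or rate-limit stop loses at most one page of progress.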

Instagram data

There are over 12 600 posts with the hashtag ‘elevatorlife’ on Instagram. Not only does #elevatorlife reflect consumers’ interpretations of the elevator experience better than #elevator, but the number of published posts also stays rather reasonable. To ensure the quality of the material, the data acquisition was performed manually.

At the workshop held in April we presented some of the qualified Instagram posts, as well as a few that didn’t meet the requirements. To qualify, a post had to be publicly available, have at least one hashtag in its description in addition to #elevatorlife, and be published by an adult individual. Therefore, all posts with marketing interest or from commercial Instagram accounts were excluded, as were posts by elevator mechanics and meme accounts. Some posts met all the other requirements but their pictures had nothing to do with an elevator; these were excluded, as were posts from accounts that rate elevators or collect other accounts’ posts.
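For transparency, the screening rules can be written down as a simple predicate. This is only an illustrative encoding of the criteria above (the field names are invented; the actual screening was done by hand, post by post):

```python
def qualifies(post):
    """Return True if a post meets the inclusion criteria.

    `post` is a dict with invented field names standing in for
    judgements that were in practice made manually.
    """
    has_extra_hashtag = any(t != "elevatorlife" for t in post["hashtags"])
    excluded_account = post["account_type"] in {
        "commercial", "mechanic", "meme", "elevator_rater", "collector"
    }
    return (post["public"]
            and "elevatorlife" in post["hashtags"]
            and has_extra_hashtag
            and post["adult_individual"]
            and not post["marketing"]
            and not excluded_account
            and post["shows_elevator"])
```

Writing the criteria out like this makes it easier to check that every exclusion rule is applied consistently across coders.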

As we noticed, consumers are inventive with their hashtags, and similar-looking posts differ from each other in the hashtags attached.

Next we’ll continue pre-processing the data and move forward with the other social media platforms.

Method development

Updates from the methodology team

Various tools for extracting Instagram data have been used in previous research. However, these include methods that are questionable, have subsequently been banned for research use, or otherwise do not fit our research project’s objectives.

Fortunately, the weekly methodology café sessions and discussions with university researchers on finding an applicable tool have proved helpful in every way, and we will continue along that path.

While we continue our work with acquiring data from Instagram, we will prepare a demo with Twitter data for the meeting in May. A visualization tool will also be used to help the interpretation and to understand the scale of the results.

Methodologically advanced examination of consumer interpretations for refreshing sustainable urban life (MORE)

This research addresses the green transition and its social dimensions from a consumer perspective. It develops advanced network- and script-based methodologies for examining consumer interpretations of sustainability matters. The research task is to develop a proof-of-concept of how consumer-generated interpretations of everyday sustainability in social media can be used in business settings. Current alternatives fall short because they focus on behavioral analytics or rely on insufficiently scalable qualitative approaches. The tested proof-of-concept connects to the roadmap of the ecosystem for sustainable solutions to support better people flow in urban environments, led by KONE, and will be used in a subsequent ecosystem co-innovation project focusing on refreshing sustainable urban life.

Advisory board appointed

We are glad and proud to announce our advisory board – together we are stronger!

Jaana Hyvärinen, Chair
Strategic Foresight Manager

Anna Martela
Kenno Consulting

Janne Korpi
Chief Analytics Officer

Topi Ahava
Account Director

Stay tuned!

Welcome to the website of the MORE project! We’re all about making sense of consumer interpretations of mundane technologies – for regeneration and circularity.

Päivi Timonen, +358 50 3433 138, research profile
Project visionary

Petteri Repo
petteri.repo@helsinki, +358 400 737 968, research profile
Methodological explorer

Maija Luoma
Data wizard