Acquiring data from Tumblr

We used Tumblr’s API to gather the data, which gives us programmatic access to the posts the platform serves. To collect the data through the API, we wrote a Python script that performs several tasks:

- automates the request loop;
- combines separate chunks of data into a single .csv file;
- resumes collection from the last checkpoint whenever the rate limit is reached.

In practice, this involved much more manual work than expected:

- Sometimes the server would jump from one year to another, say from 2020 straight to 2018, with nothing in between. We then had to search manually for a date inside the gap before the server would return posts normally again.
- For some tags, the server returned only old data (e.g., posts starting from 2015). Nothing could be done about this, because we have no access to the server itself and cannot change how it returns data.
- Occasionally the server returned malformed data, which produced an invalid .csv file that required manual inspection.
- Although the API lets you specify how many records to return per request (up to 20), the actual number was often inconsistent.

For these reasons, instead of leaving the script alone to gather the data, the process had to be monitored constantly.
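The checkpoint-and-resume loop described above can be sketched roughly as follows. This is a minimal illustration, not our actual script: `fetch_page` is a placeholder for whatever makes the real API call (e.g., a request to Tumblr's `/v2/tagged` endpoint with an `api_key` and a `before` timestamp), and all file names and field names here are made up for the example. The key idea is that after every page we persist the oldest timestamp seen, so the script can pick up where it left off after hitting a rate limit.

```python
import csv
import json
import os


def load_checkpoint(path, default):
    # Resume from the last saved "before" timestamp, if a checkpoint exists.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["before"]
    return default


def save_checkpoint(path, before):
    with open(path, "w") as f:
        json.dump({"before": before}, f)


def append_rows(csv_path, posts, fieldnames):
    # Append one chunk of posts to the combined .csv, writing the
    # header only when the file is first created.
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        for post in posts:
            writer.writerow({k: post.get(k, "") for k in fieldnames})


def collect(fetch_page, start_before, checkpoint_path, csv_path,
            fieldnames=("id", "timestamp", "tags")):
    """Page backwards through posts, checkpointing after every chunk.

    fetch_page(before) should return a (possibly short) list of post
    dicts older than `before`, or an empty list when there is nothing
    left; a real implementation would call the Tumblr API here.
    """
    before = load_checkpoint(checkpoint_path, start_before)
    total = 0
    while True:
        posts = fetch_page(before)
        if not posts:
            break
        append_rows(csv_path, posts, fieldnames)
        # Next request asks for posts older than the oldest one seen,
        # and the checkpoint lets a later run resume from this point.
        before = min(p["timestamp"] for p in posts)
        save_checkpoint(checkpoint_path, before)
        total += len(posts)
    return total
```

If `fetch_page` raises when the rate limit is hit, the checkpoint file still holds the last completed position, so restarting the script simply continues the descent. This is also why the erratic server behavior hurt: when the server skipped years or returned fewer records than requested, the `before` value had to be adjusted by hand.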
