Can we use Twitter data to estimate population distribution in Finland?

One of the main data processing steps before using novel data sources (e.g. Twitter data) to better understand social processes and phenomena is detecting users’ origins, be it at the country, municipality or neighborhood level. This tells us whose Tweets we are investigating in a given geographical area (say, a certain city or neighborhood). At its most basic, origin detection distinguishes locals from non-locals when examining people’s mobility and activity locations. More advanced analyses require more specific knowledge: origin countries in tourism studies and origin neighborhoods in segregation studies, for example.

This data processing step is also a prerequisite for cross-border mobility research: we need to know people’s origins in order to categorize and analyze movements across country borders extracted from geotagged Tweets. Hence, it is a priority for our cross-border project. See, for example, the recent cross-border mobility analysis of the Greater Region of Luxembourg in the MSc thesis by Samuli Massinen.

To analyze cross-border mobility between Finland and Estonia, we first collected publicly available geotagged Tweets from the Twitter Streaming API. We also received additional data from Prof. Matthew Zook and Ate Poorthuis, who collect Twitter data via the DOLLY project at the University of Kentucky. To extract individual mobility trajectories and analyze movements between Finland and Estonia, we used the Twitter Search API to collect Tweet histories (up to the 3200 latest Tweets) for each Twitter user who had geotagged a Tweet in Finland and Estonia at least once. In the end, we had collected the digital traces of roughly 80,000 Twitter users (public accounts) between 2012 and 2019. Using Samuli’s algorithm to detect the country of origin, we found that some 34,000 of these Twitter users live in Finland.

Now we were excited to find out where the Finnish Twitter users in our sample come from and whether their spatial distribution at the municipality level makes sense. So, we slightly modified Samuli’s algorithm to detect their origin municipality in Finland and mapped the result. To evaluate our results, we applied a simple regression model to estimate population from the Twitter data and compared the estimate with the official residential statistics at the municipality level (Figure 1).



Figure 1. Comparison of the Finnish population distribution by municipality between the official residential register of Statistics Finland and the population estimate based on Twitter data, using a quadratic regression model.
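To make the evaluation step more concrete, here is a minimal sketch of fitting a quadratic regression and scoring it with R² and the Pearson correlation. It is an illustration under assumptions, not the code used for Figure 1: the function names are ours, the numbers are made-up toy values, and we assume that the Twitter user count per municipality is the predictor and the register population is the response.

```python
import math
from statistics import mean


def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    # Normal equations: (X^T X) beta = X^T y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 3):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    beta = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (b[i] - tail) / a[i][i]
    return beta


def predict(beta, x):
    return beta[0] + beta[1] * x + beta[2] * x * x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    centre = mean(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - centre) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy numbers per municipality, NOT the study's data
twitter_users = [5, 12, 30, 80, 150, 400]
register_pop = [2_000, 5_500, 14_000, 33_000, 55_000, 120_000]

beta = fit_quadratic(twitter_users, register_pop)
estimates = [predict(beta, x) for x in twitter_users]
print("R^2:", round(r_squared(register_pop, estimates), 3))
print("Pearson r:", round(pearson(register_pop, estimates), 3))
```

In practice one would probably reach for a library routine such as `numpy.polyfit(x, y, 2)`; the hand-written solver only keeps the sketch dependency-free.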

What can we see from the comparison?

First, the positive side. The population estimate from Twitter at the municipality level correlates very strongly with the official residential register statistics: the correlation coefficient is 0.98! This is a great outcome and gives us confidence to continue using the data extracted from Twitter for national-scale analyses. In particular, given how simple the applied model is, there is huge potential to develop the estimation models further. For example, one could now evaluate domestic tourism between different regions or examine the network of central cities and their catchment areas based on people’s mobility derived from Tweets.

Second, the challenge. The exceptionally high concentration of Twitter users in the capital city of Helsinki compared to other municipalities is a challenge for the regression model; basically, Helsinki is an outlier. According to our approach, 33% of all origins are located in Helsinki, although only 12% of the Finnish population lives in Helsinki according to the residential register. We were able to reduce the overrepresentation of Helsinki to some extent by using the best-fit quadratic regression instead of a simple linear regression model (R² = 0.83). In both models we also weighted the Twitter data by age and gender according to Twitter users’ profiles, but this improved model performance only marginally.
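As a rough illustration of what demographic weighting can look like, the sketch below computes simple post-stratification weights (population share divided by sample share per group) and the overrepresentation ratio for Helsinki from the two percentages quoted above. The group labels, counts and shares are hypothetical, and the exact weighting scheme we applied may differ from this textbook version.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per group = population share / sample share of that group."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
        if count > 0  # groups absent from the sample cannot be reweighted
    }


def weighted_user_count(user_groups, weights):
    """Sum of weights over the users assigned to one municipality."""
    return sum(weights[group] for group in user_groups)


# Hypothetical age-gender groups and numbers, for illustration only
sample_counts = {
    "female_18_34": 300,
    "male_18_34": 500,
    "female_35_plus": 80,
    "male_35_plus": 120,
}
population_shares = {
    "female_18_34": 0.15,
    "male_18_34": 0.15,
    "female_35_plus": 0.36,
    "male_35_plus": 0.34,
}

weights = poststratification_weights(sample_counts, population_shares)

# Three hypothetical users whose origin was assigned to the same municipality
print(weighted_user_count(["male_18_34", "male_18_34", "female_35_plus"], weights))

# Overrepresentation of Helsinki, using the shares quoted in the text
helsinki_ratio = 0.33 / 0.12
print(round(helsinki_ratio, 2))
```

Groups that are common on Twitter get a weight below 1 and rare groups a weight above 1, so the weighted sample matches the population structure on the chosen attributes only. That is consistent with the small improvement we saw: age and gender alone do not capture what makes Helsinki different.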

Helsinki’s status as an outlier can be explained by at least two issues: 1) a certain social group using Twitter (defined by something other than age and gender) is more strongly represented in Helsinki than in other municipalities; 2) our information extraction from Twitter data (data enrichment) does not yet detect work locations, which are one of the main anchor points in our daily lives. Luckily, both issues can be tackled in the future by advancing the data enrichment. For the former, one could enrich the data with additional background attributes to profile Twitter users. One could also weight the Twitter data by the proportion of people with higher education and/or university students in a municipality. For the latter, one could apply the framework of the anchor point model by Ahas et al. (2010) to reveal both the home and work (or school) locations and so pinpoint residential locations more accurately. Currently, we believe that our simplistic model has assigned to Helsinki the residential locations of many Twitter users who actually commute to Helsinki for work from the wider Helsinki metropolitan area.
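The idea of separating home and work anchors can be sketched with a simple time-of-day heuristic. To be clear, this is not the anchor point model of Ahas et al. (2010), only a simplified stand-in: the hour windows, the function names and the sample Tweets are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed time windows; sensible values would need to be validated on real data
WORK_HOURS = range(9, 17)                    # 09:00-16:59 on weekdays
HOME_HOURS = (*range(0, 7), *range(20, 24))  # 20:00-06:59 on any day


def most_common_place(places):
    counts = Counter(places)
    return counts.most_common(1)[0][0] if counts else None


def detect_anchors(tweets):
    """tweets: iterable of (datetime, municipality). Returns (home, work)."""
    home_candidates, work_candidates = [], []
    for timestamp, place in tweets:
        is_weekday = timestamp.weekday() < 5  # Monday is 0
        if is_weekday and timestamp.hour in WORK_HOURS:
            work_candidates.append(place)
        elif timestamp.hour in HOME_HOURS:
            home_candidates.append(place)
        # Tweets outside both windows are ignored
    return most_common_place(home_candidates), most_common_place(work_candidates)


# Hypothetical user who tweets mostly at work in Helsinki but lives in Espoo
tweets = [
    (datetime(2019, 3, 4, 10, 15), "Helsinki"),
    (datetime(2019, 3, 4, 22, 30), "Espoo"),
    (datetime(2019, 3, 5, 13, 0), "Helsinki"),
    (datetime(2019, 3, 5, 23, 10), "Espoo"),
    (datetime(2019, 3, 6, 11, 45), "Helsinki"),
    (datetime(2019, 3, 7, 15, 0), "Helsinki"),
    (datetime(2019, 3, 9, 21, 0), "Espoo"),
]

home, work = detect_anchors(tweets)
naive_home = most_common_place(place for _, place in tweets)
print(home, work, naive_home)
```

A count of all Tweets would place this user in Helsinki, whereas the time-aware heuristic separates the evening location from the working-day location, which is the kind of correction we expect to reduce the overrepresentation of Helsinki.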

In conclusion, this comparison gives us confidence that we can detect users’ origins from social media data, and that we can use the detected origin as one background attribute in our work-in-progress cross-border mobility analysis between Finland and Estonia. We will also continue developing the origin detection algorithm. Stay tuned!




Geotagged Twitter Data to Reveal Cross-Border Mobility of People

Overview of Samuli Massinen’s MSc thesis:

How can Big Data sources, such as georeferenced social media, be used in cross-border research? What kind of cross-border mobility patterns can be detected geographically over time? How can daily cross-border movements be separated from other movements? These were the main questions I set out to answer in my Master’s thesis “Modeling Cross-Border Mobility Using Geotagged Twitter in the Greater Region of Luxembourg”.

The Greater Region of Luxembourg is the largest cross-border labor market in the European Union, with the greatest number of cross-border workers. European integration, the Schengen Area, and socio-economic differences between neighboring countries have been the main factors facilitating human cross-border movements in the region and thus the emergence and expansion of the borderland community. Despite the freedom of movement, country borders still exist, as do the socio-economic differences between the countries. We are witnessing a growing trend of people moving to the other side of the border while still working in Luxembourg. This gives rise to daily cross-border mobility, which is not well understood to date. Thus, there is a distinct need to understand cross-border mobility dynamics in the region, especially border crossings on a daily basis.

Thus far, cross-border mobility studies have mainly relied on national registers, surveys, and census data. However, these datasets have mostly been too sparse to capture the complexities of cross-border mobility in time and space. Many studies have focused only on aggregate-level movement patterns, and the viewpoint of individuals has been missing due to a lack of suitable data. One promising way to bring individual-level data to cross-border mobility research is to use novel data sources, such as mobile phone positioning and social media data.

In my thesis, I employed a person-based approach using geotagged Twitter data to study cross-border mobility in the Greater Region of Luxembourg. The aim was to examine how to use social media in cross-border mobility research and how to move beyond aggregate-level inspections. As this is one of the first studies of its kind, I used a heuristic programmatic approach. To my knowledge, social media data sources have not previously been applied to distinguish different types of cross-border mobility. For this reason, I have published all scripts and algorithms developed in the study on the Digital Geography Lab’s GitHub pages.

Figure 1. Daily cross-border mover activity location density distribution in

The results show that social media can be used in cross-border mobility research and that social media data can provide a relatively good proxy for the daily cross-border mobility of people at the regional level. Aggregate-level cross-border mobility patterns and activity location densities (Figure 1) correspond closely with previous studies. Outcomes from temporal frequency inspections indicate that the approach is promising for identifying and classifying border crossers: Twitter users classified as daily cross-border movers are more mobile on working days, whereas infrequent border crossers (potentially leisure and tourism-related) are mobile on weekends (Figure 2). Daily cross-border mobility patterns also provided new information about the spatial extent of the movements, indicating home-work commuting (Figure 3).
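The core of such a temporal frequency inspection can be sketched as follows: a border crossing is recorded whenever two consecutive Tweets of the same user come from different countries, and crossings are then counted by weekday. This is a simplified illustration rather than the thesis code; the function names, the choice to date each crossing by the later Tweet, and the sample data are assumptions.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def border_crossings(tweets):
    """tweets: list of (datetime, country_code) for ONE user.

    Yields (timestamp, origin, destination) for every change of country
    between chronologically consecutive Tweets. The crossing is dated by
    the later Tweet, since the true crossing time is unknown.
    """
    ordered = sorted(tweets)
    for (_, prev_country), (timestamp, country) in zip(ordered, ordered[1:]):
        if country != prev_country:
            yield timestamp, prev_country, country


def crossings_by_weekday(tweets):
    """Count a user's border crossings per weekday."""
    return Counter(
        WEEKDAYS[timestamp.weekday()]
        for timestamp, _, _ in border_crossings(tweets)
    )


# Hypothetical commuter living in France and working in Luxembourg
tweets = [
    (datetime(2018, 5, 7, 7, 30), "FR"),
    (datetime(2018, 5, 7, 12, 0), "LU"),
    (datetime(2018, 5, 7, 19, 0), "FR"),
    (datetime(2018, 5, 8, 12, 30), "LU"),
    (datetime(2018, 5, 8, 20, 0), "FR"),
    (datetime(2018, 5, 12, 14, 0), "FR"),
]

print(crossings_by_weekday(tweets))
```

Per-user weekday counts like these could then feed a rule that separates daily movers from infrequent border crossers; the specific classification criteria are in the published scripts and are not reproduced here.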

Figure 2. Weekday variation of cross-border trips to Luxembourg (both directions) by cross-border mover type within the Greater Region of Luxembourg in 2010–2018.

To obtain meaningful information from a Big Data source, several data processing steps are required, one crucial stage being the origin detection of social media users. I used a heuristic approach to home location detection, which resulted in high accuracy: the “unique weeks” algorithm introduced in this study reached an accuracy of 88.6% against the ground truth (i.e. the home location that Twitter users themselves reported).
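One plausible reading of the “unique weeks” idea is sketched below: the home region is the one where a user has posted during the largest number of distinct calendar weeks, so that a short burst of Tweets on a trip does not outweigh a long-term presence. This interpretation, the function names and the sample data are assumptions for illustration; the published scripts remain the authoritative version.

```python
from collections import defaultdict
from datetime import date


def detect_home(posts):
    """posts: iterable of (date, region) for ONE user.

    Returns the region with Tweets in the most distinct ISO weeks, or None
    if there are no posts. Ties go to the region encountered first.
    """
    weeks_by_region = defaultdict(set)
    for day, region in posts:
        iso_year, iso_week, _ = day.isocalendar()
        weeks_by_region[region].add((iso_year, iso_week))
    if not weeks_by_region:
        return None
    return max(weeks_by_region, key=lambda region: len(weeks_by_region[region]))


def accuracy(predicted, ground_truth):
    """Share of users with known ground truth whose predicted home matches."""
    checked = [user for user in ground_truth if user in predicted]
    hits = sum(predicted[user] == ground_truth[user] for user in checked)
    return hits / len(checked) if checked else float("nan")


# Hypothetical user: five Tweets in Luxembourg within a single week,
# three Tweets in Germany spread over three different weeks
posts = [
    (date(2018, 7, 2), "LU"),
    (date(2018, 7, 3), "LU"),
    (date(2018, 7, 4), "LU"),
    (date(2018, 7, 5), "LU"),
    (date(2018, 7, 6), "LU"),
    (date(2018, 3, 5), "DE"),
    (date(2018, 4, 16), "DE"),
    (date(2018, 9, 10), "DE"),
]
print(detect_home(posts))

# Hypothetical validation against self-reported home locations
predicted = {"u1": "DE", "u2": "LU", "u3": "FR"}
self_reported = {"u1": "DE", "u2": "BE"}
print(accuracy(predicted, self_reported))
```

Counting weeks instead of Tweets is what makes the heuristic robust to very active visitors: the user above has more Tweets in Luxembourg but is still assigned to Germany.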

Figure 3. Cross-border trips for identified daily cross-border movers in 2010–2018. Movements cover both ways between Luxembourg and neighboring areas. Note: each map has a different flow intensity scale.

Although the applied approach is promising in providing new knowledge about spatio-temporal patterns and trends, the results should be treated with some caution. For example, this study did not account for population densities or Twitter use activity, which might cause bias: both attributes vary in space and time and affect where and when Twitter data is recorded. The sample size was also rather limited in this study (~3200 Twitter users).

Hence, further research and method development with larger sample sizes are needed to draw sounder conclusions about cross-border mobility. Still, in my research, I was able to establish that the coverage of geotagged Twitter data depends on the data acquisition process and that the Twitter Streaming API can provide valuable information for cross-border mobility research. For future studies, I recommend applying multi-level data acquisition processes jointly with a person-based approach that combines both spatio-temporal and content analysis methodologies.

My research was part of the cross-border mobilities project at the Digital Geography Lab.


Text by: Samuli Massinen

My MSc thesis can be found here: E-Thesis

All my developed scripts and algorithms can be found here: DGL@GitHub