The case for the societal benefit of user-generated big data research – DGL responds to EU on research data access

Authors: Tatu Leppämäki, Tuuli Toivonen, Olle Järv together with other Digital Geography Lab members

The Digital Services Act (DSA) is European Union legislation that aims to protect the users of online platforms and to mitigate the risks these platforms cause. It covers anything from social media sites to search engines and online retailers. It does this by obligating the platforms to, for example, be transparent about their content recommendation systems and effectively tackle content manipulation and the spread of disinformation. Because of their significant effect on our societies, the legislation sets more obligations for very large online platforms (VLOPs): this class of platforms includes social media giants such as Facebook, YouTube, Instagram, Twitter, and TikTok.

As a research group that has successfully applied user-generated data to study a multitude of topics, our interest in the legislation stems from the sections that obligate VLOPs to provide means of accessing data uploaded to their platforms for appropriate research purposes (Article 40 of the act). While the legislation limits these purposes to scrutinizing the systemic risks caused by the platforms, we believe there is much potential for social good through responsible research employing public user-generated data.

The European Commission recently asked for feedback on the implementation of researcher data access under the DSA. Drawing on a decade of big data research, our response argues for the benefits of researcher data access beyond studying systemic risks. The response is split into a short opinion text and direct responses to some of the questions posed by the Commission (find the guiding questions here). You can read our response below or via the feedback service. If you’re a researcher using or curious about data from online platforms, or just an interested citizen in Europe or elsewhere, you may give feedback until midnight on Wednesday, 31 May 2023.

Our feedback on the Digital Services Act

We, researchers of the Digital Geography Lab at the University of Helsinki, form an interdisciplinary research group that studies topics related to the presence, mobility, diversity and interactions of people. The main aim of our research is to reveal and understand the various impacts on societies and the environment that stem from human mobility and interactions. Research on these topics over the last decade, by us and others, has greatly benefitted from access to public user-generated data hosted by very large online platforms (VLOPs), especially user posts and metadata from social media platforms such as Twitter, Flickr and Instagram.

We welcome the increased researcher scrutiny on VLOPs and the research on societal risks posed by them, as enabled by the Digital Services Act. However, we stress the potential of user-generated big data for societal good through impactful research [1]. We urge the Commission to consider expanding well-grounded researcher access to public user-generated content hosted by VLOPs for purposes beyond understanding systemic risks.

User-generated big data is created in great volumes, and it is rich in information: the data may contain spatiotemporal metadata in addition to textual and visual content [2]. Public user-generated data has allowed us to research dynamic social phenomena and processes that are difficult to study with traditional data sources. Analyzing these data at an anonymized, aggregated level, and complementing them with insights from other data sources, has enabled societally relevant research on topics such as cross-border mobility [3], urban linguistic variation and segregation [4], access to services [5], and people-nature interactions in green areas [6, 7]. The value of such data is particularly evident for phenomena such as the multi-local and transnational lifestyles of people [8], diurnal and seasonal urban dynamics [5] and the influence of unexpected events like the global COVID-19 pandemic [9]. User-generated big data is often the most detailed and timely source for studying dynamic phenomena that are relevant for the development of sustainable, fair and healthy societies.

Data acquisition for our research has relied on Application Programming Interfaces (APIs) operated by the social media platforms. While these voluntary schemes have been very useful for research, they are also volatile, as access may be limited or terminated at any time by a unilateral decision of the platform owners. As an example, Twitter reformed its API access conditions in early 2023, leaving the future of the platform’s academic access programme, and of the research building on it, uncertain. Earlier, in 2014, Instagram changed the spatial accuracy of georeferenced content without informing the API users and later closed the API altogether [2]. Unreliable access to VLOP data hampers transparent and reproducible science based on user-generated big data. In addition, public discourse and policymaking fall short without knowledge from user-generated big data research.

We have argued for the potential of user-generated big data for societal good based on a decade of impactful research by our group and others. We strongly believe that responsible researcher access to public data hosted by VLOPs is crucial for achieving those benefits. As the Digital Services Act ensures that access only for limited purposes, it needs to be extended to purposes other than studying systemic risks [1].

Addressing the guiding questions

Ethical research and user-generated big data (2)

Addressing the Commission’s questions on responsible research and the rights of research subjects (2c, 2d), we consider these matters of utmost importance when conducting research using user-generated content. Research communities have developed best practices in data acquisition, analysis and reporting [10, 11] for ensuring responsible data handling at the researcher or research group level. We implement measures on the stored data, including minimizing information to the strictly necessary dimensions, pseudonymizing personal identifiers and storing the data only in secure databases. We also implement organizational measures, such as limiting access to the data to the strictly necessary persons and time periods.
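
To illustrate what minimization and pseudonymization of stored records can look like in practice, here is a minimal Python sketch. It is not the pipeline used by our group: the field names (`user_id`, `created_at`, `lat`, `lon`, `lang`), the helper functions and the choice of a keyed HMAC-SHA256 hash are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical field names; real platform payloads differ.
KEEP_FIELDS = ("created_at", "lat", "lon", "lang")


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a user identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same pseudonym, so records can
    still be grouped per user, but the original identifier cannot be
    recovered without the secret key.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(post: dict, secret_key: bytes) -> dict:
    """Keep only the fields needed for analysis and pseudonymize the author."""
    record = {field: post[field] for field in KEEP_FIELDS if field in post}
    record["user"] = pseudonymize(str(post["user_id"]), secret_key)
    return record


# Example with a made-up post; the key would be stored separately from the data.
secret = b"example-key-kept-outside-the-database"
raw_post = {
    "user_id": 12345,
    "username": "example_user",
    "text": "A walk in the park",
    "created_at": "2023-05-01T12:00:00Z",
    "lat": 60.17,
    "lon": 24.94,
    "lang": "en",
}
stored = minimize(raw_post, secret)
```

In this sketch the free text and the username are dropped before storage, and only the pseudonym, timestamp, coordinates and language are kept. Which dimensions are strictly necessary depends on the research question.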

Data formats (3a)

Addressing the Commission’s question 3a, an extensive and well-documented API with permissive rate limits that responds in standard formats, such as XML or JSON, is sufficient for many needs. An academic API, similar to the recently defunct solution by Twitter, would be a suitable technical approach for data access. Alternatively, or in addition, aggregated and transparently produced data products could be published by the VLOPs or appropriate governing bodies, such as Eurostat [1].

As stated in [1]: “we propose two strategical pathways for the future use of mobile Big Data for societal impact assessment, addressing access to both raw mobile Big Data as well as aggregated data products. Both pathways require careful considerations of privacy issues, harmonized and transparent methodologies, and attention to the representativeness, reliability and continuity of data.”
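
As a rough illustration of what a privacy-aware aggregated data product could look like, the Python sketch below counts distinct pseudonymized users per grid cell and suppresses cells with too few users. The function name, the 0.1-degree cell size and the minimum-user threshold are assumptions chosen for the example, not values taken from [1] or from any existing data product.

```python
import math
from collections import defaultdict


def aggregate_to_grid(points, cell_size_deg=0.1, min_users=5):
    """Count distinct users per grid cell, dropping sparsely populated cells.

    `points` is an iterable of (user, lat, lon) tuples, where `user` is
    already a pseudonym. Cells with fewer than `min_users` distinct users
    are suppressed so that small groups cannot be singled out.
    """
    users_per_cell = defaultdict(set)
    for user, lat, lon in points:
        # Integer grid indices; the cell size here is an illustrative choice.
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        users_per_cell[cell].add(user)
    return {
        cell: len(users)
        for cell, users in users_per_cell.items()
        if len(users) >= min_users
    }


# Made-up observations: two users share one cell, a third is alone in another.
observations = [
    ("a", 60.17, 24.94),
    ("b", 60.18, 24.95),
    ("a", 60.17, 24.94),
    ("c", 61.53, 23.77),
]
counts = aggregate_to_grid(observations, min_users=2)
```

With the threshold lowered to two users for this small example, only the shared cell is kept and the cell containing a single user is dropped. A published product would also need documented, harmonized choices for the grid, the time window and the threshold.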

References

[1] A. Poom et al., ‘COVID-19 is spatial: Ensuring that mobile Big Data is used for social good’, Big Data & Society, Jul. 2020, https://doi.org/10.1177/2053951720952088

[2] T. Toivonen et al., ‘Social media data for conservation science: A methodological overview’, Biological Conservation, May 2019, https://doi.org/10.1016/j.biocon.2019.01.023

[3] O. Järv, H. W. Aagesen, T. Väisänen, and S. Massinen, ‘Revealing mobilities of people to understand cross-border regions: insights from Luxembourg using social media data’, European Planning Studies, Aug. 2022, https://doi.org/10.1080/09654313.2022.2108312

[4] T. Väisänen, O. Järv, T. Toivonen, and T. Hiippala, ‘Mapping urban linguistic diversity with social media and population register data’, Computers, Environment and Urban Systems, Oct. 2022, https://doi.org/10.1016/j.compenvurbsys.2022.101857

[5] O. Järv, H. Tenkanen, M. Salonen, R. Ahas, and T. Toivonen, ‘Dynamic cities: Location-based accessibility modelling as a function of time’, Applied Geography, vol. 95, pp. 101–110, Jun. 2018, https://doi.org/10.1016/j.apgeog.2018.04.009

[6] V. Heikinheimo et al., ‘Understanding the use of urban green spaces from user-generated geographic information’, Landscape and Urban Planning, Sep. 2020, https://doi.org/10.1016/j.landurbplan.2020.103845

[7] H. Tenkanen et al., ‘Instagram, Flickr, or Twitter: Assessing the usability of social media data for visitor monitoring in protected areas’, Sci Rep, Dec. 2017, https://doi.org/10.1038/s41598-017-18007-4

[8] J. Raun et al., ‘New avenues for second home tourism research using big data: prospects and challenges’, Current Issues in Tourism, Mar. 2023, https://doi.org/10.1080/13683500.2022.2138282

[9] E. Willberg, O. Järv, T. Väisänen, and T. Toivonen, ‘Escaping from Cities during the COVID-19 Crisis: Using Mobile Phone Data to Trace Mobility in Finland’, IJGI, Feb. 2021, https://doi.org/10.3390/ijgi10020103

[10] E. Di Minin et al., ‘How to address data privacy concerns when using social media data in conservation science’, Conservation Biology, Apr. 2021, https://doi.org/10.1111/cobi.13708

[11] C. Sandbrook et al., ‘Principles for the socially responsible use of conservation monitoring technology and data’, Conservat Sci and Prac, May 2021, https://doi.org/10.1111/csp2.374

 

– – – – –

The Digital Geography Lab is an interdisciplinary research team focusing on spatial Big Data analytics for fair and sustainable societies at the University of Helsinki.