Meet Mildred 5: In the beginning, there was DMP

The starting point for Project Mildred was the need to improve the quality of research data management (see the University of Helsinki research data policy). Funders, for example the Academy of Finland and the EU, require proper data management. A data management plan (DMP) is the basis on which all Mildred services are built. And in turn, the Mildred services provide researchers with the tools to implement a DMP in practice.

Mari Elisa Kuusniemi aka MEK. Photo by Jussi Männistö.

Mildred’s Sub-project 5 focuses on data management planning. The tool designed for researchers’ data planning, DMPTuuli, connects all Mildred services.

“DMPTuuli will market and provide links to other Mildred services. For example, data management during the research process requires Mildred 2 and Mildred 4 services,” says Mildred 5’s Project Manager, Mari Elisa Kuusniemi, also known as MEK.

DMPTuuli has been in use since the autumn of 2016. Currently, Mildred 5 focuses on developing discipline-specific guidance for the use of DMPTuuli, in co-operation with researchers (see DMP hackathon for historians).

“The tool itself is easy to use and doesn’t really need development. User ratings have been positive (see DMPTuuli user survey data in Figshare). Our challenge is the content. Research data management is a new issue for many researchers, and a researcher must take into account how the data are managed, collected, described and organized, already during the research. Then they need to know what is required to publish the data, and how to preserve it. Research data management involves all kinds of processes and agreements. This isn’t an easy task, even though the tool is easy to use,” says MEK.

Research data management practices vary by discipline, and various organisations within the University of Helsinki also provide guidance. The problem is that this guidance is scattered; legal advice is located in one place, research funding advice in another. DMPTuuli tries to serve all research data management guidance in one place.

“We are about to begin co-operation with Ethics Committees in order to provide guidance for the management of sensitive data. This is important, because legislation is about to change (see General Data Protection Regulation, GDPR),” says MEK.

Pälvi Kaiponen. Photo by Jussi Männistö.

DMPTuuli has been implemented in close co-operation with the Academy of Finland. In the autumn of 2016, the Academy received 1000 applications from the University of Helsinki, 800 of which were made via DMPTuuli.

“The strength of Project Mildred lies in the fact that it has involved research funding organisations right from the beginning. And every time the funder is involved, the researcher’s interest is aroused. This enables us to market university services,” says Mildred 5’s Project Owner, Pälvi Kaiponen.

A year ago, Academy funding made DMP better known among researchers and also provided information on the development of DMPTuuli. For example, researchers wanted to be able to log in using their organisation IDs, a wish which has now been granted. Exemplary DMPs are also published.

The DMP is part of the transformation process of research culture, in which proper research data management plays a significant role.

“It’s important to increase researchers’ understanding of why a DMP is relevant. A DMP is not only for the Academy of Finland; it’s for the researcher him/herself to better manage data,” claims Pälvi Kaiponen.

“Once the DMP has been made, its significance is usually understood. More and more, researchers ask for help with the matter itself, that is, managing the data, instead of asking for help to meet the funder’s requirements. Eighty per cent of the researchers who participated in the DMP workshops in the spring of 2017 did so because of data management. The change has been pretty quick,” says MEK.

DMPTuuli is also suitable for teaching purposes, and the goal is that the tool will be used in teaching already at the bachelor level. This would foster an open data culture at the University of Helsinki. Research data management is one of the key points of open science.

“Teachers could use DMPTuuli in various courses; for example, in courses related to research methodology. Proper data management skills are also important in working life. You need to know where to save your files, you need to understand the importance of backing up, version control, and description. Such basic skills are needed in all academic professions,” says MEK.

Meet Mildred 4: Storage for big data researchers and safety for all

The technical infrastructure of the Mildred services is built at the IT Center. Whereas Mildred’s Sub-project 2 was primarily concerned with data storage, sharing and management, Mildred’s Sub-project 4 is mainly related to data storage (capacity) and backup.

Ville Tenhunen and Minna Harjuniemi in Viikki, Helsinki. Photo by Jussi Männistö.

The goal of Mildred 4 is to ensure that researchers can flexibly increase their data storage capacity and preserve their data in the event of potential technical problems. These data storage and backup services are important to all who use data storage services, but especially to researchers and research groups with high data volumes.

“We are building storage that suits big data researchers. Researchers with smaller data can use the services built by Mildred 2. Mildred 4 storage is useful for researchers who have hundreds of terabytes of data. The target group is clearly in data-intensive research,” explains Mildred 4’s Project Manager, Ville Tenhunen.

“The data storage built in Mildred 4 is perhaps related more to the data management during the research process. When the dataset is ready to be published, Mildred 2 and Mildred 3 services come into play,” says Mildred 4’s Project Owner Minna Harjuniemi.

Piloting is already complete: Ceph and GlusterFS have been tested for Mildred 4 data storage. The purchases should take place in the autumn of 2017.

The data backup service is basically now ready to be bought. Several Mildred services are being piloted by researchers this autumn, but the backup service does not need to be tested in the same way. Technical piloting is enough for an “invisible background service”.

“No piloting is needed for the backup service because it’s a well-established standard service,” says Tenhunen.

According to Minna Harjuniemi and Ville Tenhunen, the Mildred 2 and 4 process has revealed no real surprises as regards content issues.

“We still feel we’re on the right track and that we’re doing the right things for both society and the planet. It’s a good idea to get involved in data sharing, and it’s a good idea to promote it and provide researchers with tools for it,” says Minna Harjuniemi.

Meet Mildred 2: The data fridge for data chefs

An essential part of Project Mildred is building technical infrastructure for data services. This is carried out in the Mildred’s Sub-projects 2 and 4, under the co-ordination of the IT Center. Mildred 2 is responsible for building the Data Repository Service, which will help researchers manage, share, and store research data. The service looks like a website, but researchers can also use it via their own applications and file systems.

Ville Tenhunen and Minna Harjuniemi in Viikki, Helsinki. Photo by Jussi Männistö.

The repository service is divided into two parts: the implementation of EUDAT (European Data Infrastructure) tools and the building of our own repository. The EUDAT tools are for storing and sharing data and collaborating on it during the research process, as well as for publishing and describing finished datasets. The University of Helsinki’s own repository provides tools for researchers whose data are not suitable for a cloud service like EUDAT’s.

“The EUDAT pilots began in the autumn of 2017, and the tools are to be implemented this year. The university’s own repository service is to be technically tested, and it will be ready, in some form, by the end of the year,” promises Mildred 2’s Project Manager, Ville Tenhunen.

Already now, researchers have access to a variety of data services outside the University of Helsinki. Project Mildred aims to exploit existing services and link them to the research process in the best possible way.

“We don’t need to step on other players’ toes here. In some disciplines, it’s clear that research data will be stored in an international, rather than a national or a local data repository. On a national level, CSC (IT Center for Science) provides long-term storage, but we are heading in a different direction. Of course, we understand that some valuable research data require long-term preservation, but we don’t want to enter too deep into the problematics of this,” says Mildred 2’s Project Owner, Minna Harjuniemi.

When we talk about data, we must not forget metadata. Metadata and data are closely integrated and cannot be separated, especially in data management, which covers the entire life cycle of research data. Project Mildred deals with metadata in both Mildred 2 and Mildred 3. However, Mildred 2 is primarily about data.

“Mildred 3 introduces data story-telling [based on metadata], which is now a red-hot topic. Mildred 2 only makes the data available, and we’ll ensure that the data for data stories are at hand when needed,” explains Ville Tenhunen.

“You can compare our work in Mildred 2 to cooking. We give you the kitchen and the ingredients. It’s someone else’s task to prepare the food according to the recipe,” says Minna Harjuniemi. Ville Tenhunen continues:

“Mildred 2 is like a fridge, into which researchers put their carrots. Then the chef comes in and does something with them. We only supply a fridge. But there are many kinds of fridge. Fridges have different degrees of coldness, for example, and we can say ‘don’t put your fish in that one, put it in this one.’”

The data fridge, with its related data services, is created specifically for the needs of researchers at the University of Helsinki. At present, research data are stored everywhere; here and there, in the most diverse places (see Data Repository Survey).

“Until now, files of a certain size have been difficult for us. For example, 1–2 terabytes of data are too large to fit into existing systems at a reasonable cost, and too small for a more extensive repository system. A lot of data fall in between. Researchers with 300–400 terabytes of data are easier to handle, because they clearly need special solutions, and they have the money and expertise. Also, a small amount of data, such as 20 gigabytes, easily fits into Wiki, or almost anywhere, in fact,” says Tenhunen.
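The size gap Tenhunen describes can be illustrated with a small sketch. The tier names and the two cut-off values below are hypothetical, chosen only so that the figures from the quote (20 gigabytes, 1–2 terabytes, 300–400 terabytes) land in the groups he mentions; they are not thresholds used by Project Mildred.

```python
from enum import Enum

GB_PER_TB = 1000  # decimal units; an assumption made for this sketch


class SizeTier(Enum):
    SMALL = "fits into existing systems"
    IN_BETWEEN = "too large for existing systems, too small for a big repository"
    LARGE = "clearly needs a special solution"


# Hypothetical cut-offs, not taken from the project itself.
SMALL_MAX_GB = 1 * GB_PER_TB      # below 1 TB counts as small here
LARGE_MIN_GB = 100 * GB_PER_TB    # "hundreds of terabytes" starts here


def classify(size_gb: float) -> SizeTier:
    """Return the illustrative tier for a dataset of the given size in gigabytes."""
    if size_gb < 0:
        raise ValueError("size must not be negative")
    if size_gb < SMALL_MAX_GB:
        return SizeTier.SMALL
    if size_gb < LARGE_MIN_GB:
        return SizeTier.IN_BETWEEN
    return SizeTier.LARGE


if __name__ == "__main__":
    for size in (20, 1500, 350 * GB_PER_TB):
        print(f"{size} GB: {classify(size).value}")
```

Under these assumed cut-offs, the 1–2 terabyte datasets fall into the middle tier, which is the group the data fridge is meant to serve.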

Ville Tenhunen, Minna Harjuniemi and Jussi Männistö. Photo by Juuso Ala-Kyyny.

Even when the repository services are technically finished and released, the work is not yet complete. Project Mildred represents a new kind of service thinking, in which the service provider and the customer work together to develop the service.

“Social interaction isn’t achieved by an administrational organization, such as the IT Center or the Helsinki University Library, producing a lot of material. We can’t say ‘Here’s the sandbox and some nice rules for you, off you go and play’. This won’t lead to social connectedness in 2017. Users have to be involved, and the service provider has to be part of the community. Communication and feedback channels can be used to respond to situations and to share information with users. The world is changing fast, and services need to change too. Competitive advantage is based on our ability to respond to these changing needs,” says Ville Tenhunen.

Meet Mildred 1: Data support from a one-stop shop

The goal of Mildred’s Sub-project 1 (there are four others) is to make data services easily accessible to researchers on the ThinkOpen website. This happens in two ways: by gathering the University of Helsinki’s existing data services onto one online service channel, and by designing self-service functions for researchers.

Eeva Nyrövaara and Aija Kaitera. Photo by Jussi Männistö.

The service concept is more or less the same as that in Book Navigator: the services currently provided by several service providers are all available in a one-stop shop. Aggregation is sorely needed, as researchers are unaware of the university data services available to them.

“The idea is to bring together services such as storage and data publishing, to make it easier for a researcher to find them. At the same time they can get the required service for themselves,” says Mildred 1’s Project Manager, Aija Kaitera.

As well as single services and service packages (e.g. Storing confidential data, Sharing data), the researcher can use the search engine or the guidance wizard for service searching. The wizard also teaches the user basic data management. Aija Kaitera believes that typical service needs are related to data protection, data publishing and long-term storage.

“The question of personal data is of interest to many researchers, and we have to help them manage personal data properly. This service may include expert consultancy and technical solutions related to security; for example, specific storage,” Kaitera explains.

Now it is clearer what the university can offer and how these services are offered to researchers. The university’s data services were described and listed during the summer, and the online service channel will be introduced in the autumn of 2017. The website is ready to be tested, and researchers have been invited to take part in the piloting.

“This autumn, we have to decide how to maintain the services. How can we ensure that they are kept up to date? And how can we add new services to what we already have?” asks Mildred 1’s Project Owner, Eeva Nyrövaara.

Project Mildred is a unique venture, because it can cut across organizational boundaries. The Data Support team, part of Mildred’s Sub-project 1, illustrates this: via Data Support, researchers can access data management specialists in the Helsinki University Library, IT Services, Central Archives, Research Affairs, Personnel Services, and Legal Affairs. What is important is that the researcher does not need to know anything about the organizational structure behind Data Support.

“Our challenge is to present services with different backgrounds in a coherent, comprehensible way to the researcher. The services must be understandable from the perspective of the research process,” stresses Aija Kaitera.

Jussi Männistö, Eeva Nyrövaara and Aija Kaitera. Photo by Juuso Ala-Kyyny.

The first phase involves only the services provided by the University of Helsinki. But co-operation with different service providers is under negotiation.

“We aim to provide services from external service providers as well; for example, a service request to CSC (IT Center for Science) could be sent through our online service channel,” says Kaitera.

In the future, the online service channel may include automated self-service functions (e.g. acquiring disk space) and various personalization features (e.g. a shopping cart, suggestions based on the discipline).

“We can’t quite get there this year, but these features would offer researchers really good added value,” claims Kaitera.

Embedding and marketing the online service channel among researchers is a major challenge, and support from researchers is welcome.

“Of course, it’s not enough for the online service channel to be on ThinkOpen. Researchers won’t find it. The data service channel must be integrated into project guidance, and everywhere else where the data is mentioned,” Aija Kaitera emphasizes.

Meet the Mildred Five!

“The University of Helsinki provides researchers and research groups with a research data infrastructure that includes tools and services for supporting the management, use, findability and sharing of data as well as with the capacity for storage, preservation, computing and processing. This data infrastructure is built and developed together with national and international parties, taking into account the services and infrastructures that they offer.”

Above is the fifth point in the University of Helsinki research data policy, which was confirmed in February 2015. This research data policy was also the starting point for Project Mildred, which was launched in the spring of 2016.

The goal of Project Mildred is to implement the research data policy in practice. This is accomplished by providing researchers with the tools to carry out the proper data management required by the University of Helsinki (see other points in research data policy) and by research funding institutions, for example the Academy of Finland.

As we know, Project Mildred is divided into five sub-projects:

  1. Digitalization of Research Data Services Delivery
  2. Data Repository Service
  3. Data Publishing and Metadata Service
  4. Data Storage and Backup
  5. Implementation of Data Management Planning Tool – Tuuli

The first phase of Project Mildred will end by the end of 2017. Over the next week, we will present each sub-project and its current situation in this blog. So follow the blog posts and tweets (#MildredFive), and meet the Mildred Five!

In the video above: The photo session for Project Mildred in Viikki, Helsinki. Ville Tenhunen (project manager) and Minna Harjuniemi (project owner) work in Mildred sub-projects 2 and 4. Photographer Jussi Männistö works in the Helsinki University Library. Photos will be published next week!

It is time to pilot Mildred services – volunteer researchers are being sought!

In Project Mildred, we have developed data management services for researchers. The work has been done in five Mildred sub-projects for over a year. Now most of the services are ready to be tested. We invite researchers to join any of the following pilot schemes:

Searching the data service (Mildred 1)

What will be tested? The website that helps a researcher to find the most suitable data service. This is an online service channel, where all the University of Helsinki's data-related services can be found.
When? In August 2017. The web service provider will be chosen in the first week of August, and researchers are expected to be involved in the development of the service at an early stage.
More information & enrollment: Aija Kaitera (aija.kaitera@helsinki.fi), Helsinki University Library.
Important! The online service channel is intended for users of all levels but especially for researchers who have little experience with data issues. In this respect, data newbies are most welcome to help develop the service.
Good to know: The website developed here is expected to be published in the autumn of 2017.

Sharing and storing the data (Mildred 2)

What will be tested? The repository services designed for sharing and storing the data produced by researchers. The services make use of ready-made data tools developed by EUDAT, and the following services are introduced in the pilot: (1) B2DROP, a Dropbox-like service for data sharing and collaboration, (2) B2SHARE for data sharing and longer-term storage, and (3) B2SAFE for storage. Researchers use these services in their actual research. They are expected to give feedback on the functionality of the services as well as on any further needs regarding their use.
When? Researchers start with B2DROP, and the pilot of this service begins in August. The B2SHARE pilot will start in September. The B2SAFE pilot takes place by the end of the year, and the schedule will be announced later. The pilots will continue until the end of the year unless otherwise agreed.
More information & enrollment: Kimmo Koskinen (kimmo.koskinen@helsinki.fi), Helsinki University Library and Ville Tenhunen (ville.tenhunen@helsinki.fi), IT Center.
Important! Participation in the pilot is risk-free for researchers: firstly, the data management tools in the pilot are largely based on existing, finished products, and secondly, EUDAT services remain available to researchers after the pilot. During the pilot phase, the use of the services is free of charge for researchers. The distribution of the post-pilot costs will be clarified in the autumn. Furthermore, it is important to know that piloting involves only non-sensitive research data at this point. Solutions for sensitive data (e.g. personal data) will be run at a later date.
Good to know: The data handled in B2DROP can be transferred to B2SHARE, in which case the data will be complemented with metadata and other information required for permanent preservation. The technical realization of the services is mainly carried out by EUDAT, but the design and part of the technology are adjusted by the University of Helsinki. The service operator is IT Center for Science (CSC).

Searching and publishing the data (Mildred 3)

What will be tested? The user interface of the research data search service. In this service the user can search for useful data sets. The pilot is intended to provide information about the functionality and browsing options of the search service. The data publishing service will also reach the pilot phase this year. That service is designed for opening the data produced by the researcher.
When? The prototype of the search service is tested in August/September 2017. A more sophisticated version of the search service will be opened to the public by the end of the year. The research data publishing service enters the pilot phase during the autumn and is reported separately. The pilot for the data publishing service will probably take place in October/November. A more accurate schedule will be announced later.
More information: Researchers interested in testing the search service can contact Pauli Assinen (pauli.assinen@helsinki.fi). Service provider Digitalist is responsible for implementing the pilot and recruiting.
Good to know: In the beginning, the Mildred search service crawls data from Zenodo (CERN), B2SHARE (EUDAT) and Etsin (ATT). Other sources for research data are being surveyed and the supply will expand later this year.

* * *

In addition to the pilots above, the storage and backup service for researchers (Mildred 4) will be piloted in the autumn of 2017. The pilot plan and schedule will be refined later. The data management planning tool Tuuli (Mildred 5) is already in use, and researchers can use it in drafting research plans. Development of discipline-specific Tuuli guidance will continue, and researchers are involved in this work (see DMPTuuli hackathon for historians).

Historians improving Tuuli guidance – 15 notes from the hackathon discussions

What is research data for a historian? How can opening and sharing the data be promoted? And above all, how can guidance for data management be improved so that historians benefit from it in the best possible way?

These questions and many others were discussed by historians from the University of Helsinki in the Helsinki University Library (Kaisa House) on Monday, June 19. The aim of the hackathon event organized by the Helsinki University Library (HULib) was to develop guidance and instructions for the data management planning tool Tuuli (DMPTuuli) (see RDM guidance in the Mildred M5 project) for historians. The library has hosted a similar workshop for researchers in the fields of arts, design and architecture.

The three-hour search for better guidance proceeded step by step, following the structure of DMPTuuli. Hence, the researchers shared their views on (1) data description, (2) data documentation, (3) data storing, (4) ethical and legal issues, and (5) data sharing and long-term preserving (see also Research Guide by HU).

The hackathon was carried out in the form of group work, and the historians raised a number of concrete issues and general ideas about DMPTuuli guidance. As Susanna Nykyri from HULib summarized, historians’ remarks challenged and encouraged the library to further develop the guidance. In this blog post we have put together some of the key findings of the intensive Monday session. So, here are 15 preliminary notes from the hackathon – you are free to share further ideas in the comment section!

The hackathon in action. Photos by Jussi Männistö.

General remarks

(1) The data management model and terminology used in DMPTuuli appear somewhat distant to historians. Researcher Lauri Hirvonen assumes that the tool may be more suitable for social and natural scientists than for historians accustomed to archive materials. “DMPTuuli seems to be most applicable to research projects making use of distinctive and limited data sets that are collected for analysis. A historian may not be able to produce this kind of specific data set, in which case some elements in DMPTuuli may seem rather far-fetched,” Hirvonen says.

(2) However, DMPTuuli as well as the hackathon event have encouraged historians to reflect on their documentation practices. Researcher Maiju Wuokko: “We are forced to think about data management now because funding institutions demand it, but at the same time this brings out many crucial principles and practices in the discipline. So at its best, data management planning can be much more than a compulsory obligation in the funding process!”

(3) The hackathon clarified the overall process of managing the data, says Tuula Pääkkönen, who works as an information systems specialist at the National Library of Finland. Pääkkönen: “All the questions in DMPTuuli highlight the issues, and even though they can seem tricky at first glance, it is good to bring them into the open. One challenge is how deep into the data management plan topics one should go. The recommendation for the DMP is 1–2 pages, so finding a level of detail that suits oneself but is useful for others is one consideration.” Pääkkönen participated in the hackathon to seek ideas for the National Library’s Digi project.

(4) DMPTuuli sample answers tailored for historians are at the top of researchers’ wish list. Maiju Wuokko believes that concrete sample answers would have a positive impact on the internal debate in the field of history. But then, sample answers might control and direct the data management planning too much. However, the need for discipline-specific workshops in developing the guidance and strengthening the practices is obvious. The ultimate goal is to have guidance that follows the research process, as Professor Anu Lahtinen from the Department of Philosophy, History, Culture and Art Studies says.

Discussing data documentation.

Data description

(5) The data produced by historians was divided into three types: the source material (e.g. archive materials as such), the data produced by the researcher (e.g. notes on archive materials), and the project management data (e.g. research permits). This division was considered relevant in principle. A researcher is solely responsible for the data produced by herself/himself.

(6) Due to the nature of the research, the data produced by a historian often consists of notes. Maiju Wuokko points out that this is a good thing to keep in mind when communicating with historians: “It might be important to emphasize that the most urgent thing for us is to keep our own notes safe and to think about further use of the notes. Archives take care of the source material for us.”

Data documentation

(7) The core idea of the documentation is to ensure that the data collected by the researcher is also understandable for other researchers. As illustrated by Tuula Pääkkönen, the documentation makes sure that your data is understood – by fellow researchers during your holiday and by yourself after the holiday. In the field of history, where the research process is often an individual performance, this can be challenging because the documentation for others is not a self-evident practice. And on the other hand, historians might feel that inclusive documentation takes too much time from research.

(8) In the field of history, the documentation is often included in the research itself. In this respect, Lauri Hirvonen considers the documentation in the DMP to overlap slightly with the research plan and research questions. Hirvonen: “The actual publication should disclose all of the most relevant data in the form of references and data catalogs.”

(9) If the data produced by historians is mainly notes, the quality and integrity of the data can be ensured, according to Lauri Hirvonen, by readable notes that are secured by backups and reviewed with source-critical methods. Hirvonen: “In practice, this is accomplished by meticulous referencing and by reviewing the data comprehensively in the introduction. Through the precise list of references and data sources, any researcher should be able to check the accuracy of the research.”

Developing guidance for data storing.

Data storing

(10) The practices of data storing vary from historian to historian. Some researchers have stored data in the Finnish Social Science Data Archive, others use a commercial cloud service. When making use of external cloud services, problems may arise if the service is not supported by the University of Helsinki. In addition to archives and cloud services, historians have been using hard disk drives, external hard disk drives, and network drives to store data.

Ethical and legal issues

(11) The most complicated ethical and legal issues are usually related to large data sets. Tuula Pääkkönen: “In this data, there may be all kinds of special unique cases. One solution is to use old-enough material to minimize the risk.”

(12) The anonymization of the data is sometimes demanding because a person can be identified in many ways, even when the name is removed. And on the other hand, if the anonymization goes too far, the data may lose its value and relevance for other researchers. Ethical evaluation included in the DMP process is considered particularly useful when there is uncertainty about the sensitivity of the data.

Sharing views on ethical issues related to data.

Data sharing and long-term preserving

(13) According to Lauri Hirvonen, sharing the data among fellow researchers may sound somewhat irrelevant to some historians. This is because of the nature of the data, explains Hirvonen: “The research in the field of history usually doesn’t generate distinctive data sets that are analyzed by other researchers. The data produced is often notes based on a variety of historical records (e.g. documents, manuscripts, maps), and merely sharing those notes as open data is problematic. Of course, historians may also produce valuable qualitative or quantitative data, for example in the field of economic history, and this data can be shared as a distinctive data set.”

(14) At what stage can the data be published? A historian may be ready to share the data only when it is used up, that is, when a certain number of articles based on the data have been written. The historians in the hackathon also mentioned the possibility of sharing part of the data, while the rest would be left for the researcher until the completed data is opened. However, there are also other reasons why data is only partially shared. For example, the same data may have a wide variety of contents that must be shared or preserved in different ways.

(15) Sharing and preserving the data produced by historians may generate costs in at least two ways: the costs may arise from the storing itself (e.g. storage capacity) or from the production of metadata required for sharing and preserving. Additionally, costs may arise when increasing the usability of the data (e.g. data visualizations). The question worth considering is when the extra documentation (for sharing and preservation purposes) requires extra funding.

Text by Juuso Ala-Kyyny, photos by Jussi Männistö. Remarks in this blog post are based on the notes by Juuso Ala-Kyyny (HULib) and Markku Roinila (HULib), and on the email interviews with historian Lauri Hirvonen, historian Maiju Wuokko and information systems specialist Tuula Pääkkönen. The hackathon event was organized by Helsinki University Library – Mari Elisa Kuusniemi, Susanna Nykyri, Jari Friman, Tuija Korhonen, Ville Träff, Mikko Ojanen and Dolf Assmann.

One One Two Day

Today is the national One One Two Day, 112 (in Finnish and Swedish). Today several government officials wish to direct attention to public safety and the proper use of the emergency phone number 112. The aim of the day is to make Finland the safest country in Europe. How can accidents be prevented? How can everyday life be made safer and more secure? What kind of infrastructure promotes safety? How can help be found when needed?

Today is also the second birthday of the University of Helsinki Research Data Policy. Two years ago, on February 11, 2015, Rector Jukka Kola signed the Research Data Policy. The aim of the Research Data Policy is to define high-level principles regarding the collection, storing, use, and management of research data. The ultimate goal is to make researchers’ everyday life and core business, that is, research, easier and – in a sense – safer.

Although the policy paper itself does not make anybody’s life easier or safer, the policy requires the university to take certain actions in order to safeguard professional research data management throughout the research life cycle. From a researcher’s point of view the key questions are: How to find help when needed? What kind of infrastructure supports my research?

During the implementation phase of the Research Data Policy, the University of Helsinki has answered these questions by, first, setting up the Datasupport network and, second, launching Project MILDRED. The Datasupport network assists researchers in the management of research data. Researchers are offered tools, services, and training regarding the management, use, access to, and sharing of research data. Via Datasupport, researchers can reach all data management specialists at the University. The specialists will help researchers in data-related emergencies. As such, the Datasupport email researchdata@helsinki.fi is a kind of 112 number for researchers.

But as with the One One Two Day, the focus of the Datasupport network is not on emergency cases, but on how to avoid them. How can research be planned to be safer and more secure data-wise? The Datasupport network’s aim is to prevent emergencies and help researchers plan the collection, storing, use, and management of research data in advance. Project MILDRED, on the other hand, builds a safe and well-planned infrastructure on which Datasupport and researchers can rely to support both data-intensive research and research based on so-called long-tail data. Infrastructure is the basis of a well-functioning society and of research.

What is there for me in Mildred?

By Mikko Tolonen

The University of Helsinki has decided to update its research data infrastructure. This project is called Mildred – as hopefully you already know. Mildred includes five sub-projects that focus on storage, metadata creation, data management, the actual use of the research data, sharing of the data and interoperability. From an architectural perspective the project seems more or less straightforward. How this translates into practice, we are about to find out in the near future.

What were the reasons for starting this project? The apparent motivation from the University’s side is the fact that research data will become a very viable asset in the future. It is in the interest of the University to provide the infrastructure that enables its researchers to store their data, work on it and disseminate it. It is just as important that other scholars, the general public and, for example, funding bodies can easily find the relevant research data and then put it to further use. It is a reasonable assumption that questions of intellectual property, data management and open science will be decisive factors for all the invested parties in the future as we move towards a more dynamic research culture. We already have good examples of different infrastructures competing on the quality of their research data and its openness. I think we can all agree that this direction of openness, reusability and reproducibility is the way to advance science, and it is therefore evident that providing a functional research data infrastructure is in the general interest of scholars.

If this is the motivation at the University level, what about individual researchers? Why do we need Mildred? Can’t we just rely on existing research data infrastructures? There are plenty of those to go around, aren’t there? Isn’t Zenodo enough? What do we get out of this as University of Helsinki researchers?

The first observation that I would like to make is that when we are discussing research data infrastructures on a general level, we need to remember that there is no such thing as a monolithic concept of a “researcher”. As researchers we do not form a coherent whole with similar needs. We come from various fields, in all shapes and sizes and with multiple research infrastructure needs. Therefore, it is evidently a good idea that individual research projects from different fields can have an impact on the kind of research data infrastructure that the university is building. That is also why we are after concrete user cases.

When looking at the question of research data infrastructure from the perspective of the humanities, for example, we can even say that we cannot really see how our research data infrastructure needs will be shaped in ten years’ time. A plain fact is that much of the humanities research data is not currently in digital form. This is why we also need to focus on the digitization process together with libraries and archives, which might not be a concern for some other fields of science. Also, in one sense, in open science and in the sharing of research data, we in the humanities are perhaps behind the natural sciences.

At the same time, looking at the question of the needs of the humanities, it is evident that some of our research data needs will follow those of science in general. As an example, the use of CSC computing services is easily adaptable to humanities needs. The Pouta cloud service, for example, functions quite well for our research group, but the question of backing up the data (or enabling reproducibility) is not that clear to us at the moment, to be honest. However, there are also particular hermeneutic questions that we deal with in the humanities, which might mean that our data does not fit directly within the same scope as genome data; the level of detail can be different, and so forth. But these are not reasons not to be in the same boat as the natural sciences in the Mildred project. What Mildred also enables us to do is to see the similarities and recognize the differences, leading to the adoption of best practices from other fields of science when possible.

At the same time, what should not be forgotten is that what goes hand in hand with Mildred is different faculties and disciplines updating – or in some cases formulating – their own research data policies. This has recently been undertaken at the Faculty of Arts by Dean Hanna Snellman. Since most humanities data has not yet been digitized, it is only natural that there is currently no research data policy at the Faculty of Arts – all of these questions have come to us quite recently, although language technology, for example, with its large digital datasets has been developing at the University of Helsinki for the past fifty years or so. But there should be a research data policy for the humanities and there soon will be.

This gives us an opportunity to think about what the right kinds of infrastructures are for different types of humanities data. Most of the relevant infrastructures exist already (or are being developed) – what is crucial is to use them in an efficient way. The idea is not that Mildred will replace the existing infrastructures – Mildred is not designed as one research data infrastructure to rule them all – but we need to deliberate what research data should be curated by the national archives, what by the National or University Library, what by the Language Bank of Finland, what by Mildred and so forth. As humanities scholars come in all shapes and sizes, so does our data (also in different forms, for example, as sand from Egypt). Forming a research data policy is therefore crucial for humanities scholars, and it will help us greatly with questions of open science in the future. Let me underline that, from my perspective, this is certainly a process where Mildred is useful.

The ethos of Mildred, as I have understood it, is to engage as many real-life projects (with real research data needs) as possible. This is also the obvious reason why these user groups have been formed. I know that the Mildred people have done their best to spread the word about the project and that it is possible to get involved in shaping it, but unfortunately it is remarkably hard to get the word through at the University level, at least through Flamma. If you know someone who should be involved with Mildred, I would encourage you to get in touch with Ville Tenhunen or Eeva Nyrövaara, and I am sure that they will do their best to accommodate their needs as well.

One part of Mildred is (along with research data policies in different fields) to negotiate the roles of the various existing research data infrastructures (both national and international). Collaboration with ATT, the national data committee, other universities, repositories, libraries etc. is crucial. Luckily this is something that the project managers of Mildred are handling.

This includes the integration of many of the global standard-setting processes that seem to be going on everywhere with respect to research data: How to refer to the data sets? How to use PIDs? How to implement these persistent identifiers for datasets (something that CSC and the British Library, in my experience, are very much concerned with)? Luckily, in Finland the Open Science and Research initiative is involved in these questions and Mildred is directly engaged in these processes as well. It would be hard for individual researchers focusing on their own research to stay on top of all these developments in many different directions and at many levels. This is also one place where the processes of the Mildred project can be beneficial for us as researchers – research data infrastructure should eventually guide us in the use of best practices with clear guidelines.

One thing that I do know about Mildred is that we, as researchers, do not want to build a large, hard-to-use, inflexible, one-size-fits-all system that is not modeled after any real user cases. At the same time, it is evident that we need storage space for different types and sizes of data.

Then there are, of course, a great many concrete questions that we should be thinking hard about, based on the knowledge that can be gained from different use cases:

What level of reproducibility are we aiming at? Compatibility, version control, software citations, tool development with respect to our research data? How are these best supported by Mildred?

It seems that in all of these lingering general questions, small details with respect to an infrastructure are important. That is, it is important to build clear processes and functionalities that researchers are able to – and want to – use. This aspect of service design together with actual user cases cannot be overlooked.

All of these are questions that can be answered only through one particular route: implementing and testing the forthcoming Mildred infrastructure against concrete research cases where research questions rule. My wish is that some kind of “platform thinking” would come through in Mildred for the various fields. For me, at best Mildred offers a platform for your research data needs so that you can better answer your current questions and take your scholarship to the next level. It is quite evident to me that this includes taking care of data management, storage, reproducibility, tool development and the dynamic use of research data across different borders.

The user groups are not meant to be lip service. Mildred itself ought to be seen as an open science project. Sure, there are hardware questions that need to be decided and then cannot be altered, but if we are able to implement the right kind of platform thinking, one that includes the aspects of dynamic research, research collaboration beyond the University of Helsinki and questions of intellectual property, I am sure that this is a process worth participating in.

One important theme in the project is experiences where things have gone wrong with respect to your current research data and the infrastructures that should support its use. I hope we get to concrete cases of implementing the structure of Mildred with respect to interoperability, dynamic data and so forth as soon as possible.