By Mikko Tolonen
The University of Helsinki has decided to update its research data infrastructure. This project is called Mildred, as you hopefully already know. Mildred includes five different sub-projects that have their emphases on storage, metadata creation, data management, the actual use of the research data, sharing of the data and interoperability. From an architectural perspective the project seems more or less straightforward. How this translates into practice, we are about to find out in the near future.
What were the reasons for starting this project? The apparent motivation from the University's side is that research data will become a very valuable asset in the future. It is in the interest of the University to provide the infrastructure that enables its researchers to store their data, work on it and disseminate it. It is just as important that other scholars, the general public and, for example, funding bodies can easily find the relevant research data and then put it to further use. It is a reasonable assumption that questions of intellectual property, data management and open science will be decisive factors for all the invested parties as we move towards a more dynamic research culture. We already have good examples of different infrastructures competing on the quality of their research data and its openness. I think we can all agree that this direction of openness, reusability and reproducibility is the way to advance science, and it is therefore evident that providing a functional research data infrastructure is in the general interest of scholars.
If this is the motivation at the University level, what about individual researchers? Why do we need Mildred? Can't we just rely on existing research data infrastructures? There are plenty of those to go around, are there not? Isn't Zenodo enough? What do we get out of this as University of Helsinki researchers?
The first observation that I would like to make is that when we discuss research data infrastructures on a general level, we need to remember that there is no such thing as a monolithic concept of a "researcher". As researchers we do not form a coherent whole with similar needs. We come from many different fields, in all shapes and sizes and with multiple research infrastructure needs. Therefore, it is evidently a good idea that individual research projects from different fields can have an impact on the kind of research data infrastructure that the university is building. That is also why we are after concrete use cases.
When looking at the question of research data infrastructure from the perspective of the humanities, for example, we can even say that we cannot really see how our research data infrastructure needs will be shaped in ten years' time. A plain fact is that much of the humanities research data is not currently in digital form. This is why we also need to focus, together with libraries and archives, on the digitization process, which might not be a concern for some other fields of science. Also, in one sense, in open science and in the sharing of research data, we in the humanities are perhaps behind the natural sciences.
At the same time, looking at the needs of the humanities, it is evident that some of our research data needs will follow those of science in general. As an example, the use of CSC computing services is easily adaptable to humanities needs. The Pouta cloud service functions quite well for our research group, for example, but the question of backing up the data (or enabling reproducibility) is, to be honest, not that clear to us at the moment. There are also particular hermeneutic questions that we deal with in the humanities that might mean that our data does not fit directly within the same scope as genome data; the level of detail can be different, and so forth. But these are not reasons not to be in the same boat as the natural sciences in the Mildred project. Mildred also enables us to see the similarities and recognize the differences, which leads to the adoption of best practices from other fields of science when possible.
At the same time, it should not be forgotten that Mildred goes hand in hand with different faculties and disciplines updating (or in some cases formulating) their own research data policies. This has recently been undertaken at the Faculty of Arts by Dean Hanna Snellman. Since most humanities data has not yet been digitized, it is only natural that there is currently no research data policy at the Faculty of Arts. All of these questions have come to us quite recently, although language technology, for example, with its large digital datasets, has been developing at the University of Helsinki for the past fifty years or so. But there should be a research data policy for the humanities, and there soon will be.
This gives us an opportunity to think about what the right kinds of infrastructures are for different types of humanities data. Most of the relevant infrastructures exist already (or are being developed); what is crucial is to use them in an efficient way. The idea is not that Mildred will replace the existing infrastructures. Mildred is not designed as one research data infrastructure to rule them all. But we need to deliberate what research data should be curated by the national archives, what by the National or University Library, what by the Language Bank of Finland, what by Mildred and so forth. Just as humanities scholars come in all shapes and sizes, so does our data (also in different forms, for example, as sand from Egypt). Forming a research data policy is therefore crucial for humanities scholars, and it will help us considerably with questions of open science in the future. Let me underline that, from my perspective, this is certainly a process where Mildred is useful.
The ethos of Mildred, as I have understood it, is to engage as many real-life projects (with real research data needs) as possible. This is also the obvious reason why these user groups have been formed. I know that the Mildred people have done their best to spread the word about the project and about the possibility of getting involved in shaping it, but unfortunately it is remarkably hard to get the word out at the University level, at least through Flamma. If you know someone who should be involved with Mildred, I would encourage you to get in touch with Ville Tenhunen or Eeva Nyrövaara. I am sure that they will do their best to accommodate that person's needs as well.
One part of Mildred (along with research data policies in different fields) is to negotiate the different roles of the various existing research data infrastructures, both national and international. Collaboration with ATT, the national data committee, other universities, repositories, libraries etc. is crucial. Luckily this is something that the project managers of Mildred are handling.
This includes integrating much of the global standard-setting work that seems to be going on everywhere with respect to research data: How should we refer to datasets? How should we use PIDs (persistent identifiers)? How should these identifiers be implemented for datasets (something that CSC and the British Library, in my experience, are very much concerned with)? Luckily, in Finland the Open Science and Research initiative is involved in these questions, and Mildred is directly engaged in these processes as well. It would be hard for individual researchers focusing on their own research to stay on top of all these developments in so many directions and on so many levels. This is also one place where the processes of the Mildred project can be beneficial for us as researchers: research data infrastructure should eventually guide us in the use of best practices with clear guidelines.
One thing that I do know about Mildred is that what we as researchers do not want is to build a large, hard-to-use, inflexible, one-size-fits-all system that is not modeled after any real use cases. At the same time, it is evident that we need storage space for different types and sizes of data.
Then there are of course a great many concrete questions that we should be thinking about quite hard, based on the knowledge that can be gained from different use cases:
What level of reproducibility are we aiming at? What about compatibility, version control, software citations and tool development with respect to our research data? How are these best supported by Mildred?
It seems that in all of these lingering general questions, the small details of an infrastructure are important. That is, it is important to build clear processes and functionalities that researchers are able to use and want to use. This aspect of service design, together with actual use cases, cannot be overlooked.
All of these are questions that can be answered in only one way: by implementing and testing the forthcoming Mildred infrastructure against concrete research cases where research questions rule. My wish is that some kind of "platform thinking" would come through in Mildred for the various fields. For me, Mildred at its best offers a platform for your research data needs so that you can better answer your current questions and take your scholarship to the next level. It is quite evident to me that this includes taking care of data management, storage, reproducibility, tool development and the dynamic use of research data across different borders.
The idea of the user groups is not that they should be mere lip service. Mildred itself ought to be seen as an open science project. Sure, there are hardware questions that need to be decided and then cannot be altered, but if we are able to implement the right kind of platform thinking, one that covers dynamic research, research collaboration beyond the University of Helsinki and questions of intellectual property, I am sure that this is a process worth participating in.
One important theme in the project is your experience of cases where things have gone wrong with your current research data and with the infrastructures that should support its use. I hope we get to concrete cases of implementing the structure of Mildred with respect to interoperability, dynamic data and so forth as soon as possible.