When building a system that enables research data management, sharing and preservation, focusing on the data alone is not enough. Metadata are a crucial part and must be taken into account. Metadata are like a product data sheet: they tell you what the data are. They make research data comprehensible, discoverable and usable. Basically, without metadata you know nothing about the data.
Mildred’s Sub-project 3 is primarily about metadata. Its goals are to make the research data published by University of Helsinki researchers easy to find and widely used. A data publishing and search service is being developed to achieve these goals. We have taken the researcher’s perspective into account: the data publishing process should be as smooth and enticing to a researcher as possible.
“A place to publish your data, that’s the starting point. Then you get instructions on data storage, metadata and licensing issues. And when the requirements are met, you can add texts and visualizations to the workflow. Finally, you get a personal web page for each dataset,” Mildred 3’s Project Manager, Pauli Assinen, describes the process.
University of Helsinki researchers’ data may be located in different repositories. This will also be the case in the future. The point of Mildred 3 is to gather the information about university research data – that is, metadata – into one place, the ThinkOpen website. The metadata are harvested from various sources with tools developed in the ATTX project (see also ATTX’s GitHub.io pages).
“Using the ATTX tools, we harvest data from three data sources (Etsin, Zenodo, B2Share/EUDAT). This is the base. Then, it’s easy to add any other data source that has a compatible interface. Once data have been gathered into the internal database of ATTX, we can enrich them by utilizing OpenAIRE services, for example,” says Pauli Assinen.
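The harvesting idea described above can be illustrated with a minimal sketch. It is not the ATTX implementation: the `Record` class, the `harvest_all` function, the stand-in harvester functions and the sample identifiers are all hypothetical, and real sources would be queried over their own interfaces. The sketch only shows how records from several sources might be collected into one internal store, keyed by a normalised identifier, so that adding a new source means adding one more harvester with the same interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Record:
    """One dataset description in the internal store (hypothetical shape)."""
    identifier: str
    title: str
    sources: List[str] = field(default_factory=list)


# Assumption: a harvester is any callable that returns raw records as dicts
# with at least "identifier" and "title" keys.
Harvester = Callable[[], Iterable[dict]]


def harvest_all(harvesters: Dict[str, Harvester]) -> Dict[str, Record]:
    """Collect records from every source into one store keyed by identifier."""
    store: Dict[str, Record] = {}
    for source_name, harvest in harvesters.items():
        for raw in harvest():
            # Normalise the identifier so the same dataset found in two
            # sources ends up as a single record.
            key = raw["identifier"].strip().lower()
            record = store.get(key)
            if record is None:
                record = Record(identifier=key, title=raw["title"])
                store[key] = record
            record.sources.append(source_name)
    return store


# Stand-in harvesters with made-up data; real ones would call each service.
def harvest_etsin() -> List[dict]:
    return [{"identifier": "doi:10.1234/example-a", "title": "Lake samples"}]


def harvest_zenodo() -> List[dict]:
    return [
        {"identifier": "DOI:10.1234/EXAMPLE-A", "title": "Lake samples"},
        {"identifier": "doi:10.1234/example-b", "title": "Survey responses"},
    ]


store = harvest_all({"etsin": harvest_etsin, "zenodo": harvest_zenodo})
for record in store.values():
    print(record.identifier, record.sources)
```

In this sketch a third source such as B2Share would be supported by passing one more entry in the dictionary; enrichment steps would then operate on the merged store rather than on each source separately.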
Metadata are the core of data usability, but they also raise a lot of questions. Who describes the data (a researcher or a librarian)? Can we automate the data description, and to what extent? What kind of concept is the data description based on – what language do your data speak (see metadata standards)?
“Books and articles have standards for bibliographical information; a commonly shared idea about metadata that is implemented in practice. Research data have different metadata routines and formats, and we lack a commonly shared understanding. This is a challenge: when metadata are harvested from different sources, how do we validate and combine them?” asks Assinen.
However, complete metadata is an ideal that is not worth pursuing, because it weakens usability from the researcher’s point of view.
“Often, when library people design metadata, they want to describe everything possible. But for a researcher, we should find the smallest common denominator, the minimum metadata that are interoperable between different systems. And this is a big challenge,” claims Mildred 3’s Project Owner, Pälvi Kaiponen.
“We have to balance how much effort researchers can put into producing sufficient metadata with how much metadata is enough for repositories to do their job properly. The library is responsible for the metadata production in publications – should libraries also be responsible for the metadata of datasets?” asks Assinen.
Various projects have searched for this balance (TTA – The National Research Data Initiative and Tietomalliselvitystyöryhmä), as have different metadata schemes (especially DataCite).
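To make the idea of "minimum metadata" concrete, here is a small sketch of a completeness check. The required fields are loosely modelled on the mandatory properties of the DataCite schema (identifier, creator, title, publisher, publication year, resource type); the field names, the `missing_fields` function and the sample record are simplified assumptions for illustration, not Mildred's actual metadata model.

```python
from typing import Dict, List

# Assumption: a simplified minimum set, loosely following DataCite's
# mandatory properties. The key names are invented for this sketch.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# A made-up record as a researcher might submit it.
draft = {
    "identifier": "doi:10.1234/example-a",
    "creators": ["Example Researcher"],
    "title": "Lake samples",
    "publication_year": 2017,
}

print(missing_fields(draft))
```

A check like this could tell the researcher exactly which few fields are still needed, instead of presenting a long form that tries to describe everything possible.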
At the moment, Mildred 3 is moving towards a phase that is more visible to researchers. The ThinkOpen website is testing the user interface of the research data search service (see Mildred pilots). The final version will be completed by the end of 2017.
As with other Mildred projects, the job is not over when the technology is set up. We need to get researchers to use the service for both publishing and searching for data.
“We can create some great applications, but if we can’t reach the researchers, these apps are useless. We need to find disciplines whose research is based on sharing data. And then we need to think about the disciplines that have the potential for sharing,” says Assinen.
Mildred has turned out to be an ambitious project with few counterparts internationally.
“I’ve urged our team to look for projects similar to Mildred around the world. And there are very few out there. One example was found at the University of Sydney. They have a Publish research data site and a Sydney Research Data registry, which has about thirty datasets. Therefore, finding benchmarks is quite difficult,” explains Assinen.
Pälvi Kaiponen and Pauli Assinen are convinced that building the research data infrastructure is an essential project. The Finnish authorities, the EU and many research funders (e.g. the Academy of Finland) are leading the development in the direction of better data management.
“When the funders highlighted the importance of publications (e.g. the Ministry of Education and Culture), this forced the university libraries to improve their publication management. The next step now is data management. This has already been achieved at the EU level, when the opening of data became the starting point of Horizon 2020. So, the path is clear, and other funders will follow. Sooner or later, open data will become the default option,” predicts Assinen.
“If the University of Helsinki is able to build a research data infrastructure that researchers find easy to use, our reputation will spread via the researcher community: ‘This is how we do it at the University of Helsinki!’” says Pälvi Kaiponen.