Historians improving Tuuli guidance – 15 notes from the hackathon discussions

What is research data for a historian? How to promote opening and sharing the data? And above all, how to improve guidance for data management so that historians would benefit from it in the best possible way?

These questions and many others were discussed by historians from the University of Helsinki in the Helsinki University Library (Kaisa House) on Monday, June 19. The aim of the hackathon event organized by the Helsinki University Library (HULib) was to develop the guidance and instructions in the data management planning tool Tuuli (DMPTuuli) (see RDM guidance in the Mildred M5 project) for historians. The library has hosted a similar workshop for researchers in the fields of arts, design and architecture.

The three-hour search for better guidance proceeded step by step, following the structure of DMPTuuli. Hence, the researchers shared their views on (1) data description, (2) data documentation, (3) data storing, (4) ethical and legal issues, and (5) data sharing and long-term preserving (see also Research Guide by HU).

The hackathon was carried out in the form of group work, and the historians raised a number of concrete issues and general ideas about DMPTuuli guidance. As Susanna Nykyri from HULib summarized, historians’ remarks challenged and encouraged the library to further develop the guidance. In this blog post we have put together some of the key findings of the intensive Monday session. So, here are 15 preliminary notes from the hackathon – you are free to share further ideas in the comment section!

The hackathon in action. Photos by Jussi Männistö.

General remarks

(1) The data management model and terminology used in DMPTuuli appear somewhat distant to historians. Researcher Lauri Hirvonen assumes that the tool may be more suitable for social and natural scientists than for historians accustomed to archive materials. ”The DMPTuuli seems to be most applicable for research projects making use of distinctive and limited data sets that are collected for analysis. A historian may not be able to produce this kind of specific data set, in which case some elements in DMPTuuli may seem rather far-fetched,” Hirvonen says.

(2) However, DMPTuuli as well as the hackathon event have encouraged historians to reflect on their documentation practices. Researcher Maiju Wuokko: ”We are forced to think about data management now because funding institutions demand it, but at the same time this brings out many crucial principles and practices in the discipline. So at its best, data management planning can be much more than a compulsory obligation in the funding process!”

(3) The hackathon clarified the overall process of managing the data, says Tuula Pääkkönen, who works as an information systems specialist at the National Library of Finland. Pääkkönen: ”All the questions in DMPTuuli highlight the issues, and even though they can seem tricky at first glance, it is good to bring them into the open. One challenge is how deep into the data management plan topics one should go. The recommendation for the DMP is 1–2 pages, so finding the suitable detail level for oneself but useful for others is one consideration.” Pääkkönen participated in the hackathon to seek ideas for the National Library’s Digi project.

(4) DMPTuuli sample answers tailored for historians are at the top of researchers’ wish list. Maiju Wuokko believes that concrete sample answers would have a positive impact on the internal debate in the field of history. But then, sample answers might control and direct the data management planning too much. However, the need for discipline-specific workshops in developing the guidance and strengthening the practices is obvious. The ultimate goal is to have guidance that follows the research process, as Professor Anu Lahtinen from the Department of Philosophy, History, Culture and Art Studies says.

Discussing data documentation.

Data description

(5) The data produced by historians was divided into three types: the source material (e.g. archive materials as such), the data produced by the researcher (e.g. notes on archive materials), and the project management data (e.g. research permits). This division was considered relevant in principle. A researcher is solely responsible for the data produced by herself/himself.

(6) Due to the nature of the research, the data produced by a historian often consists of notes. Maiju Wuokko points out that this is a good thing to keep in mind when communicating with historians: ”It might be important to emphasize that the most urgent thing for us is to keep our own notes safe and to think about further use of the notes. Archives take care of the source material for us.”

Data documentation

(7) The core idea of the documentation is to ensure that the data collected by the researcher is also understandable to other researchers. As illustrated by Tuula Pääkkönen, the documentation makes sure that your data is understood – by fellow researchers during your holiday and by yourself after the holiday. In the field of history, where the research process is often an individual performance, this can be challenging because documenting for others is not a self-evident practice. On the other hand, historians might feel that comprehensive documentation takes too much time away from research.

(8) The documentation is often included in the research itself in the field of history. In this respect Lauri Hirvonen considers the documentation in DMP slightly overlapping with the research plan and research questions. Hirvonen: ”The actual publication should disclose all of the most relevant data in the form of references and data catalogs.”

(9) If the data produced by historians is mainly notes, the quality and integrity of the data can be ensured, according to Lauri Hirvonen, by readable notes that are secured by backups and reviewed with source-critical methods. Hirvonen: ”In practice, this is accomplished by meticulous referencing and by reviewing the data comprehensively in the introduction. Through the precise list of references and data sources, any researcher should be able to check the accuracy of the research.”
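One way to make the point about backups concrete is to compare checksums of the original notes and their backup copy. The following is only a minimal sketch under stated assumptions: the notes are ordinary files in a folder, and the function names (`build_manifest`, `changed_files`) are hypothetical and not part of DMPTuuli or any University service.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(original: dict[str, str], backup: dict[str, str]) -> list[str]:
    """List files that are missing from the backup or differ from the original."""
    return sorted(
        name for name, checksum in original.items() if backup.get(name) != checksum
    )
```

Calling `changed_files(build_manifest(notes_folder), build_manifest(backup_folder))` returns an empty list when every note has an identical copy in the backup; any name in the list points to a file worth checking by hand.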

Developing guidance for data storing.

Data storing

(10) The practices of data storing vary from historian to historian. Some researchers have stored data in the Finnish Social Science Data Archive, others use a commercial cloud service. When making use of external cloud services, problems may arise if the service is not supported by the University of Helsinki. In addition to archives and cloud services, historians have been using hard disk drives, external hard disk drives, and network drives to store data.

Ethical and legal issues

(11) The most complicated ethical and legal issues are usually related to large data sets. Tuula Pääkkönen: ”In this data, there may be all kinds of special unique cases. One solution is to use old-enough material to minimize the risk,” Pääkkönen says.

(12) The anonymization of the data is sometimes demanding because a person can be identified in many ways, even when the name is removed. And on the other hand, if the anonymization goes too far, the data may lose its value and relevance for other researchers. Ethical evaluation included in the DMP process is considered particularly useful when there is uncertainty about the sensitivity of the data.

Sharing views on ethical issues related to data.

Data sharing and long-term preserving

(13) According to Lauri Hirvonen, sharing the data among fellow researchers may sound somewhat irrelevant to some historians. It is because of the nature of the data, explains Hirvonen: ”The research in the field of history usually doesn’t generate distinctive data sets that are analyzed by other researchers. The data produced is often notes based on a variety of historical records (e.g. documents, manuscripts, maps etc.), and merely sharing those notes as open data is problematic. Of course, also historians may produce valuable qualitative or quantitative data, for example in the field of economic history, and this data can be shared as a distinctive data set.”

(14) At what stage can the data be published? A historian may be ready to share the data only when it is used up, that is, when a certain number of articles based on the data have been written. The historians in the hackathon also mentioned the possibility to share part of the data, while the rest would be left for the researcher until the opening of the completed data. However, there are also other reasons why data is only partially shared. For example, the same data may have a wide variety of contents that must be shared or preserved in different ways.

(15) Sharing and preserving the data produced by historians may generate costs in at least two ways: the costs may arise from the storing itself (e.g. storage capacity) or the costs may arise from the production of metadata required for sharing and preserving. Additionally, costs may arise when increasing the usability of the data (e.g. data visualizations). The question worth considering is when the extra documentation (for sharing and preservation purposes) requires extra funding.

Text by Juuso Ala-Kyyny, photos by Jussi Männistö. Remarks in this blog post are based on the notes by Juuso Ala-Kyyny (HULib) and Markku Roinila (HULib), and on the email interviews with historian Lauri Hirvonen, historian Maiju Wuokko and information systems specialist Tuula Pääkkönen. The hackathon event was organized by Helsinki University Library – Mari Elisa Kuusniemi, Susanna Nykyri, Jari Friman, Tuija Korhonen, Ville Träff, Mikko Ojanen and Dolf Assmann.

One One Two Day

Today is national One One Two Day 112 (in Finnish and Swedish). Today several government officials wish to direct attention to public safety and the proper use of the emergency phone number 112. The aim of the day is to make Finland the safest country in Europe. How to prevent accidents from happening? How to make everyday life safer and more secure? What kind of infrastructure promotes safety? How to find help when needed?

Today is also the second birthday of the University of Helsinki Research Data Policy. Two years ago, on February 11, 2015, Rector Jukka Kola signed the Research Data Policy. The aim of the Research Data Policy is to define high-level principles regarding the collection, storing, use, and management of research data. The ultimate goal is to make researchers’ everyday life and core business, that is, research, easier and – in a sense – safer.

Although the policy paper itself does not make anybody’s life easier or safer, the policy requires the university to take certain actions in order to safeguard professional research data management throughout the research life cycle. From the researcher’s point of view the key questions are: How to find help when needed? What kind of infrastructure supports my research?

During the implementation phase of the Research Data Policy, the University of Helsinki has answered these questions by, first, setting up the Datasupport network and, second, launching the Project MILDRED. The Datasupport network assists researchers in the management of research data. Researchers are offered tools, services, and training regarding the management, use, access to, and sharing of research data. Via Datasupport researchers can reach all data management specialists at the University. The specialists will help researchers in a data-related emergency. As such, the Datasupport email researchdata@helsinki.fi is a kind of 112 number for researchers.

But as with the One One Two Day, the focus of the Datasupport network is not on the emergency cases, but on how to avoid them. How can research be planned to be safer and more secure data-wise? The Datasupport network’s aim is to prevent emergencies and help researchers plan the collection, storing, use, and management of research data in advance. Project MILDRED, on the other hand, builds a safe and well-planned infrastructure on which Datasupport and researchers can rely to support both data-intensive research and research based on so-called long tail data. Infrastructure is the basis of a well-functioning society and of research.

What is there for me in Mildred?

By Mikko Tolonen

University of Helsinki has decided to update its research data infrastructure. This project is called Mildred – as hopefully you already know. Mildred includes five different sub-projects that have their emphases on storage, metadata creation, data management, the actual use of the research data, sharing of the data and interoperability. From an architectural perspective the project seems more or less straightforward. How this translates into practice, we are about to find out in the near future.

What were the reasons for starting this project? The apparent motivation from the University’s side is the fact that research data will become a very valuable asset in the future. It is in the interest of the University to provide the infrastructure that enables its researchers to store their data, work on it and disseminate it. It is just as important that other scholars, the general public and, for example, funding bodies can easily find the relevant research data and then put it to further use. It is a reasonable assumption that questions of intellectual property, data management and open science will be decisive factors for all the invested parties in the future as we move towards a more dynamic research culture. We already have good examples of different infrastructures competing on the quality of their research data and its openness. I think we can all agree that this direction of openness, reusability and reproducibility is the way to advance science, and it is therefore evident that providing a functional research data infrastructure is in the general interest of scholars.

If this is the motivation at the University level, what about individual researchers? Why do we need Mildred? Can’t we just rely on existing research data infrastructures? There are plenty of those to go around, are there not? Isn’t Zenodo enough? What do we get out of this as University of Helsinki researchers?

The first observation that I would like to make is that when we are discussing research data infrastructures on a general level, we need to remember that there is no such thing as a monolithic concept of a “researcher”. As researchers we do not form a coherent whole with similar needs. We come from various fields, in all shapes and sizes and with multiple research infrastructure needs. Therefore, it is evidently a good idea that individual research projects from different fields can have an impact on the kind of research data infrastructure that the university is building. That is also why we are after concrete user cases.

When looking at the question of research data infrastructure from the perspective of the humanities, for example, we can even say that we cannot really see how our research data infrastructure needs will be shaped in ten years’ time. A plain fact is that much of the humanities research data is not currently in a digital form. This is why we also need to focus on the digitization process together with libraries and archives, something that might not be a concern for some other fields of science. Also, in one sense, in open science and in the sharing of research data, we in the humanities are perhaps behind natural science.

At the same time, looking at the needs of the humanities, it is evident that some of our research data needs will follow those of science in general. As an example, the use of CSC computing services is easily adaptable to humanities needs. The Pouta cloud service, for example, functions quite well for our research group, but to be honest, the question of backing up the data (or enabling reproducibility) is not that clear to us at the moment. However, there are also particular hermeneutic questions that we deal with in the humanities, which might mean that our data does not fit directly within the same scope as genome data; the level of detail can be different, and so forth. But these are not reasons not to be in the same boat as natural science in the Mildred project. Mildred also enables us to see the similarities and recognize the differences, leading to the adoption of best practices from other fields of science when possible.

At the same time, it should not be forgotten that Mildred goes hand in hand with different faculties and disciplines updating – or in some cases formulating – their own research data policies. This has recently been undertaken at the Faculty of Arts by Dean Hanna Snellman. Since most humanities data has not yet been digitized, it is only natural that there is currently no research data policy at the Faculty of Arts – all of these questions have come to us quite recently, although language technology, for example, with its large digital datasets, has been developing at the University of Helsinki for the past fifty years or so. But there should be a research data policy for the humanities, and there soon will be.

This gives us an opportunity to think about what the right kinds of infrastructures are for different types of humanities data. Most of the relevant infrastructures exist already (or are being developed) – what is crucial is to use them in an efficient way. The idea is not that Mildred will replace the existing infrastructures – Mildred is not designed as one research data infrastructure to rule them all – but we need to deliberate what research data should be curated by the national archives, what by the National or University Library, what by the Language Bank of Finland, what by Mildred and so forth. As humanities scholars come in all shapes and sizes, so does our data (also in different forms, for example, as sand from Egypt). Forming a research data policy is therefore crucial for humanities scholars, and it will help us greatly with the questions of open science in the future. Let me underline that, from my perspective, this is certainly a process where Mildred is useful.

The ethos of Mildred, as I have understood it, is to engage as many real life projects (with real research data needs) as possible. This is also the obvious reason why these user groups have been formed. I know that the Mildred people have done their best to spread the word about the project and that it is possible to get involved in shaping it, but unfortunately it is remarkably hard to get the word through at the University level, at least through Flamma. If you know someone who should be involved with Mildred, I would encourage you to get in touch with Ville Tenhunen or Eeva Nyrövaara, and I am sure that they will do their best to accommodate that person’s needs as well.

One part of Mildred is (along with research data policies in different fields) to negotiate the different roles of the various existing research data infrastructures (both national and international). Collaboration with ATT, the national data committee, other universities, repositories, libraries etc. is crucial. Luckily this is something that the project managers of Mildred are handling.

This includes the integration of many of the global standard-setting processes that seem to be going on everywhere with respect to research data: How to refer to the data sets? How to use PIDs? How to implement these persistent identifiers for datasets (something that CSC and the British Library, in my experience, are very much concerned with)? Luckily, in Finland the Open Science and Research initiative is involved in these questions and Mildred is directly engaged in these processes as well. It would be hard for individual researchers focusing on their own research to stay on top of all these developments in many different directions and at many levels. This is also one place where the processes of the Mildred project can be beneficial for us as researchers – research data infrastructure should eventually be guiding us in the use of best practices with clear guidelines.

One thing that I do know about Mildred is that what we do not want to do as researchers is to build a large, hard-to-use, inflexible, one-size-fits-all system that is not modeled after any real user cases. At the same time, it is evident that we need storage space for different types and sizes of data.

Then there are of course a great many concrete questions that we should be thinking quite hard about, based on the knowledge that can be gained from different use cases:

What level of reproducibility are we aiming at? Compatibility, version control, software citations, tool development with respect to our research data? How are these best supported by Mildred?

It seems that in all of these lingering general questions, small details with respect to an infrastructure are important. Meaning, it is important to build clear processes and functionalities that researchers are able to – and want to – use. This aspect of service design together with actual user cases cannot be overlooked.

All of these are questions that can be answered only through one particular route: that is through implementing and testing the forthcoming Mildred infrastructure against concrete research cases where research questions rule. My wish is that some kind of “platform thinking” would come through in Mildred for various different fields. For me, at best Mildred offers a platform for your research data needs so that you can better answer your current questions and take your scholarship to the next level. It is to me quite evident that this includes taking care of the data management, storage, reproducibility, tool development and dynamic use of the research data across different borders.

The idea of the user groups is not that they would be lip service. Mildred itself ought to be seen as an open science project. Sure, there are hardware questions that need to be decided and then cannot be altered, but if we are able to implement the right kind of platform thinking that includes the aspects of dynamic research, research collaboration beyond the University of Helsinki, and questions of intellectual property, I am sure that this is a process worth participating in.

One important theme in the project is experiences where things have been going wrong with respect to your current research data and infrastructures that should support its use. I hope we get to concrete cases of implementing the structure of Mildred with respect to interoperability, dynamic data and so forth as soon as possible.

Request for Comments: User Stories and Scenarios

The ultimate goal in Mildred is to create services that are both useful and useable for its users. “Useful” in this context means something that is needed and that makes one’s life at least somewhat (preferably significantly) easier regarding handling and managing data. “Useable” and user experience will be a topic for another blog post, so let’s concentrate on the question of how to make useful services.

When developing new software, modern organizations more and more often want to use so-called agile methodologies. They promote incremental development of the product in close collaboration with the customer; as feedback is received constantly during the development process, the final product should meet the customer’s real needs.

Among the several practices within the agile framework are user stories and roles. These are written in everyday language, and basically describe what task a user wants to achieve using the product. An example could be: “As an author of an article I want to publish my data set so that it can be found and cited by other researchers” or “As a researcher I want to find data sets published by other researchers so that I can make new observations combining the data sets”.
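The two examples above follow the common "As a <role> I want <goal> so that <benefit>" pattern. As an illustration only, here is a small sketch of how stories in that shape could be captured in a structured form, for instance to sort or count them by role. The class name `UserStory`, the function `parse_story` and the regular expression are assumptions made for this example, not something used by the Mildred project.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: "As a/an <role> I want (to) <goal> so that <benefit>"
STORY_PATTERN = re.compile(
    r"^As an? (?P<role>.+?),? I want (?:to )?(?P<goal>.+?) so that (?P<benefit>.+?)\.?$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class UserStory:
    role: str
    goal: str
    benefit: str

    def render(self) -> str:
        """Write the story back out in the everyday-language form."""
        article = "an" if self.role[0].lower() in "aeiou" else "a"
        return f"As {article} {self.role} I want to {self.goal} so that {self.benefit}"


def parse_story(text: str) -> Optional[UserStory]:
    """Split a story into role, goal and benefit; return None if it does not fit the pattern."""
    match = STORY_PATTERN.match(text.strip())
    if match is None:
        return None
    return UserStory(match["role"], match["goal"], match["benefit"])
```

A comment that does not fit the pattern simply yields `None`, which is a reminder that free-form feedback still needs to be read by a person rather than forced into the template.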

This is where we need you, dear future user of Mildred services! We have gathered some user stories based on the cases that our experts and data support group have solved during the years. Are they correct? Is something missing? Are some of them more important to you than others?

Read and comment directly here: https://wiki.helsinki.fi/display/Tutkimusdata/MILDRED+User+stories (requires authentication with UH or Haka credentials).

These user stories are used in most Mildred projects, so it is of the utmost importance that they are good representations of your challenges. Unfortunately we cannot promise to fulfil all needs in the coming year, but the development will continue even after Mildred’s project phase is over.

Programme of the next user group meeting

Coordinates of the next MILDRED User Group meeting are:

Date and time: Wednesday 1st February 10:00-12:00

Place: Meilahti Campus, Biomedicum Helsinki 1 (Haartmaninkatu 8),  Kokoushuone 3 (http://www.helsinki.fi/teknos/opetustilat/meilahti/h8/kok3.htm).

Registration is open until 31st January 2017 (for dietary restrictions 24th January 2017): https://elomake.helsinki.fi/lomakkeet/76215/lomake.html

Programme:

10:00 – 10:10 Coffee

10:10 – 10:20 Opening, prof. Mikko Tolonen

10:20 – 10:30 Introduction to the theme and method: Current obstacles in data-centric research, Aija Kaitera & Mari Elisa Kuusniemi

10:30 – 11:10 Workshop session, participants

11:10 – 11:20 Summary, Aija Kaitera & Mari Elisa Kuusniemi

11:20 – 11:55 MILDRED 2: Example and demo of the possible data service, Kimmo Koskinen & Ville Tenhunen

11:55 – 12:00 Next steps and closing, Eeva Nyrövaara

Welcome!

Building better RDM guidance

The aim of the Mildred M5 group is the implementation of the data management planning tool DMPTuuli, which helps you write data management plans. In a data management plan you give a brief description of how you will collect, manage and store your data, and how the data can be used now and in the future. Many research funders, like the Academy of Finland, require a data management plan in their funding applications. DMPTuuli brings the funders’ requirements, the guidance and the support services provided by the UH together in the same user interface.

In the M5 project the focus is on writing targeted research data management (RDM) guidance for researchers and students at the University of Helsinki. The current guidance in DMPTuuli consists of generic-level instructions for all disciplines. Obviously, one size doesn’t fit all, and so we are planning to create discipline- or data-type-specific guidance to better meet your demands.

First we are planning to make simplified guidance aimed at those who are not so familiar with data management plans. So far we have defined the principles of good guidance and benchmarked the available RDM guidance made by other organizations. At the moment we are writing a glossary of terms related to RDM.

In the next phase we will find out what needs there are for more specific guidance. For example, research involving sensitive data that needs to be anonymized might benefit from RDM guidance that goes deeper than the generic-level instructions.

But we cannot write guidance without your contribution. Now that we are working on the beginner-level guidance aimed, for example, at students, we would be happy to have volunteers to help us by giving comments and suggestions on what kind of guidance would be useful for students. And furthermore, if your research group has a need for more specific RDM instructions, please let us know! Feel free to use our feedback form or leave a reply to this post.

Licenses and next meeting of the user group

First things first. The next MILDRED User Group meeting will take place on Wednesday 1st February 2017 from 10 to 12. The venue will be Biomedicum Helsinki 1 (Haartmaninkatu 8), Kokoushuone 3 (http://www.helsinki.fi/teknos/opetustilat/meilahti/h8/kok3.htm).

The last day to register is 26th January 2017, and you can register here: https://elomake.helsinki.fi/lomakkeet/76215/lomake.html

Then, some thoughts about licenses.

It is important to define the terms and conditions of the data, code, materials etc. before someone else uses them. It makes life easier if users know these things.

Therefore we have decided to describe the principles of the Project MILDRED licenses before we have any services or software in place. The idea is that you can comment on this proposal. The steering group of the project will make decisions later, after the user group has had a chance to comment on these things.

The proposal:

Unless there are legal restrictions or other compelling arguments for some other license, the following licenses will be used within the Project MILDRED:

A legal restriction could be, for example, a so-called copyleft license or the use of commercially licensed software. A compelling argument could also be the wish of the owner of the information or material.

We would like to hear your opinion about this proposal. You can leave a comment on this blog or you can use the feedback form of the project: https://elomake.helsinki.fi/lomakkeet/76062/lomake.html

EDIT: 21.12.2017 Venue and registration.

The MILDRED user group and research data service user profiles

The Project MILDRED held its user group meeting last Tuesday with nearly 20 participants from all campuses of the University of Helsinki. The main topics of the meeting were the user profiles and researchers’ needs.

The open space session produced a number of comments and ideas that are very useful for the project. On behalf of the project managers, I would like to thank everyone for their participation and discussions! The results are now in the wiki.

If you didn’t make the meeting, no worries. We have opened a wiki page where you can find the material and leave your comments. You have to log in to the wiki with the University’s credentials or HAKA credentials.

Here are links:

Newsletter 2/2016

The second MILDRED Newsletter was emailed today. You can read it also below.

1. The user group workshop

How do you use research data? What tools do you need? Which services should the Project MILDRED build?

The Project MILDRED is arranging a User Group meeting and workshop on Tuesday 15th November. The event will take place 14:00-16:00 at Siltavuorenpenger 5, K218. In this session we will discuss user profiles, services, requirements and the next steps of the project.

Come and tell us what kind of services you need for your research.

Registration is open until Friday 11.11.2016: https://elomake.helsinki.fi/lomakkeet/74066/lomake.html

2. Project managers of the subprojects of the MILDRED

All the Project MILDRED subprojects are now up and running. They have their own project managers, as listed below:


Storing, sharing and visualizing

Autumn is coming! And the MILDRED projects are finally accelerating to full speed. During the summer we have tried to figure out what we really want to do: we have learned what kind of data services our researchers are using at the moment, we have gathered general requirements for research data infrastructures and we’ve tried to find out what kind of research data services exist around the world nowadays.

The functional requirements and use cases for data repository and publishing services we have gathered during the summer can be put into three categories: Storing, Analyzing & Visualizing, and Sharing.

  1. Storing

    By storing we mean a service that provides scalable storage space throughout the research data lifecycle so that users need to save their data only once and only in one storage. The data should be secured, backed up and accessible from all kinds of clients, e.g. browsers, the command line or even desktop clients.

  2. Analysing & Visualizing

    Research data is very valuable as such, but we can add even more value to data by analysing and visualizing it. Therefore our research data services should provide all kinds of tools for users to enrich their data without moving or copying their data to other systems.

  3. Sharing

    By sharing, we actually mean three different types of sharing:

  • Short term sharing: As said in an earlier post by Anna Salmi, a large share of our researchers are using file sync & share services like Dropbox and SugarSync. The researchers need an easy and fast way to share their files with colleagues all over the world without worrying about user permissions or federated authentication.
  • Shared storage: Researchers also need data storage that is shared with their research groups or other colleagues. Additionally, research groups might need collaborative tools to ease their data workflow.
  • Data publication: For publishing research data, there are all kinds of requirements that must be met to provide a trusted and accessible repository, i.e. the publishing service must support discoverability, reproducibility and reusability. That means, for example, standardized APIs, rich metadata services and automated text citations.
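To give a rough idea of what "rich metadata" and "automated text citations" could mean in practice, the sketch below builds a citation string from a small metadata record. Everything in it is an assumption made for illustration: the field names, the `DatasetRecord` class and the citation layout (creators, year, title, publisher, identifier) are not taken from any decided MILDRED service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """A minimal, hypothetical metadata record for a published data set."""
    creators: tuple[str, ...]
    year: int
    title: str
    publisher: str
    identifier: str  # e.g. a DOI or another persistent identifier


def format_citation(record: DatasetRecord) -> str:
    """Build a plain-text citation from the metadata fields."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}). {record.title}. "
        f"{record.publisher}. {record.identifier}"
    )


# Placeholder values only; not a real data set or identifier.
example = DatasetRecord(
    creators=("Researcher, A.", "Researcher, B."),
    year=2016,
    title="Example data set",
    publisher="Example Repository",
    identifier="doi:10.0000/placeholder",
)
print(format_citation(example))
```

The design point is simply that the citation is generated from the same metadata the repository already holds, so researchers do not have to write it by hand.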

During the summer we’ve tried to find out-of-the-box open source software that would meet all the requirements we’ve gathered. Unfortunately, and not so surprisingly, there is no silver bullet. There are a few software solutions that would fulfil some requirements, but in the end they all lack something crucial. Basically, what we are trying to develop is some kind of platform-driven e-infrastructure, which is a rising trend in the field of research data infrastructures. There are a few platform-driven repositories in the world today, for example the Purr repository at Purdue University and the CUSP Datahub at New York University. Unfortunately the technology underneath these platforms is either obsolete or otherwise unsuitable.

We may not be able to build a world-class open science platform right away, but we want to create something to build on and be at least one step closer to our world-class platform. Still, the most important goal is to provide data services that our researchers want to use. We would like to offer a seamless user experience throughout the data life cycle to set the bar low for publishing research data. Hence, this autumn the user group will have a chance to comment on use cases and concepts before any decisions are made. We want to be sure we are delivering the right services for our researchers. And remember, any kind of feedback is appreciated.