Visibility of publications on the web is enhanced by harvesting of repositories

How are publications posted in the institutional repository Helda disseminated around the world? This has not been studied previously, so this blog article looks at the question with the help of a small random sample. A sample of twelve publications suggests that publications from the repository are disseminated well into different online services, but that indexing varies considerably by publication type and service model.

(This article is also available in Finnish.)

Text: Mika Holopainen & Kimmo Koskinen

The open repository Helda at the University of Helsinki holds a wide range of content types, such as research articles, dissertations, serial publications and books from the Helda Open Books collection. Helda contains the metadata of publications and, in most cases, also their full text, which can be downloaded as a PDF. Because of publisher restrictions, articles are often archived not in their final published version but as accepted manuscripts.

The dissemination and availability of publications from Helda has not been studied previously, so we decided to explore the matter with a small-scale sample. The aim was to gain a general picture of how Helda publications are indexed in the databases of some major indexing services.

The harvesting services have developed their own indexing algorithms, which determine how effectively and completely harvesting works. Indexing in this context means the transfer of metadata to a harvesting service in such a way that the record includes a link to the Helda repository, enabling the download of the final publication or an accepted manuscript version of it. The metadata in Helda contains a bibliographic description of the publication and often includes a DOI identifier or a link to the publisher's website.

From the researcher's point of view, it is usually beneficial that awareness of their publication spreads widely into different services, as it means that the paper can generate more discussion and be cited more often. The positive effect of open publishing on citation rates has been observed in many studies (e.g. Piwowar et al. 2018).

Publications were spread into 13 services

Findability was first studied from search results in Google and Google Scholar and by checking search results from a few harvesting services.

We selected 12 publications that were added to Helda between January and March 2020. The study was conducted in June 2020, at which point the papers had been openly available for 3 to 6 months. The distribution of publication types was as follows: 4 monographs, 3 doctoral dissertations, 2 master's theses and 3 articles (one in a monograph, one in conference proceedings and one in a journal). The list of selected papers is attached at the end of this article.

The selection included different types of publications that were added to Helda during the same period. One of the goals was to discover possible variations in indexing of different publication types.

In this study, publications from Helda were found in 13 different services, each with a link to the description in Helda; some services also made the full publication available directly. The services can be divided into these groups:

  • Finnish research portals: University of Helsinki Research Portal, national Juuli portal
  • Finnish search services: Finna, Terkko Navigator
  • International search services: BASE, Core, Google Scholar, Microsoft Academic, OpenAIRE, Semantic Scholar
  • Social media services: Facebook, ResearchGate, Twitter

Figure 1. Percentage of findings of Helda publications (N=12) in selected online services.

Differences between services

The Finnish online services featured in the results differ in scope, content profile and operating principles. The University of Helsinki Research Portal, with the TUHAT system in the background, does not harvest automatically from other services: University of Helsinki researchers submit the metadata of their publications themselves, and in addition the library imports data from online databases. Juuli is a joint research information portal that receives data from the research information systems of Finnish universities. Finna is the common search portal of Finnish memory organisations. Terkko Navigator is a portal from Helsinki University Library that contains information on research publications and is tailored especially for medical researchers.

Indexing of different publication types

The study is based on a small sample, so one should not draw definitive conclusions or make quantitative comparisons between different services. As a general observation, the most comprehensive harvesting results seem to have been reached by Core, BASE (Bielefeld Academic Search Engine) and OpenAIRE (see Figure 1). This does not come as a surprise, since these services have a mutual agreement on the exchange and synchronisation of metadata. Note also that BASE is run by the University of Bielefeld, which also takes care of harvesting for the OpenAIRE portal.

Description of a Helda publication on the OpenAIRE portal.

The indexing coverage of Google Scholar was fairly low in June, only 33 %. A separate check was done in September, and by then the coverage had increased considerably, reaching 83 %. Based on this sample, indexing into Google Scholar was a relatively slow process, which was rather surprising.
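The coverage figures above can be reproduced with simple arithmetic. The following Python sketch is only an illustration: the counts of 4 and 10 publications are not stated in the article but are inferred from the reported percentages and the sample size of 12, and the function name `coverage_percent` is hypothetical.

```python
SAMPLE_SIZE = 12  # N reported for the sample


def coverage_percent(found: int, sample_size: int = SAMPLE_SIZE) -> int:
    """Return the share of sampled publications found in a service, as a whole percent."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= found <= sample_size:
        raise ValueError("found must be between 0 and sample_size")
    return round(100 * found / sample_size)


# Assumption: these counts are inferred from the reported 33 % and 83 %,
# they are not taken directly from the study data.
google_scholar_found = {"June 2020": 4, "September 2020": 10}

for month, found in google_scholar_found.items():
    print(f"{month}: {found}/{SAMPLE_SIZE} = {coverage_percent(found)} %")
```

With a sample of 12, a single publication shifts the coverage by roughly 8 percentage points, which is one reason why the percentages should be read as indicative only.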

The observations in this study show that at least ten online services harvest metadata added to Helda within a few months of submission. The scope of indexing differed especially for master's theses. Both theses, with their links, were found in only three services: Core, BASE and Google Scholar. Besides publication type, differences in indexing coverage stem from the differing operating principles and algorithms of the indexing services, for example whether they index metadata with a link to the source and whether the same publication is harvested from several repositories. In addition, the outcome of harvesting is affected by agreements between services and by the regulations and data collection systems of universities at the national level.

Many of the online services mentioned in this article are also discussed in a previous Think Open blog article: Testing Alternative Access – some other ways to reach research articles.

Publications selected for the sample: 

Mika Holopainen (TUHAT, ORCID, @mholopa) works in research services at Helsinki University Library, acting as information specialist and liaison librarian at the Kumpula campus.
Kimmo Koskinen (TUHAT, ORCID, @kikoskin) works as development manager in research services at Helsinki University Library.