Challenges of becoming too popular – arXiv growth requires updating maintenance

The most important repository of mathematical and physical sciences already contains 1.6 million e-prints. arXiv provides a platform for sharing e-prints openly for peer review. Over the years arXiv has grown into a giant, encouraging the birth of similar repositories in other scientific fields. This has been a challenge for arXiv maintenance, both in the technical and administrative sense. In this article, bibliometrics expert Eva Isaksson describes arXiv history, development and challenges.

(This article is also available in Finnish.)

Text: Eva Isaksson

When arXiv started in 1991, it was the first repository of openly available online scientific articles, also known as e-prints. The idea originated from printed pre-prints, article versions that were distributed before publication. These were stapled with thin cardboard covers and distributed to libraries. The librarians piled them on the new items display table, where researchers picked them up for browsing.

Then came the Internet and transformed the pre-print mail traffic into bytes. Researchers began to upload their materials to a new service, launched by Paul Ginsparg at the Los Alamos National Laboratory. The service grew, and it changed its name to arXiv after Ginsparg brought it with him to Cornell University. In 2001 Cornell University Library took over. This was a necessity, as arXiv had grown so much that its maintenance and financing were no longer possible otherwise.

Cornell University has been maintaining arXiv since 2001. / Photo: Eva Isaksson

Being a pioneer is not always easy. arXiv was born early, without a previous model to follow – no pre-existing code or ready-made standards. The service started when the World Wide Web was very new. The functionalities were built in the 1990s for the acute needs of that time, for a community of researchers on their own terms. For researchers, arXiv is primarily a publishing platform rather than a self-archiving repository. Consequently, its metadata priorities differed from those of libraries.

Challenges of time: maintenance for growing usage

Fast forward to 2019 – how has arXiv changed?

The number of publications in arXiv has increased significantly: more information in arXiv.

The first thing to notice is that arXiv has grown enormously. It now includes over 1.6 million e-prints, and the growth is accelerating. Many people still think that the main focus is on high energy physics, but in reality the growth of both high energy physics and astrophysics has been fairly stable, while the shares of mathematics and computer science are continuously increasing. The usage is also quite intense: about 0.7 million e-print views in one day.

Secondly, arXiv still runs on fairly old software. This involves not only the e-print repository but also the moderation platform software. Without moderation, the process of adding new content would not be possible. In practice, this means that arXiv needs control mechanisms to make sure that the uploaded articles meet quality requirements, and to avoid abuse. Parts of this moderation process are automated. Additionally, there are 175 volunteer moderators looking after their own specific fields of research. This has developed into a massive system that cannot be easily updated.

Third point: despite the fact that arXiv is a well-known and well-established service, it has experienced considerable challenges with both maintenance and development. At these high volumes there are bound to be costs. To tackle this, Cornell University Library created a sustainability program in 2010. The most active users were invited to pay a support fee according to their usage level tiers. Currently, 240 organizations are participating, most of them university libraries. Helsinki University Library is the only participating organization in Finland. Other heavy arXiv users include Aalto University and the University of Jyväskylä, but they haven’t become supporters.

Fourth, one must mention the challenges posed by open access goals. Plan S has been a major headache for arXiv. A major reason is that the arXiv metadata schemes date back to a time before current standards, and a big ship like this is not able to turn quickly in the narrow straits defined by Plan S. For example, the funding information requirement and certain technical requirements would necessitate considerable changes to arXiv. Also, the recommendation regarding open citations falls outside the original purpose of arXiv.

The dilemma of the accepted paper

For a long time, people active with arXiv were looking into possibilities to expand the selection of research fields without overly burdening the moderation resources. Although a few fields related to mathematical methods have been added, the overall situation has changed in the 2010s because of new e-print services. The best known among these is bioRxiv, which started in 2013. Its interface clearly shows that both the code and the metadata definitions are a couple of decades more modern than those of arXiv.

However, bioRxiv traffic, both in terms of uploads and downloads, is still minuscule compared to arXiv. There are also other differences. One of the requirements of the U.K. REF2021 evaluation process is to have all publications openly available. Researcher guidelines highlight differences between the two repositories. Articles that have been accepted by publishers can be uploaded to arXiv, while bioRxiv does not allow that. For example, the University of Cambridge recommends that researchers upload any articles to bioRxiv well before acceptance, or alternatively to the university repository. As for arXiv, it is recommended to include “Accepted for publication” in the arXiv metadata.

Eva Isaksson from the University of Helsinki has participated in arXiv user community activities at Cornell. The founder of arXiv, Paul Ginsparg (seen in profile), is sitting in the background.

The University of Helsinki has also been struggling with its arXiv guidelines, as we have to consider the open access requirements of the Ministry of Education and Culture in order to be able to count arXiv items in our open access statistics. The big problem here is that the metadata often lacks information about acceptance, as researchers often do not enter it. How can one formulate guidelines so that this becomes clear for everyone involved? As a compromise, the library has recommended that researchers alert the library about all of their new arXiv entries. Do we need guidelines similar to those at the University of Cambridge to urge researchers to include information about acceptance with their arXiv entries?

Other new repositories exist, and several others are planned: ChemRxiv, EarthArXiv, etc. For these, the volumes are still fairly small. Publishers have generally accepted these after noticing that they tend to bring added value and attention to the publisher’s version.

E-print in arXiv or bioRxiv? Tell the library about it

  • According to the University of Helsinki’s principles of open publishing (2017) all publications produced at the university will be archived in the open repository HELDA.
  • As soon as an article has been submitted to or published in a journal, it’s enough to send the name of the journal and the arXiv or bioRxiv address of the article to the library: openaccess-info@helsinki.fi.
  • The library archives the latest version of the publication in HELDA open repository via TUHAT research database.
  • The archiving service is available to all University of Helsinki authors.

Extending the resources

arXiv should not be taken for granted. It is an ambitious and surprisingly successful attempt to create a new scientific culture. The current challenges arise from the huge popularity of the service. The basic principles are sound and arXiv has an excellent reputation. On the other hand, bringing the service up to date requires considerably more resources.

In January 2019, arXiv moved from Cornell University Library to Computing & Information Science (CIS). The leadership structure has some holes, as there are vacancies that are not easily filled. Updating the code could benefit from community involvement, perhaps by developing the arXiv code in a Linux-like process.

So far, arXiv has been able to acquire considerable funding from foundations. Apart from that, support from its international user communities is vital. Finnish users should also be active in providing sufficient resources for arXiv, as it is an important publication platform and information source for Finnish researchers too. Anyone interested in arXiv development can follow it on the arXiv public wiki.


Eva Isaksson (TUHAT, ORCID, @eisaksso) works as an Information Specialist (specializing in bibliometrics) at the Helsinki University Library.