The large-scale approach – my experience with open databanks

by Ana Lúcia Lindroth Dauner

When someone tells me that the climate is changing, I always think “Okay, but how? Where is it changing? Is it changing in the same way everywhere? How do the changes affect ecosystems and human societies?” To answer these types of questions, a comprehensive set of adequate, good-quality data is essential. The challenge with these large-scale questions is that you cannot collect all this data within your research project – or even within your lifetime of research. For example, I am currently investigating how past changes in Arctic sea-ice cover affected lake ecosystems in the northern polar region. I cannot simply go out and collect samples from every lake in and around the Arctic.

The sampling of sedimentary records usually requires complex and expensive expeditions with a team of scientists, especially if the sampling site is remote. For example, some lakes are located far away from roads and electricity, so suitable vehicles are required and generators have to be carried into the field. After the sediment cores are recovered, the sediment dating and analytical laboratory work is expensive, laborious and time-consuming. Analysing just one sediment record, sliced into approximately 100 samples, took me around four months of direct laboratory work during my PhD. So, imagine how long it would take for an army of PhD students to analyse, for example, one record from each of one hundred different lakes! And even if I could sample one lake on each continent, it would not be representative of the whole region. Several factors affect the organic matter production within a lake, such as lake elevation, hydrology, the type of catchment vegetation and the distance from the coast (termed “oceanicity” or “continentality”), which affects temperature and precipitation patterns. These regional and local factors vary greatly within a large region. That being said, what is the solution? Well, one option is to use data collected by other researchers!

Even though the northern polar region is remote, high-quality scientific research has been performed there for decades, and many of the published datasets are available in online databases such as Pangaea1, NOAA/WDS for Paleoclimatology2 and the Neotoma Paleoecology Database3. These databases provide proxy information from lake sediment cores, some of which could be suitable for inferring past changes in lake production, such as the percentage of organic matter or the number of microalgal fossils. All I need to do is download these datasets, combine the information and answer my question, right? Well, it would be fantastic if it were that fast and easy! However, that is rarely the case.
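For readers curious what the “download and combine” step looks like in practice, here is a minimal sketch in R (the language I work in) for reading one downloaded record. The file name is hypothetical, and I am assuming the usual PANGAEA-style tab-delimited export, whose metadata header ends with a “*/” line; other databases use different layouts.

```r
# Minimal sketch: read one hypothetical PANGAEA-style tab-delimited export.
# "lake_core_organic_matter.tab" is an invented file name for illustration.
read_tab_export <- function(path) {
  lines <- readLines(path)
  header_end <- which(lines == "*/")   # the metadata block is assumed to end with "*/"
  skip <- if (length(header_end) > 0) header_end[1] else 0
  read.delim(path, skip = skip, check.names = FALSE)
}

core1 <- read_tab_export("lake_core_organic_matter.tab")
head(core1)   # depth/age and proxy columns, e.g. organic matter content
```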

First of all, there is more than one central database, so it was almost like looking for a specific author across several libraries. Then there was the challenge of carefully choosing the search terms. If I simply looked for “lakes” in a northern hemisphere database, I received thousands of results, most of them lacking the information I was looking for. If, on the other hand, I used overly specific search terms, such as “biogenic opal flux”, I got just a few datasets and might have missed valuable information. There is no magic formula here; in big-data science, as in research in general, perseverance and patience are key. One last bump in the road during the search process was that many published datasets have never been made available online. Although several journals and universities nowadays require authors to deposit their data in one of the online databases, sometimes even before the manuscript is published, a significant number of good datasets remain with their authors. So, in many cases, even big-data searching involves basic human connection: writing to the authors, asking for their data for a specific purpose and hoping that they respond positively.
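To give a flavour of the broad-versus-specific trade-off, here is a small sketch using the rOpenSci pangaear package to query PANGAEA from R. Treat it as an illustration only: the exact arguments and the columns of the returned tables are assumptions on my part, and you would still need to screen the hits by hand.

```r
# Sketch of the search-term trade-off, assuming the rOpenSci 'pangaear' package.
# install.packages("pangaear")
library(pangaear)

broad  <- pg_search(query = "lake")                 # far too many hits, mostly irrelevant
narrow <- pg_search(query = "biogenic opal flux")   # very few hits, may miss useful records

nrow(broad)    # compare how many candidate datasets each query returns
nrow(narrow)

# A possible middle ground (the bbox argument is an assumption): a moderately
# specific term restricted to high northern latitudes.
arctic <- pg_search(query = "lake sediment organic matter",
                    bbox = c(-180, 60, 180, 90))
```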

After searching the relevant databases with suitable search terms, I (luckily!) found myself with lots of datasets. Yet after all that, I needed to tackle the main challenge: keeping things in order! As someone who struggles with orderliness, it was especially important to find a way to keep my research organized. Every sediment record received a unique ID, and I created a spreadsheet where all the information was combined under this unique ID. This way, I can easily check the information related to a specific record, such as the lake size or whether the record includes microalgae data. Another important step was to standardize the format of the datasets so that I could work with them more efficiently. Only then was it possible to start cleaning the data. After searching several databases (or libraries), it is common to end up with replicas, i.e., the same dataset found in more than one library. These needed to be removed, and the remaining data problems fixed, such as missing values, inadequate temporal resolution and inconsistent units (e.g. temperatures in Celsius versus Fahrenheit). All these filtering, cleaning and tidying steps are extremely important but can take a long time, so planning accordingly is vital!
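As an illustration of these organizing and cleaning steps, here is a minimal sketch in base R with a made-up metadata table; the IDs, column names and values are hypothetical, not my actual compilation.

```r
# Hypothetical master table: one row per downloaded dataset, keyed by a unique record ID.
meta <- data.frame(
  record_id = c("ARC001", "ARC002", "ARC003", "ARC003"),
  lake_name = c("Lake A", "Lake B", "Lake C", "Lake C"),
  source_db = c("PANGAEA", "NOAA", "Neotoma", "PANGAEA"),  # which "library" it came from
  doi       = c("10.x/a", "10.x/b", "10.x/c", "10.x/c"),
  temp_unit = c("C", "F", "C", "C"),
  has_algae = c(TRUE, FALSE, TRUE, TRUE)
)

# 1. Remove replicas: the same dataset found in more than one database.
meta <- meta[!duplicated(meta$doi), ]

# 2. Standardize units, e.g. convert Fahrenheit temperatures to Celsius.
to_celsius <- function(x, unit) if (unit == "F") (x - 32) * 5 / 9 else x

# 3. Use the unique ID to look up the information for a specific record.
meta[meta$record_id == "ARC002", c("lake_name", "has_algae")]
```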

Enough of the challenges: you want to know what happened to my data-digging experiment, right? At this point, I am still dealing with the above-mentioned steps! Patiently and gradually, I will be able to proceed to the next steps: appropriate analyses and graphs that combine and compare all the information in a way that gives robust answers to my questions, so that my compilation is not, well, “just a lot of data”. As someone who is still in the searching, cleaning and tidying stage, I increasingly value the studies and the researchers that have managed to extract information from many previous studies to answer a new and exciting research question.

Also, from what I have so far, I think we need more fieldwork and more lake records! Most of the studies in the online databases were performed in coastal regions, mainly in Alaska, Fennoscandia and coastal Greenland, while data from Arctic Russia, parts of Canada and the more continental regions are still lacking. Moreover, analytical and chronological techniques, and with them data resolution and proxy variety, have improved over time. In any case, I believe that the collection of published data I have compiled will be enough to start answering some of the intriguing large-scale questions. Stay tuned!

_______

Ana Lúcia is a postdoctoral researcher who, strangely, finds joy even when R crashes.

Explore the amazing paleo datasets:

1 https://www.pangaea.de/

2 https://www.ncdc.noaa.gov/paleo-search/

3 https://www.neotomadb.org/