A series of four seminars on Statistics (week 47)

Series title: Statistics: Changes since I was an undergrad

Abstract. I took my first course in Statistics 37 years ago. How we do statistics has changed dramatically since then. The amount of data we produce and analyse has also increased enormously. However, different research communities are making use of these new possibilities to very different extents. Even Biology curricula at different universities differ substantially in their emphasis on “quantitative” methods and “numerical/mathematical” literacy. All branches of Biology are becoming more quantitative. By reflecting on how advances in computers and computing science (and in the methods these advances have made possible) have opened a whole new way of approaching data analysis, I hope I will make you rethink your approach to data analysis.

If you are planning to participate next year in my course 526052 Using R for reproducible data analysis, I recommend that you attend this series of talks as a gentle introduction to the subject. If you attend you can get credits, either through the DDPS or in the regular seminar series in Plant Biology.

Place: Biocenter 3, Room 5405 (5th floor, the room in front of the stairs)

Two hours are reserved, for talk plus discussion.

Please, let me know by e-mail if you intend to participate.

Part 1: Increased ease of computation

Monday 17 November, 10:15-12:00

This first talk focuses on describing the advances in computing hardware and software and why they are relevant to data analysis. I will also briefly mention the now fashionable “Data Science” and “Big Data” concepts and the currently fuzzy boundary between statistics and programming.

Part 2: Advances in theory and methods

Tuesday 18 November, 13:15-15:00

If your statistical knowledge is limited to the “traditional” methods, I hope to introduce you to the new possibilities brought about by lifting the limitations that used to prevent us from using computation-intensive methods and analysing big data sets. In contrast, if you are a young researcher, well versed in modern methods, you will hopefully still find my talk interesting from a historical perspective, as it gives a glimpse of the limitations we had to deal with in the recent past, and of how they influenced, and still influence, the traditional ways of treating biological data. This talk focuses mostly on statistical theory and methods. However, no specialised methods like those used in molecular biology or vegetation analysis will be described in this talk.

Part 3: Examples of modern methods using R

Wednesday 19 November, 13:15-15:00

In this talk I will present some examples of types of analyses that have become available to any biologist thanks to the increase in computing capacity and the development of new theory and methods that make use of these new possibilities. The aim is not to teach you how to apply these methods, but instead to give an idea of how broad an array of methods is currently available to anyone with access to a run-of-the-mill personal computer or, failing this, a cheap cloud server.

Part 4: Reproducible research and data analysis

Thursday 20 November, 13:15-15:00

This talk introduces the currently hot topic of research accountability and repeatability. I will discuss why this openness is needed, how it can be achieved in practice, and how modern software, and modern combinations of old software, make it possible to achieve this goal rather painlessly even for complex data analyses. I will also reflect on the origins of these ideas in computer programming, in particular the concept of literate programming proposed by Donald Knuth in the early 1980s.

However good data looks at first sight, check it!

Introduction

Today I will tell two true stories, one old and one very recent. The point that I want to make is that one should never blindly trust the results of measurements. This applies in general, but both examples I will present have to do with measurements made with instruments, more specifically with measuring UV-B radiation in experiments using lamps.

A case from nearly 20 years ago

Researcher A received a very good new spectroradiometer from the manufacturer and used it to set the UVB output from the lamps.

Researcher B had access to an old spectroradiometer that could measure only a part of the UVB spectrum. He knew this, and measured the part of the spectrum that was possible to measure, and extrapolated the missing part from published data. He also searched the literature and compared his estimates to how the same lamps had been used earlier.

Researcher A was unlucky: because of a mistake at the factory, the calibration of the new instrument was wrong by about a factor of 10. She did not notice until after the experiment was well under way, but before publication. The harm was that the results were less relevant than intended, but no erroneous information was published.

Researcher B was able to properly measure the UVB irradiance after the experiment was well under way, and he found that the treatment was within a small margin of what he had aimed for.

A case I discovered just a few days ago

The authors of a recently published paper concluded that they had obtained evidence that a low and ecologically relevant dose of UVB on a single day was able to elicit a large response in the plants. From the description of the lamps used, the distance to the plants and the time that the lamps were kept switched on, it is easy to estimate that in fact they had applied a dose that was at least 15 or 20 times what they had measured and reported in the paper. Coupled with a low level of visible light, this explains why they observed a large response from the plants! Neither the authors, nor the reviewers, nor the editor had noticed the error! [added on 8 October] I read a few other papers on similar subjects from the same research group and the same problem seems to affect them too. I will try to find out the origin of the discrepancy, and report here what I discover.

[added on 26 October]
I have contacted three of the authors. They have confirmed the problem. The cause seems to have been that the researchers did not notice that the calibration they used had been expressed in unusual units by the manufacturer. The authors are concerned and are checking how large the error was, but the first comparative measurements suggest that the reported values were underestimated by a factor of at least 20.

About this case, I do not yet know the whole story, but evidently it yielded a much worse result: the publication of several articles with wrong data and wrong conclusions.

Take home message

Whenever and whatever you measure, or when you use or assess non-validated data from any source, unless you know very well from experience what to expect, check the literature for ballpark numbers. In either case, if your data differ significantly from expectations try to find an explanation for the difference before you accept the data as good. You will either find an error or discover something new.
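As an illustration only, the kind of ballpark check described above can be sketched in a few lines of Python. The function name, the tolerance and all the numbers below are hypothetical, chosen just to show the idea; they are not taken from either of the two cases.

```python
def check_against_literature(measured, expected_low, expected_high, max_factor=2.0):
    """Compare a measurement with a ballpark range taken from the literature.

    Returns (ok, factor): factor is the multiplicative distance of the
    measurement from the nearest end of the expected range (1.0 if the
    value is inside the range); ok is True when factor <= max_factor.
    The default max_factor is an arbitrary, hypothetical tolerance.
    """
    if measured <= 0 or expected_low <= 0 or expected_high < expected_low:
        raise ValueError("values must be positive and the range must be ordered")
    if measured < expected_low:
        factor = expected_low / measured
    elif measured > expected_high:
        factor = measured / expected_high
    else:
        factor = 1.0
    return factor <= max_factor, factor


# Hypothetical numbers in arbitrary units: a reading ten times below the
# lower end of the range found in the literature is flagged for checking.
ok, factor = check_against_literature(measured=0.5, expected_low=5.0, expected_high=10.0)
if not ok:
    print(f"Reading is off by a factor of {factor:.0f}: find out why before using it")
```

A flagged value is not necessarily wrong; the flag only means that an explanation is needed before the data are accepted.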

Biosophical society

I guess most of you have seen this poster tacked on the elevator walls and on the front door of Biocenter 3. If you are a student or a researcher and you aim at understanding science, then the events organized by the society will surely broaden your view. At least from time to time, we all need to stop and spend some time pondering the deeper questions about Biology. This helps in many ways: 1) it makes us think about things and assumptions we normally take for granted, 2) it allows us to put our own research in a much broader context, which could reveal either unsuspected implications or logical flaws, and 3) it hopefully lifts, to some extent, our blindness to the “logic” behind research in other disciplines, allowing us to see new patterns and connections. The current participants come from several different disciplines and have different research backgrounds, which should make the discussions particularly interesting for everybody.

I will myself participate, and chair one session some time in November.

For the programme and additional information see the society’s own web site.

For info or booking a date, email Matan.Shenhav@helsinki.fi

 

“Reproducible research” is a hot question

I have long been interested in the question of reproducible research and as a manuscript author, reviewer and more recently, editor, have attempted to make sure that no key information was missing and that methods were described in full detail and, of course, valid.

Although the problem has always existed, I think that in recent years papers and reports with badly described methods have become more frequent. I think that there are many reasons for this: 1) the pressure to publish quickly and frequently as a condition for career advancement, 2) the overload of reviewers’ work and the pressure from journals to get manuscript reviews submitted within a few days’ time, 3) the stricter and stricter rules of journals about the maximum number of “free” pages, and 4) the practice by some journals of publishing methods at the end of the papers or in a smaller typeface, implying that methods are not important for most readers and are irrelevant for understanding the results described (which is a false premise).
