Some additions from yesterday’s session

1) Microsoft is using R internally and, what is more, is going all out for R: it has acquired Revolution Analytics and has already announced that the next version of MS SQL Server will have R built in, with a test release scheduled for September. Microsoft is also funding R development through the R Consortium, announced four days ago under the umbrella of the Linux Foundation.

2) There are rumors that some future version of Excel will have R built in!

3) R is now being accepted in the financial and commercial world and is quickly replacing SAS.

4) A single company, Teradata, has more than 400 open vacancies for employees with R experience.

5) Hewlett-Packard is releasing its own version of distributed R as open-source software.

6) The next useR! meeting will be at Stanford University, and the organizing committee includes members from both Microsoft and Google.

I think it is time for our department and the whole university to stop teaching SPSS to our undergrads and switch to R as our main statistical software.

A series of four seminars on Statistics (week 47)

Series title: Statistics: Changes since I was an undergrad

Abstract. I took my first course in Statistics 37 years ago. How we do statistics has changed dramatically since then. The amount of data we produce and analyse has also increased enormously. However, different research communities are making use of these new possibilities to very different extents. Even Biology curricula at different universities differ substantially in their emphasis on “quantitative” methods and “numerical/mathematical” literacy. All branches of Biology are becoming more quantitative. By reflecting on how advances in computers and computing science (and in the methods these advances have made possible) have opened a whole new way of approaching data analysis, I hope I will make you rethink your approach to data analysis.

If you are planning to participate next year in my course 526052 Using R for reproducible data analysis, I recommend that you attend this series of talks as a gentle introduction to the subject. If you attend, you can get credits either through the DDPS or through the regular seminar series in Plant Biology.

Place: Biocenter 3, Room 5405 (5th floor, the room in front of the stairs)

Two hours are reserved for each session: talk plus discussion.

Please, let me know by e-mail if you intend to participate.

Part 1: Increased ease of computation

Monday 17 November, 10:15-12:00

This first talk focuses on describing the advances in computing hardware and software and why they are relevant to data analysis. I will also briefly mention the now fashionable “Data Science” and “Big Data” concepts and the currently fuzzy boundary between statistics and programming.

Part 2: Advances in theory and methods

Tuesday 18 November, 13:15-15:00

If your statistical knowledge is limited to the “traditional” methods, I hope to introduce you to the new possibilities brought about by lifting the constraints that used to prevent us from using computation-intensive methods and analysing big data sets. In contrast, if you are a young researcher, well versed in modern methods, you will hopefully still find the talk interesting from a historical perspective: a glimpse of the limitations we had to deal with in the recent past, and of how they influenced, and still influence, the traditional ways of treating biological data. This talk focuses mostly on statistical theory and methods; specialised methods, like those used in molecular biology or vegetation analysis, will not be described.

Part 3: Examples of modern methods using R

Wednesday 19 November, 13:15-15:00

In this talk I will present examples of the types of analyses that have become available to any biologist thanks to the increase in computing capacity and the development of new theory and methods that make use of these new possibilities. The aim is not to teach you how to apply these methods, but to give an idea of how broad an array of methods is currently available to anyone with access to a run-of-the-mill personal computer or, failing that, a cheap cloud server.

Part 4: Reproducible research and data analysis

Thursday 20 November, 13:15-15:00

This talk introduces the currently hot topic of research accountability and repeatability: why this openness is needed, how it can be achieved in practice, and how modern software, and modern combinations of old software, make it possible to reach this goal rather painlessly even for complex data analyses. I will also reflect on the origins of these ideas in computer programming, around the concept of literate programming proposed by Donald Knuth in the early 1980s.
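To give a concrete taste of literate programming in R (an illustrative sketch of my own, not material from the talks): with the knitr and rmarkdown packages, prose and analysis code live in a single R Markdown file, and the whole report, including numbers and figures, is regenerated from the raw data on demand. The file name and content below are hypothetical.

````markdown
---
title: "A minimal reproducible report"
output: html_document
---

The mean sepal length in the built-in `iris` data set is
`r mean(iris$Sepal.Length)` cm, recomputed automatically
every time this document is knitted.

```{r scatter}
# The figure is also rebuilt from the data on every run.
plot(Sepal.Length ~ Petal.Length, data = iris)
```
````

Rendering the file, for example with `rmarkdown::render("report.Rmd")`, recomputes every inline value and figure, so the text can never drift out of step with the analysis.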