Introduction to Open Data Science (you may become a data scientist!)

Open Data, Open Science, and Data Science – our newest course combines these hot topics with an attractive twist. Introduction to Open Data Science (IODS) is open to everyone willing to learn how to use RStudio, R Markdown and GitHub (state-of-the-art tools of data science) to visualise and analyse open data with multivariate statistical methods, following the principles and practices of reproducible research. Cha(lle)nge your plans and become a data scientist!

Data scientist? Please don’t be mean!

By no means. The fact is that our era of data – often larger than ever and complex like chaos – requires several new skills from statisticians and other data scientists. My claim (or vision) here is very simple:

”Whatever your disciplinary identity is, we may all become data scientists.”

Think about what has happened to, say, biology: it has become more or less a cocktail of computer science and statistics – or data science (of course, with a strong biological foundation). A similar process is now shaking the humanities and social sciences (as well as many other fields). Contemporary terms like ”digital humanities” or ”computational social science” or ”digital social science” are used to emphasize this process. Indeed, it is very much a Data Revolution that is going on, and we are all deeply involved. The questions are:

  • What do you want to do with the endless amount of beautiful data?
  • Are (or ”R”) you ready to learn new skills to do what you want to do?

Core themes and learning objectives of IODS: challenging the universe!

When we started envisioning Introduction to Open Data Science (IODS) ver.1.0, we wrote down the core themes and learning objectives of IODS quite boldly, in order to challenge the crowd of (at that time) imagined students – along with ourselves:

”We must discover the patterns hidden behind numbers in matrices and arrays. We are not afraid of coding, recoding, programming, or modelling. We want to visualize, analyze, interpret, understand, and communicate. These are the core themes of Open Data Science (Open Data – Open Science – Data Science). And this course is THE course for learning these skills.”

”After completing this course you will understand the principles and advantages of using open research tools with open data and understand the possibilities of reproducible research. You will know how to use R, RStudio, R Markdown, and GitHub for these tasks and also know how to learn more about these open software tools. You will also know how to apply certain statistical methods of data science, that is, data-driven statistics.”

That was the ”big picture”. In detail, the contents of IODS include a selection of multivariate statistical methods that you will learn in order to visualise and analyse various open data sets, applying and following the principles and practices of reproducible research.

In passing, I would like to mention that my forthcoming textbook, co-authored with Brian Everitt, will appear in early 2019 (soon after IODS ver.3.0). The book and its website, with open data sets and R code, will provide an excellent path to continue, broaden, and deepen your studies around these themes.

The weekly themes of IODS (updated for ver.3.0) are as follows:

  1. Tools and methods for open and reproducible research:
    R, RStudio, R Markdown, GitHub
  2. Regression and model validation:
    Data wrangling
    Simple regression
    Multiple regression
    Regression diagnostics
  3. Logistic regression:
    Regression for binary outcomes
    Training and testing a (predictive) model
    Cross-validation
  4. Clustering and classification:
    Datasets in R
    Discriminant analysis
    K-means clustering
  5. Dimensionality reduction techniques:
    Principal component analysis
    Multiple correspondence analysis
  6. (NEW!) Analysis of longitudinal data:
    Graphical displays and summary measures
    Linear mixed effects models

From day 1, we will begin learning the software tools and platforms that we use on the course. Each week, exercises are completed, reported, and openly shared using those tools, working with open data sets and applying various statistical models, methods, and techniques that are learned on the fly. The course grade consists of the points earned from the peer reviews of the weekly reports.

Astonishing feedback from the students makes us try even harder

We have received a lot of feedback on IODS. Only a few examples appear below (some more on the course page). The feedback shows that we are not the only ones who are excited about the course!

”First of all I want to thank you all about this course which has been the funniest and most interesting ever. This was my first touch to R, GitHub and Slack. I never thought that I would get this excited about something, but I did. I noticed that the R environment is an endless world and its not as difficult as I thought at first. I will definitely continue to learn codes and statistics.”

”Now I feel that this was the best course, in which I have ever been participating, because: 1. The time schedule was very flexible, and I had an opportunity to work according my time schedule. 2. I like very much the interactive DataCamp exercises part, where I have got an idea how to start, and only after that to continue with the RStudio exercises. 3. The video lectures helped me a lot because this suits a lot to my way of studying: first to read and watch theoretical lectures, and after that continuing with the practical exercises part. 4. I like the idea for peer reviews, because in this way we had an opportunity to compare (to some extension) what we have done with the work of the others.”

”I really enjoyed this course, to be honest this is the best course that I had in Helsinki. Combining both DataCamp and Rstudio exercise was amazing idea, it helped me alot. Even though I have been using R since couple of years but during this course I learned more sophisticated ways of programming.”

”I must admit that I attended the course to see how you’ve done the arrangements, what technologies you’ve chosen and to get some ideas how to run similar courses on my own. Learn from the best is the word 🙂 I’ve used R since 2000. I still learned new things on the course!”

”The course was really interesting and hands-on approach worked well. Datacamp exercises were well organised. Need for this kind of applied statistical (data science) courses where you’re needed to clean your dataset and then use correct statistical methods is in high demand. You can get a feel that you’re learning something actually useful for real life. Learning Github has been really huge benefit.”

In addition to super positive feedback (such as the samples above), we have got plenty of constructive (sometimes critical but always useful) feedback, too, with thoughtful suggestions for future versions of IODS. We are grateful for the feedback, as it has helped us to adjust many things, such as the prerequisites of the course.

Teachers should collaborate much more with students.

The ”we” in the sections above refers to my excellent teaching assistants Tuomo Nieminen and Emma Kämäräinen, who at the time of IODS ver.1.0 were master’s students of statistics. I was pretty sure that in addition to our common visions, they would also know exactly what was needed to fulfill the visions. And they certainly did!

In addition to selecting the best tools of data science, finding suitable data sets, inventing and coding the exercises, etc., they helped the other assistants, the students – and me! – to navigate through IODS ver.1.0 while we were still finishing some details week by week. It was hard work but also a successful breakthrough in many ways.

Tuomo and Emma appear with me in this ”legendary” video that we shot ourselves in the UniTube Studio just before the first IODS (we even wrote a brief script and practiced it twice!). In retrospect, it sounds a bit clumsy (especially my own part), but who cares! The excitement in the air can still be sensed 🙂

So, from the very beginning, IODS has been created in close collaboration with students. Students often know better what is going on (especially outside academia) and which tools and skills we should teach at the university. I warmly recommend this kind of workflow.

Next version – next level. It is now – or next time!

IODS ver.3.0 will open on 30th October 2018, with new content on the analysis of longitudinal data, currently under construction by me and my current chief assistant, Petteri Mäntymaa.

If you are interested, please take the following advice into account:

  • You will need a laptop computer (Mac / Windows / Linux) on which you can download and install the software tools required throughout the course, from the very beginning. All software tools that we use are freely available.
  • A basic knowledge of introductory statistics is a serious prerequisite. For example, we assume that you are pretty familiar with the basic concepts related to simple linear regression (that is where we will start from). Feel free to ask for more detailed information about what you should know about statistics, as IODS is not meant to be a systematic introduction to statistical methods; we merely assume some kind of basic level to begin from. But variation is always a good thing. Please do not hesitate to ask, if you are interested!
  • Some elementary skills in R will certainly help. If you have not used R before, we recommend that you take a good look at a free introductory course on R on DataCamp, probably the coolest platform for learning data science online. (As one more important element, we use our own DataCamp course on IODS, and I can promise it is extremely engaging!)
  • You must be able to follow the weekly schedules. The deadlines for submitting and peer-reviewing the weekly reports are strict, and the work will begin immediately on day 1. Early enrolment and active class participation are recommended.

For more information, please see the IODS ver.3.0 course page.

So, see you on IODS! If not this time, then next time, perhaps? In any case, you are welcome to learn new skills and meet new people from different fields and disciplines with a strong passion for Open Data Science. Maybe you will become a data scientist, too!

Kimmo Vehkalahti
Fellow of the Teachers’ Academy
University (Senior) Lecturer,
Docent (Adjunct Professor) of Applied Statistics
Centre for Research Methods and Social Statistics (Data Science)
Faculty of Social Sciences, University of Helsinki