Preparing the data

Before doing any analysis it is necessary to prepare the data so that it is as error-free as possible and organized so that it is suitable for spreadsheet programs to read it. There are many sources of error, such as coding errors, that should be checked before starting any analysis. Otherwise the risk of drawing incorrect conclusions because of biased data remains. The first thing to do is to browse through the data and try to find the possible errors manually. All the variables should also be checked for their range of possible values – for example, there are no 999-years-old people – so checking the minimum and maximum values for all variables is useful. The convention for coding “missing values” (i.e. non-responses) is to mark them as 9999, N/A (standing for “no answer”), or blank. You should check whether the program you are using interprets missing values correctly.

Another type of data-preparation problem can occur if the data has been gathered by hand or scraped from a website, discussion boards, Facebook, etc. Data that comprises text, such as names of people or companies, can include spelling errors, missing letters, etc. Usually such errors have to be dealt with by hand, but nowadays there are good computational resources for the purposes of data cleaning, such as Google’s Open refine or Fusion Tables. Open refine is easy to use, and there are many good tutorials on its main functions. Open refine includes algorithms that can, for instance, help to identify spelling errors and unify textual values. Once ready, the data can be transferred into a CSV file and uploaded into any spreadsheet program.