Workflow of statistical data analysis - Summer 2018
This course is part of the International Max Planck Research School on Adapting Behavior in a Fundamentally Uncertain World. The course can also be credited as a part of MW24.5.
- Can be downloaded here after 10:00 on the day of the exam.
- Lecture (daily): t.b.a. KU Hörsaal, Bachstraße 18K
- The workflow of empirical work may seem obvious. It is not. Small initial mistakes can lead to a lot of hard work afterwards. In this course we discuss some techniques that, hopefully, make it easier to organise your empirical work.
- Here is a preliminary version of the handout:
Data is attached to the handout. If you are using a Microsoft operating system and have difficulties using Adobe Acrobat (in particular extracting attachments), check whether you (or your administrator) have prevented Adobe Acrobat from opening certain types of attachments. If difficulties with Adobe Acrobat persist, there are many alternatives (PDF-XChange, Foxit, ...).
- There are two kinds of problems a researcher faces when doing empirical work.
- One is to find the right statistical method for the problem.
- The other is to organise the data and the evaluation in a way such that results can be replicated.
- We interrupt our work for a few days or weeks and want to go back to it quickly.
- We share our work with a collaborator and want her or him to quickly understand what we did and to participate in an efficient way.
- After we have sent our paper to a journal, referees might demand small changes in the analysis.
The sad truth is that researchers often find it very hard to replicate the results of their own statistical analysis. During the analysis we make a lot of small decisions. Many of them seem obvious when we make them, but when we try to replicate our work, it turns out that it is not clear which subset of the data we really included, how special cases were coded, how outliers were identified and treated, how bootstraps were run, what the precise meaning of each variable was, and which tests were used with which parameters. Too often, even after spending days and weeks trying out a few dozen combinations of these parameters, we cannot replicate what we did a few months ago. If we are lucky, we perhaps come close to the results we proudly published in the past, but we do not get the same results. This can be a more than embarrassing experience.
The aim of the course is to develop a strategy that helps to avoid this problem. We will discuss how to organise our data and our analysis in a way that allows us, even years later, to redo our analysis quickly, reliably, and with exactly the same results.
An efficient workflow helps us to get back to statistical work quickly after an interruption and also helps to share an analysis with coauthors.
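One of the small decisions mentioned above, how bootstraps were run, can be pinned down in code. The following is only a minimal sketch and not material from the course handout: the sample x, the function name boot_mean, the number of resamples and the seed 123 are all made up for illustration. It shows that fixing the seed of the random number generator makes a bootstrap repeat exactly, and that sessionInfo() records the R and package versions that were used.

```r
# Illustrative data only; these numbers are invented for the example.
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7)

# Bootstrap the mean of x. The seed is an explicit argument, so the
# choice is documented in the code instead of being forgotten.
boot_mean <- function(x, n_boot = 1000, seed = 123) {
  set.seed(seed)  # fix the state of the random number generator
  replicate(n_boot, mean(sample(x, replace = TRUE)))
}

first_run  <- boot_mean(x)
second_run <- boot_mean(x)

# Same seed, same resamples: the two runs can be compared directly.
print(identical(first_run, second_run))

# Store the versions of R and of all attached packages with the results.
print(sessionInfo())
```

Making the seed a function argument, rather than calling set.seed() somewhere at the top of a long script, keeps the decision next to the computation it affects.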
- Aims of statistical data analysis
- Organising your work
- How to separate creativity from chaos?
- Organising ideas in files
- Organising ideas in functions
- Preparing data
- Reading data
- Cleaning data
- Organising data
- Working with data
- Descriptive statistics
- Specific results
- Presenting results
- Weaving and tangling
- Version control
- Workflow in general:
- There is an interesting book on workflow of data analysis, however it is based on Stata: J. Scott Long, The Workflow of Data Analysis Using Stata, Stata Press, 2009.
- Hadley Wickham provides a view on “Tidy Data”: Hadley Wickham; Tidy Data; Journal of Statistical Software, 2014.
- Friedrich Leisch; Sweave User Manual.
- Nicola Sartori; Sweave = R · LATEX2.
- Yihui Xie; knitr - Elegant, flexible, and fast dynamic report generation with R.
- Max Kuhn; CRAN Task View: Reproducible Research.
- It will help if you can bring your own portable computer to the classes and exercises. For our practical examples (during the entire course) we will use the software environment R. I think that it is helpful to coordinate on one environment, and R has the advantage of being free and rather powerful.
- Documentation for R is provided via the built-in help system but also through the R Homepage. Useful are:
- The R Guide, Jason Owen (Easy to read, explains R with the help of examples from basic statistics)
- Simple R, John Verzani (Explains R with the help of examples from basic statistics)
- Einführung in R, Günther Sawitzki (In German. Rather compact introduction.)
- Econometrics in R, Grant V. Farnsworth (The introduction to R is rather compact and pragmatic.)
- An Introduction to R, W. N. Venables und D. M. Smith (The focus is more on R as a programming language)
- The R language definition (Concentrates only on R as a programming language.)
- We will use the following packages: car, Ecdat, foreign, Hmisc, knitr, lattice, memisc, tikzDevice, tools, xtable. If, e.g., the command library(Ecdat) generates an error message (Error in library(Ecdat): There is no package called 'Ecdat'), you have to install the package.
- Installing packages with Microsoft Windows: Start Rgui.exe and install packages from the menu (Packages / Install Packages).
- Installing packages from advanced operating systems:
- From within R use the command install.packages("Ecdat"), e.g., to install the package Ecdat.
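The check for missing packages can also be scripted. The following is a minimal sketch, assuming the package list given above; the names course_packages and missing_packages are invented for this illustration. The call to install.packages() is left commented out because it needs an internet connection and changes your R library.

```r
# Packages used in the course (list taken from the text above).
course_packages <- c("car", "Ecdat", "foreign", "Hmisc", "knitr",
                     "lattice", "memisc", "tikzDevice", "tools", "xtable")

# Return the names of those packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(course_packages)
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

Running the script before the first lecture shows which packages still need to be installed, whatever the operating system.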
- In the lecture we will use RStudio as a front end.
- For weaving and knitting we need LaTeX (e.g. TeX Live or MiKTeX).
- RStudio provides a front end to R, LaTeX, git and svn.
- In the course we will use git as an example for a version control system. git might be already installed on your computer. You should also have a merge-tool, e.g. meld. (Any of kdiff3, araxis, bc3, codecompare, diffuse, ecmerge, emerge, gvimdiff, opendiff, p4merge, tkdiff, tortoisemerge, vimdiff, xxdiff... would work as well).
- Stata, unfortunately, does not have an equivalent to Sweave. Still, there are some tools: