Uni Jena
Wirtschaftswissenschaftliche Fakultät
Lehrstuhl für Empirische und Experimentelle Wirtschaftsforschung
[A picture of Oliver Kirchkamp]

Workflow of statistical data analysis

This course is part of the International Max Planck Research School on Adapting Behavior in a Fundamentally Uncertain World. The target group is students who, due to the interdisciplinary nature of the IMPRS, do not have any background in statistics.
Schedule:
Lecture (daily): 27.-31.7. (for details, see the calendar)
Motivation
There are two kinds of problems a researcher faces when doing empirical work.
  • One is to find the right statistical method for the problem.
  • The other is to organise the data and the evaluation in a way such that results can be replicated.
Replication is necessary in several contexts:
  • We interrupt our work for a few days or weeks and want to go back to it quickly.
  • We share our work with a collaborator and want her or him to quickly understand what we did and to participate in an efficient way.
  • After we have sent our paper to a journal, referees might demand small changes in the analysis.
In all these cases replication seems like a trivial and obvious task. Of course, with the same data and the same methods, how can it be a problem to replicate results?

The sad truth is that researchers often find it very hard to replicate the results of their own statistical analysis. During the analysis we make a lot of small decisions. Many of them seem obvious when we make them, but when we replicate our work, it turns out that it is not clear which subset of the data we really included, how special cases were coded, how outliers were identified and treated, how bootstraps were run, what the precise meaning of each variable was, and which tests were used with which parameters. Too often, even after spending days and weeks trying out a few dozen combinations of these parameters, we cannot replicate what we did a few months ago. If we are lucky, we perhaps come close to the results we proudly published in the past, but we do not get the same results. This can be a more than embarrassing experience.

The aim of the course is to develop a strategy that helps to avoid this problem. In the course we will discuss strategies that we can use to organise our data and our analysis in a way that allows us, even years later, to redo our analysis quickly, reliably, and with exactly the same results.

An efficient workflow helps us to get back to statistical work quickly after an interruption and also helps to share an analysis with coauthors.
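
To make the idea more concrete, here is a minimal sketch in R of an analysis in which the small decisions mentioned above (which observations are used, how outliers are treated, how the bootstrap is run) are written down in the script instead of being remembered. The data, the variable names, the cutoff of 3 standard deviations, and the number of bootstrap draws are all made up for this illustration and are not taken from the course material.

```r
# Fix the random number generator so that the bootstrap below
# gives the same draws every time the script is run.
set.seed(123)

# Hypothetical data standing in for a real data set.
dat <- data.frame(id = 1:100, x = rnorm(100, mean = 10, sd = 2))

# Decision 1: which observations are included.
# Here: drop everything more than 3 standard deviations from the mean.
outlier_cutoff <- 3
z <- (dat$x - mean(dat$x)) / sd(dat$x)
used <- dat[abs(z) <= outlier_cutoff, ]

# Decision 2: how the bootstrap is run.
n_boot <- 1000
boot_means <- replicate(n_boot, mean(sample(used$x, replace = TRUE)))

# Result: a 95% percentile interval for the mean of x.
ci <- quantile(boot_means, c(0.025, 0.975))
print(ci)
```

Running the script a second time in a fresh R session should reproduce the same interval, because the seed, the subset, and the bootstrap settings are all part of the file. Results may still differ between versions of R if the random number generator has changed, so it is worth noting the R version as well.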

Topics:
  • Introduction
  • Aims of statistical data analysis
  • Organising your work
    • How to separate creativity from chaos?
    • Organising ideas in files
    • Organising ideas in functions
  • Preparing data
    • Reading data
    • Cleaning data
    • Organising data
  • Working with data
    • Documentation
    • Descriptive statistics
    • Specific results
  • Presenting results
    • Weaving and tangling
Literature
There is an interesting book on the workflow of data analysis; however, it is based on Stata: J. Scott Long, The Workflow of Data Analysis Using Stata, Stata Press, 2009.
Software
For our practical examples (during the entire course) we will use the software environment R. I think that it is helpful to coordinate on one environment and R has the advantage of being free and rather powerful.
  • Documentation for R is provided via the built-in help but also through the R Homepage. Useful documents are An Introduction to R, The R language definition, Simple R, and Econometrics in R.
  • An easier first entry into R, using mouse and menus, is available through the R Commander.
  • Users of Firefox get access to R help through the R Site Search Sidebar. However, this is a bit tricky. If the rsitesearch.xpi package does not install, open the package (e.g. in Emacs) and change two values in install.rdf: Set maxVersion to a version at least as large as the version of your browser. Set updateURL to an empty value.
  • In the lecture I use the versatile editor Emacs with the ESS interface (ESS also helps with Stata, SAS, Splus, BUGS, and others).
  • For the last part of the course we will rely on Sweave. Sweave and make are easily installed in GNU/Linux. Here are some hints on how to use Sweave with Microsoft Windows.
  • It will help if you can bring your own portable computer to the classes and exercises. You should have an up-to-date version of R installed. We will also need the following libraries: car, Ecdat, foreign, Hmisc, memisc, Sweave, tools. To work with Sweave we need LaTeX (e.g. TeX Live or MiKTeX).
  • You might also find make and Emacs useful tools. (Mac OS X users might find Aquamacs less scary.)
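
Before the first class, the package list above can be checked from within R. The following is only a sketch: it assumes that Sweave does not need to be installed separately because, in current versions of R, the Sweave function is part of the utils package that ships with R. The variable names are chosen for this example.

```r
# Packages named in the list above. Sweave is left out on the
# assumption that it already comes with R (in the utils package).
required <- c("car", "Ecdat", "foreign", "Hmisc", "memisc", "tools")

# Names of all packages that R can find in its library paths.
installed <- rownames(installed.packages())

# Packages from the list that are not yet installed.
missing <- setdiff(required, installed)

if (length(missing) > 0) {
  message("Still missing: ", paste(missing, collapse = ", "))
  # install.packages(missing)   # uncomment to install from CRAN
} else {
  message("All required packages are available.")
}
```

The call to install.packages is commented out so that the script only reports what is missing and does not change the installation on its own.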
Handout + Exam
Here is a preliminary version of the handout. Data is attached to the handout. Here you also find an example for the solution to the first exercise. Please solve the exam between 18:00 on 31st July and 18:00 on 1st August. You will need some data to solve the exam. If you are using Microsoft Windows and have difficulties using Adobe Acrobat (in particular when saving attachments), look here.