![[A picture of Oliver Kirchkamp]](../images/oliver5344.jpeg)
Workflow of statistical data analysis
This course is part of the International Max Planck Research School on Adapting Behavior in a Fundamentally Uncertain World. The target group is students who, due to the interdisciplinary nature of the IMPRS, do not have any background in statistics.
- Schedule:
- Lecture (daily): 27.-31.7. (for details, see the calendar)
- Motivation
- There are two kinds of problems a researcher faces when doing empirical work.
- One is to find the right statistical method for the problem.
- The other is to organise the data and the evaluation in a way such that results can be replicated.
- We interrupt our work for a few days or weeks and want to go back to it quickly.
- We share our work with a collaborator and want her or him to quickly understand what we did and to participate in an efficient way.
- After we have sent our paper to a journal, referees might demand small changes in the analysis.
The sad truth is that researchers often find it very hard to replicate the results of their own statistical analysis. During the analysis we make a lot of small decisions, many of which seem obvious when we make them. But when we try to replicate our work, it turns out that it is not clear which subset of the data we really included, how special cases were coded, how outliers were identified and treated, how bootstraps were run, what the precise meaning of each variable was, and which tests were used with which parameters. Too often it happens that, even after spending days and weeks trying out a few dozen combinations of these parameters, we cannot replicate what we did a few months ago. If we are lucky, we perhaps come close to the results we proudly published in the past, but we do not get the same results. This can be a more than embarrassing experience.
The aim of the course is to develop a strategy that helps to avoid this problem. In the course we will discuss strategies that we can use to organise our data and our analysis in a way that allows us, even years later, to redo our analysis quickly, reliably, and with exactly the same results.
An efficient workflow helps us to get back to statistical work quickly after an interruption and also helps to share an analysis with coauthors.
- Topics:
- Introduction
- Aims of statistical data analysis
- Organising your work
- How to separate creativity from chaos?
- Organising ideas in files
- Organising ideas in functions
- Preparing data
- Reading data
- Cleaning data
- Organising data
- Working with data
- Documentation
- Descriptive statistics
- Specific results
- Presenting results
- Weaving and tangling
- Literature
- There is an interesting book on workflow of data analysis, however it is based on Stata: J. Scott Long, The Workflow of Data Analysis Using Stata, Stata Press, 2009.
- Exam
- Thursday, 21. July, 8:00 AM. Please solve all questions independently and return your answers as a single PDF file by Friday, 22. July, 9:00 AM.
- Software
- It will help if you can bring your own portable computer to the classes and exercises. You should have an up-to-date version of R installed. We will also need the following libraries: car, Ecdat, foreign, Hmisc, memisc, Sweave, tools. To work with Sweave we need LaTeX (e.g. TeX Live or MiKTeX). To work with svn we need subversion. While there are lots of clients for SVN with a nice and shiny graphical user interface we will only use the command line in this course.
- On a Debian-based Linux system (e.g. Ubuntu) install subversion with `sudo apt-get install subversion`.
- On Mac OS X subversion is usually already installed.
- On Microsoft Windows you can install e.g. SlikSVN in a directory of your choice. Then either adjust the environment variable `PATH` to include the path to the SVN binaries (My Computer / Properties / Advanced / Environment / Path) or always include the complete path in the command, like `"C:\Program Files\SlikSvn\bin\svn.exe"`.
- Since we will use the command line for our examples with svn, you have to get a command prompt. For Microsoft Windows up to XP this was fairly easy: open the start menu, choose “Run” and type in `command`. A black window with a command prompt will open. For later versions of Microsoft Windows type “Command Prompt” into the search box and then, in the list of results, choose “Command Prompt”.
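Before the first class it may be worth confirming that the command-line tools can actually be found from your prompt. The following is a minimal sketch for a Linux or Mac OS X shell. The helper name `check_tool` is made up for this example and is not part of subversion. It only tests whether a program is reachable via `PATH`.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of subversion or R).
# It reports whether a program can be found via PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found, check the installation or PATH"
        return 1
    fi
}

# Tools mentioned above for the course; a missing tool is reported, not fatal.
for tool in svn R latex; do
    check_tool "$tool" || true
done
```

If svn is found, `svn --version` prints the installed version. On a Microsoft Windows command prompt, `where svn` plays a similar role.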
- If you want to use Sweave with Microsoft Windows here are some hints:
- Install MiKTeX
- Install Ghostscript
- As an editor you can use TeXnicCenter. When you start TeXnicCenter, it will ask for the location of your LaTeX installation. Point it to `miktex\bin` within the folder into which you installed MiKTeX in step 1.
- Save the file latex-ps-pdf.tco to your disk, start TeXnicCenter and hit Alt-F7 to open the output profile preferences. Click “import” at the bottom of the window and select the profile you have just saved. Unfortunately, you will still have to adjust a total of six path names:
- Select the newly imported profile “LaTeX => PS => PDF” from the list on the left-hand side. The first three path names are in the LaTeX tab, which should have been selected automatically. Currently the “Path to the (La)TeX compiler” should be “latex.exe”. Depending on the exact path you chose to install MiKTeX to, you have to change it to something like “C:\Program Files\MiKTeX 2.9\miktex\bin\latex.exe”. Clicking the “...” button next to the path name lets you browse for the file.
- Make similar changes to “Path to BibTeX executable” and “Path to MakeIndex executable”.
- Switch to the “Postprocessor” tab. First, select “DviPS (PDF)” and change “executable” to something like “C:\Program Files\MiKTeX 2.9\miktex\bin\dvips.exe”.
- Select “Ghostscript (ps2pdf)” and change the executable to something like “C:\Program Files\ghostscript\gs8.62\bin\gswin32.exe”. Again, the exact path to use depends on the name of the directory that contains your Ghostscript installation.
- The last pathname is that for your preferred PDF viewer, most likely Adobe Reader. Select the “Viewer” tab and change “Path of executable” to something like “C:\Program Files\Adobe\Reader 9.0\Reader\AcroRd32.exe”. Once you are done, click “OK” and close TeXnicCenter for the moment.
- Install R
- Start R and use the command `setwd()` to set your working directory to the directory that contains your `example.Snw` file. The command will look like this: `setwd("C:/Documents and Settings/yourUserName/Documents/sweave")`. The example file uses a package called stats. Install it by issuing: `install.packages("stats")`. Select a CRAN mirror near you and click "OK".
- Every time you want to produce a new PDF from your Sweave file, you will have to do two things.
- In R issue the command: `Sweave("example.Snw", stylepath=TRUE)`
- In TeXnicCenter open the file `example.tex` and hit Ctrl-Shift-F5 to build and view the current file. Please note: Do not edit `example.tex`, as R will overwrite its contents and your changes will be lost when you use the Sweave command again. Instead, edit `example.Snw`.
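The two steps above can also be chained in a small script, so that the PDF is always rebuilt the same way. This is only a sketch under assumptions: it expects a Unix-like shell with `Rscript`, `latex`, `dvips` and `ps2pdf` on the `PATH`, and the names `step`, `build_sweave` and `DRY_RUN` are invented for this example. With `DRY_RUN=1` the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: Sweave in R, then LaTeX => PS => PDF, as one script.
# step, build_sweave and DRY_RUN are hypothetical names for this example.

# Run a command, or only print it when DRY_RUN=1.
step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_sweave() {
    name=$1
    # Step 1: R turns NAME.Snw into NAME.tex (the same call as in the R console).
    step Rscript -e "Sweave('$name.Snw', stylepath=TRUE)" || return 1
    # Step 2: LaTeX => PS => PDF, stopping at the first failure.
    step latex "$name.tex" &&
        step dvips -o "$name.ps" "$name.dvi" &&
        step ps2pdf "$name.ps"
}

# Show the commands without running them; set DRY_RUN=0 to really build.
DRY_RUN=1
build_sweave example
```

As with the manual procedure, only `example.Snw` should be edited, since `example.tex` is regenerated on every run.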
- You might also find make and Emacs useful tools. (Mac OS X users might find Aquamacs less scary.)
For our practical examples (during the entire course) we will use the software environment R. I think that it is helpful to coordinate on one environment, and R has the advantage of being free and rather powerful.
- Documentation for R is provided via the built-in help but also through the R Homepage. Useful are:
- The R Guide, Jason Owen (Easy to read, tries to explain R with the help of examples from basic statistics)
- Simple R, John Verzani (Tries to explain R with the help of examples from basic statistics)
- Einführung in R, Günther Sawitzki (In German. Rather compact introduction. The statistical part can be quite demanding)
- Econometrics in R, Grant V. Farnsworth (The introduction to R is rather compact and pragmatic. The econometric models go beyond what we are doing in this lecture)
- An Introduction to R, W. N. Venables and D. M. Smith (The focus is more on R as a programming language)
- The R language definition (Concentrates only on R as a programming language. A must to read if you write your own functions)
- A first entry into R, eased by mouse and menus, is available through the R Commander.
- An interesting development environment is RStudio.
- In the lecture I use the versatile editor Emacs with the ESS interface (ESS also helps with Stata, SAS, Splus, BUGS, and others). Users of Mac OS X will prefer the Emacs clone Aquamacs.
- Handout
- Here is a preliminary version of the handout. Data is attached to the handout. If you are using a Microsoft operating system and have difficulties using Adobe Acrobat (in particular when saving attachments) look here
