#### Workflow of statistical data analysis

This course is offered within the context of the IMPRS BeSmart Summerschool.

- Asynchronous teaching
  - Videos can be found here.
  - Exercises: See below. Participants submit their answers each day before exercises start.
- Synchronous teaching
  - Daily exercises (16.8.-20.8.), 12:30-13:30.

During synchronous teaching participants will use RStudio and the software mentioned below.

Statistical data analysis poses two challenges:

- One is to find the right statistical method for the problem.
- The other is to organise the data and the evaluation in such a way that results can be replicated.

Being able to replicate an analysis matters in several situations:

- We interrupt our work for a few days or weeks and want to go back to it quickly.
- We share our work with a collaborator and want her or him to quickly understand what we did and to participate in an efficient way.
- After we have sent our paper to a journal, referees might demand small changes in the analysis.

The sad truth is that researchers often find it very hard to replicate the results of their own statistical analysis. During the analysis we make a lot of small decisions. Many of them seem obvious when we make them, but when we try to replicate our work, it turns out that it is not clear which subset of the data we really included, how special cases were coded, how outliers were identified and treated, how bootstraps were run, what the precise meaning of each variable was, and which tests were used with which parameters. Too often, even after spending days and weeks trying out a few dozen combinations of these parameters, we cannot replicate what we did a few months ago. If we are lucky, we perhaps come close to the results we proudly published in the past, but we do not get the same results. This can be a more than embarrassing experience.

The aim of the course is to develop a strategy that helps to avoid this problem. In the course we will discuss strategies that we can use to organise our data and our analysis in a way that allows us, even years later, to redo our analysis quickly, reliably, and with exactly the same results.

An efficient workflow helps us to get back to statistical work quickly after an interruption and also helps to share an analysis with coauthors.
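As a small illustration of the problem described above, the following sketch records every analysis decision (which observations are included, how many bootstrap draws are taken, which random seed is used) explicitly in the script. The built-in dataset `cars`, the cut-off `max_speed`, the seed, and the helper `boot_mean_dist` are assumptions made only for this example; they are not taken from the course material.

```r
# All analysis decisions are stated in one place (illustrative values only).
seed      <- 123   # assumption: any fixed integer would do
n_boot    <- 200   # number of bootstrap draws
max_speed <- 20    # hypothetical rule: keep observations with speed <= 20

# Hypothetical helper: bootstrap the mean of `dist` with a fixed seed.
boot_mean_dist <- function(data, n_boot, seed) {
  set.seed(seed)   # fixing the seed makes the resampling repeatable
  replicate(n_boot, mean(sample(data$dist, replace = TRUE)))
}

# The subset of the data is defined in code, not by hand.
sample_used <- subset(cars, speed <= max_speed)

run1 <- boot_mean_dist(sample_used, n_boot, seed)
run2 <- boot_mean_dist(sample_used, n_boot, seed)

# Same data, same seed, same code: the two runs agree exactly.
identical(run1, run2)
```

Because the seed, the subset rule, and the number of draws are written down in the script, a later run does not depend on remembering them.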

- Introduction, replication and robustness (Exercise on Mon., 16.8., 12:30)
  - Motivation
  - Is workflow obvious? Consistency. Reproducibility.
  - Replicability. Structure of a paper.
  - Aims of statistical data analysis. Making the analysis reproducible. Interaction with coauthors.

- Documentation I (Exercise on Tue., 17.8., 12:30)
  - Literate programming. Weaving and tangling.
  - An example with Rnw, R and LaTeX.
  - Why markup languages? An example with R-Markdown.

- Documentation II (Exercise on Wed., 18.8., 12:30)
  - Chunks. How to include results.
  - Practical issues. Tables. Alternatives.
  - Version control (Motivation)

- Documentation III (Exercise on Thu., 19.8., 12:30)
  - Version control with git. Non-linear work.
  - Concurrent edits. Limitations.
  - Version control - practicalities.

- Organising work (Exercise on Fri., 20.8., 12:30)
  - Scripting. Robustness.
  - Functions. Calculations that take a lot of time. Randomness. Exploiting structure.
  - Human readable scripts. Debugging. Structure in Models. Results of functions.

- There is an interesting book on workflow of data analysis; however, it is based on Stata: J. Scott Long; The Workflow of Data Analysis Using Stata; Stata Press, 2009.
- Hadley Wickham; Tidy Data; Journal of Statistical Software, 2014.
- Hadley Wickham, Garrett Grolemund; R for Data Science; 2017.

- Documentation for R is provided via the built-in help system and also through the R Homepage. Useful introductions are:
  - The R Guide, Jason Owen (Easy to read, explains R with the help of examples from basic statistics.)
  - Simple R, John Verzani (Explains R with the help of examples from basic statistics.)
  - Einführung in R, Günther Sawitzki (In German. Rather compact introduction.)
  - Econometrics in R, Grant V. Farnsworth (The introduction to R is rather compact and pragmatic.)
  - An Introduction to R, W. N. Venables and D. M. Smith (The focus is more on R as a programming language.)
  - The R language definition (Concentrates only on R as a programming language.)

- We will use the following packages: `car, Ecdat, foreign, Hmisc, knitr, tidyverse, lattice, memisc, tikzDevice, tools, xtable`. If, e.g., the command `library(Ecdat)` generates an error message (`Error in library(Ecdat): There is no package called 'Ecdat'`), you have to install the package.
  - Installing packages with Microsoft Windows:
    - With RStudio: Use the tab “Install”. Otherwise: Start `Rgui.exe` and install packages from the menu `Packages / Install Packages`.
  - Installing packages from modern operating systems:
    - From within R use the command `install.packages("Ecdat")`, e.g., to install the package `Ecdat`.
- In the lecture we will use RStudio as a front end.
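
Instead of testing each package with `library()` one by one, the package list above can be checked in a single step. The following is a minimal sketch; the variable names `needed`, `installed` and `missing_pkgs` are chosen only for this example, and the installation line is left commented out because it needs an internet connection.

```r
# Packages named in the course description.
needed <- c("car", "Ecdat", "foreign", "Hmisc", "knitr", "tidyverse",
            "lattice", "memisc", "tikzDevice", "tools", "xtable")

# Names of all packages found in the current R library paths.
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet.
missing_pkgs <- setdiff(needed, installed)
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

An empty result (`character(0)`) means that all listed packages are already available.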

- LaTeX
  - For weaving and knitting we need LaTeX (e.g. TeX Live or MiKTeX).
- RStudio
  - RStudio provides a front end to R, LaTeX, git and svn.
- git
  - In the course we will use git as an example for a version control system. git might be already installed on your computer. You should also have a merge tool, e.g. `meld`. (Any of kdiff3, araxis, bc3, codecompare, diffuse, ecmerge, emerge, gvimdiff, opendiff, p4merge, tkdiff, tortoisemerge, vimdiff, xxdiff... would work as well.)
- Stata
  - Stata, unfortunately, does not have an equivalent to Sweave and knitr. Still, there are some tools:

#### Exercises

Please send your answers to the following questions as an email to `oliver@kirchkamp.de`.

##### Exercise 1. Submit before Mon., 16.8., 12:00.

Summarise briefly (about 100-200 words) how you see your own workflow (for your past research as well as for your future plans). Do you see any workflow related issues in your future research?

##### Exercise 2. Submit before Tue., 17.8., 12:00.

The `R` command `help(package="datasets")` gives you a list of the datasets that are provided by the package `datasets`. Find a dataset whose name starts with the same letter as your last name. If there is no matching dataset, find one with the next letter in the alphabet. After the letter `Z` continue with the letter `A`.
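
The lookup rule in this exercise (same first letter, otherwise the next letter, wrapping around from `Z` to `A`) can also be sketched in code. The helper `find_datasets` below is hypothetical and not part of the exercise; it assumes that upper and lower case are treated alike, and the last names `"Zimmer"` and `"Mueller"` are made up for illustration.

```r
# Hypothetical helper: list datasets from package "datasets" whose names
# start with the first letter of `last_name`, moving on to the next letter
# (and wrapping from Z to A) while nothing matches.
find_datasets <- function(last_name) {
  items <- unname(data(package = "datasets")$results[, "Item"])
  # Some entries look like "beaver1 (beavers)"; keep only the object name.
  items <- sub(" .*$", "", items)

  start <- match(toupper(substr(last_name, 1, 1)), LETTERS)
  stopifnot(!is.na(start))  # the name must start with a letter A-Z

  for (offset in 0:25) {
    letter <- LETTERS[(start - 1 + offset) %% 26 + 1]
    hits <- items[toupper(substr(items, 1, 1)) == letter]
    if (length(hits) > 0) {
      return(hits)
    }
  }
  character(0)
}

find_datasets("Zimmer")
find_datasets("Mueller")
```

The modulo step `(start - 1 + offset) %% 26 + 1` implements the wrap-around from `Z` back to `A`.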

Next write a brief document either as `Rnw` or `Rmd`. The document includes your name, your email address, the name of the dataset and one statistic from the dataset that you find interesting. Send me only the `Rnw` or `Rmd` file as an attachment to your email.

##### Exercise 3. Submit before Wed., 18.8., 12:00.

In the dataset from the previous exercise, find some information that can be represented as a table. Add this table to your `Rnw` or `Rmd` file (for `Rnw`, the `xtable` function might help; for `Rmd` you might have a look at the package `huxtable`). Send me the `Rnw` or `Rmd` file as an attachment to your email.
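
As a rough illustration of what "information that can be represented as a table" might look like, the sketch below computes group means in base R and passes them to `xtable` only if that package is installed. The dataset `cars`, the grouping breaks and the caption are assumptions made for this example; in an `Rnw` file the chunk would typically also need the chunk option `results='asis'` so that the LaTeX output is included verbatim.

```r
# Group the built-in dataset `cars` by speed (illustrative breaks).
df <- transform(cars, speed_group = cut(speed, breaks = c(0, 10, 20, 30)))

# Mean stopping distance per speed group.
tab <- aggregate(dist ~ speed_group, data = df, FUN = mean)

# Use xtable for LaTeX output if it is available, plain printing otherwise.
if (requireNamespace("xtable", quietly = TRUE)) {
  print(xtable::xtable(tab, caption = "Mean stopping distance by speed group"))
} else {
  print(tab)
}
```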

##### Exercise 4. Submit before Thu., 19.8., 12:00.

Create an empty `git` repository (you may use RStudio). Create a first commit in this empty repository. Then add the `Rnw` or `Rmd` file from Exercise 2 to the repository and create another commit with this file added. Then include the changes of Exercise 3 and commit again. Finally, create a `zip`-compressed archive of your `.git` folder. If you don’t know how to create a `zip`-compressed archive, you can use `R`: Use the `zip` package from `R` and (in the working directory where the `.git` folder is) execute the following command:

```
library(zip)
# Run this in the working directory that contains the .git folder.
zip("exercise3.zip", ".git")
```

Send only the `zip` archive as an attachment to your email.

##### Exercise 5. Submit before Fri., 20.8., 12:00.

Have another look at your solution to Exercise 3. Rewrite your `Rnw` or `Rmd` file so that your statistical analysis for Exercise 3 is now done entirely within a function. You define this function in one chunk. In a second chunk you call this function. The compiled paper should have the same appearance as the one for Exercise 3. You can obtain extra credit if the function takes parameters. Send the `Rnw` or `Rmd` file as an attachment to your email.
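
The split into a defining chunk and a calling chunk might look roughly like the sketch below. The function `mean_by_group`, its parameters and the dataset `cars` are hypothetical and serve only to show the pattern; the comments mark where the two chunks of the `Rnw` or `Rmd` file would begin.

```r
# --- Chunk 1: define the analysis as a function with parameters ---
# data:   a data frame
# value:  name of the numeric column to summarise
# group:  name of the numeric column used for grouping
# breaks: cut points that define the groups
mean_by_group <- function(data, value, group, breaks) {
  groups <- cut(data[[group]], breaks = breaks)
  result <- aggregate(data[[value]], by = list(groups), FUN = mean)
  names(result) <- c(group, paste0("mean_", value))
  result
}

# --- Chunk 2: call the function ---
tab <- mean_by_group(cars, value = "dist", group = "speed",
                     breaks = c(0, 10, 20, 30))
print(tab)
```

Passing the column names and the breaks as parameters means the same function can be reused for another dataset without editing its body.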