20 November 2014

Broadly speaking, reproducible research aims to generate and report data analyses and scientific claims in a reproducible manner. Put simply, with one click others should be able to regenerate both the results you reported and the report itself. From my personal experience of several years of quantitative genetics research, this broad definition is too ambitious. In a real-world study, we normally spend many days, weeks, or even years tweaking and hammering on the data. Along the way, many versions of intermediate data are generated, many model-fitting parameters are tried, and many figures are plotted and later changed. It is painful to reproduce even your own research, since you are usually running multiple projects across several years, let alone to share it with others.

Modern data analyses rely on tools. Thanks to RStudio, a powerful IDE for R, and GitHub, a version control platform, the idea of reproducible research has become largely feasible. After a workshop in RILAB, some of my colleagues showed interest in my workflow for project management and documentation, although it is far from completely reproducible. That encouraged me to blog about it and share some tools that might be helpful to others. Here is the workflow I use.

Setup project

The project menu in the top-right corner of RStudio lets you create a new project and switch between existing ones.
Inside the new project directory, I normally start with the following commands:

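# Install once if needed:
# install.packages('ProjectTemplate')
library('ProjectTemplate')
# Create the template layout in a scratch directory called temp
create.project('temp')
# Move the generated files into the current project directory (Unix shell)
system("mv temp/* .")
# Remove the now-unneeded scratch directory
system("rm -r temp/")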
The setup relies on three tools:
  • GitHub for version control
  • packrat for R package dependency management
  • ProjectTemplate for laying out the working directories
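
The two system() calls in the setup depend on a Unix shell, and the shell's * pattern skips hidden files. As a minimal sketch of a portable alternative, the helper below does the same move using only base R. The name move_contents is hypothetical and not part of ProjectTemplate; it assumes the scratch directory and the destination are on the same filesystem.

```r
# Hypothetical helper (not part of ProjectTemplate): move everything inside
# `src` into `dest`, then delete `src`. Uses base R only.
move_contents <- function(src, dest = ".") {
  # all.files = TRUE also lists dotfiles; no.. = TRUE drops "." and ".."
  entries <- list.files(src, all.files = TRUE, no.. = TRUE)
  targets <- file.path(dest, entries)

  # Stop rather than silently overwrite anything already in `dest`
  clashes <- entries[file.exists(targets)]
  if (length(clashes) > 0) {
    stop("refusing to overwrite: ", paste(clashes, collapse = ", "))
  }

  moved <- file.rename(file.path(src, entries), targets)
  if (!all(moved)) {
    stop("could not move: ", paste(entries[!moved], collapse = ", "))
  }

  # Remove the now-empty scratch directory
  unlink(src, recursive = TRUE)
  invisible(entries)
}

# Usage in the workflow above (requires ProjectTemplate to be installed):
# library('ProjectTemplate')
# create.project('temp')
# move_contents('temp')
```

Refusing to overwrite is a deliberate choice here: a freshly created RStudio project may already hold files such as a README, and an error is easier to recover from than a silently replaced file.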

