Data flow

The simplified, ideal data flow is as follows:

[ INPUT DATA ] ---( Cleaning program )---> [ ANALYSIS DATA ] ---( Analysis program )---> [ TABLES / FIGURES ]

Input data

Input data (or "raw data") are the unmodified original data. These could be the survey as collected by the author, the dataset downloaded from the US Census Bureau, or the administrative dataset on schools provided by County Cork's school district.

  • The input data are rarely provided
  • The input data may be referenced by URL or DOI, in which case the replicator should attempt to download them (this can be scripted; see the sketch after this list)
  • There may be some ambiguity, but in general, if the data are provided by somebody else, they are "input data"
  • If the data are transformed into something else before being analyzed, they are in general "input data"
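When the input data are referenced by URL, the download can be scripted so that the step itself is reproducible. A minimal Stata sketch, where the URL and the target path are hypothetical:

```stata
* Download input data referenced in the README by URL
* (the URL and paths below are hypothetical)
capture mkdir "data"
capture mkdir "data/raw"
copy "https://example.org/oil_prices.csv" "data/raw/oil_prices.csv", replace
```

Stata's copy command accepts a web address as the source, so the unmodified file lands in data/raw/ without any manual clicking.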

Analysis data

Analysis data are the direct input to the (duh) analysis.

  • There might be multiple such datasets - they can be collectively described
  • If the program that runs a regression, creates a table, or constructs a graphic loads some data, it is "analysis data"

Intermediate data

For efficiency or other reasons, programs provided by authors might create multiple intermediate datasets. For instance, the oil prices and the inflation rate might be combined into a dataset, which is saved, and which is then combined with information on the health status of pigeons to create the analysis dataset (sketched below). Such a dataset is considered "intermediate".

  • Intermediate data do not need to be described
  • They may be used to debug if something goes wrong
  • They are rarely provided in the ZIP file
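To make the example above concrete, here is a minimal Stata sketch; all file and variable names are hypothetical:

```stata
* Combine oil prices and inflation into an intermediate dataset
use "data/raw/oil_prices.dta", clear
merge 1:1 year using "data/raw/inflation.dta", nogenerate
save "data/intermediate/prices_inflation.dta", replace   // intermediate data

* Merge in pigeon health status to create the analysis dataset
merge 1:1 year using "data/raw/pigeon_health.dta", nogenerate
save "data/analysis/analysis.dta", replace               // analysis data
```

Only prices_inflation.dta is "intermediate" here: it is neither original input nor loaded directly by an analysis program.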

Cleaning programs

Cleaning programs (or data preparation programs, or "data wrangling") reformat, standardize, or otherwise manipulate the data, bringing them into a form that is amenable to the analysis intended by the authors.
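A stylized Stata sketch of a cleaning program, with hypothetical file and variable names:

```stata
* Cleaning program: read raw data, standardize, save the analysis file
import delimited using "data/raw/schools.csv", clear
rename enrolment enrollment       // standardize variable names
destring enrollment, replace      // convert strings to numbers
keep if year >= 2000              // restrict to the analysis sample
save "data/analysis/schools.dta", replace
```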

Analysis programs

Analysis programs are anything that implements the analysis described in the article. This may be

  • running an OLS or other statistical method
  • creating a summary table
  • creating a graphic

Sometimes, a regression may be run in order to create data, which are in turn analyzed or used elsewhere. That might be better described as a "data preparation program" rather than an analysis program. Sometimes, an analysis program will rely on the output from a previous analysis program. Your judgement is needed, and there is a lot of grey area.

Often, the author will have only one program, which combines data cleaning and analysis. Treat this as the analysis program.
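A stylized Stata sketch of an analysis program, again with hypothetical names; it loads the analysis data and produces a table and a figure:

```stata
* Analysis program: load analysis data, produce a table and a figure
use "data/analysis/schools.dta", clear

* A summary table
summarize enrollment testscore

* An OLS regression, as described in the article
regress testscore enrollment

* A figure, exported for the paper
scatter testscore enrollment
graph export "results/figure1.png", replace
```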

What to do when input data, code for cleaning, and analysis data are all provided?

  1. Start with the analysis dataset, and see if it replicates the paper.
  2. Go back, preserve the original analysis dataset (save it under a new name, say, with "original" in it), then run the cleaning programs.
  3. Once the analysis dataset is regenerated, compare it to the one provided (say, by running a "codebook" or a "summarize" command in Stata, and comparing the output); see the sketch below.
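A minimal Stata sketch of steps 2 and 3; the file and program names are hypothetical, and cf is only one of several ways to compare two datasets:

```stata
* Step 2: preserve the provided analysis dataset under a new name,
* then re-run the authors' cleaning program
copy "data/analysis/analysis.dta" "data/analysis/analysis-original.dta"
do "code/cleaning.do"

* Step 3: compare the regenerated dataset to the provided original
use "data/analysis/analysis.dta", clear
cf _all using "data/analysis/analysis-original.dta", verbose

* codebook / summarize output can also be compared by eye
codebook, compact
```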