Folder Organization - norrissam/ra-manual GitHub Wiki

We store project files in both the GitHub repo and a Dropbox folder. All code and code-like inputs (see below), as well as exhibit outputs, should live in the GitHub repo where they can be version controlled. All data should live in the Dropbox folder, which can accommodate large data files.

GitHub repository

The GitHub repo has two main folders: code and exhibits.

  • /code contains master.do and pathnames.do. master.do should run the entire project, including data cleaning, exhibit creation, and compilation.
  • /code/cleaning contains all data prep code.
  • /code/analysis contains all code to construct tables and figures.
  • /code/ado contains project-specific ado files.
  • /exhibits contains a .tex file of the currently relevant exhibits and the compiled PDF (do not commit auxiliary LaTeX files such as .aux to GitHub).
  • /exhibits contains /tab, /fig, and /facts subfolders for tables, figures, and data-generated facts (e.g., statistics or coefficients). Each can contain subfolders by theme (e.g., placebo for placebo checks).
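
The layout above can be checked mechanically. The following is a minimal sketch in Python (the project code itself is Stata; Python is used here only for illustration). The function name `missing_repo_items` is hypothetical, and the folder and file names are copied from the list above.

```python
from pathlib import Path

# Folder and file names taken from the list above; adjust if a project deviates.
EXPECTED_REPO_FOLDERS = [
    "code",
    "code/cleaning",
    "code/analysis",
    "code/ado",
    "exhibits",
    "exhibits/tab",
    "exhibits/fig",
    "exhibits/facts",
]
EXPECTED_REPO_FILES = ["code/master.do", "code/pathnames.do"]


def missing_repo_items(repo_root: str) -> list[str]:
    """Return the expected folders and files that are absent under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_REPO_FOLDERS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_REPO_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_repo_items(".")` from the top of a repo returns an empty list when the layout matches.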

Each project will have a GitHub wiki to record major decisions (information to be referenced in the paper) and minor decisions (unlikely to be referenced in the paper). We will also record in the wiki any references to material outside the project that may eventually be useful for the paper.

The repo may also include a notes folder for plain text notes that we would like to be versioned.

GitHub likes to keep total storage low and will not allow users to push files larger than 100 MB. If a code or output file is larger than about 10 MB, we should store it on Dropbox instead.
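
One way to catch such files before committing is a small scan of the working tree. This is a rough sketch in Python, not part of the project pipeline. The function names are hypothetical, and treating 10 MB as 10 × 1024² bytes is an assumption.

```python
from pathlib import Path

# Assumed threshold, based on the "about 10 MB" guideline; 1 MB is taken as 1024**2 bytes.
SIZE_LIMIT_BYTES = 10 * 1024 * 1024


def is_oversized(size_bytes: int, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if a file of this size should be stored on Dropbox instead."""
    return size_bytes > limit


def find_oversized_files(repo_root: str, limit: int = SIZE_LIMIT_BYTES) -> list[Path]:
    """List files under repo_root larger than limit, skipping Git's own .git folder."""
    root = Path(repo_root)
    oversized = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue  # Git internals are not project files
        if path.is_file() and is_oversized(path.stat().st_size, limit):
            oversized.append(path)
    return sorted(oversized)
```

Running `find_oversized_files(".")` from the repo root lists the files that are candidates for Dropbox.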

Dropbox

Dropbox stores all data files as well as other project documents.

  • ~/ProjectName/Data/Raw stores all original data files that are used in the analysis.
    • The only manipulations that we will generally make within the /Data/Raw folder are: (1) unzipping; and (2) importing into formats readable by Stata, R, etc. When such manipulations are necessary, we should use the following subfolder structure
    • All folders containing raw data should have a plain text README file that details the data source and any additional information necessary for replication.
  • ~/ProjectName/Data/Intermediate stores all intermediate data files, i.e., files created by the code in the GitHub /code folder.
  • ~/ProjectName/Data/Clean stores all data files used to create exhibits.
  • ~/ProjectName/Admin stores time sheets, data use agreements, etc.
  • ~/ProjectName/Literature stores all relevant literature, typically in pdf format. Please use the following format to name files: Authors_ShortTitle_Year.pdf, and when useful arrange in subfolders.
  • ~/ProjectName/Notes stores any comments or notes that we want to share with each other that are too long or complicated for a post on Slack (e.g., a LaTeX file and associated PDF that works through a proof). Very long notes in plain text format (e.g., long LaTeX proofs) could instead go in a notes folder within the GitHub repo.
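
The Authors_ShortTitle_Year.pdf convention for the Literature folder can be checked with a short pattern. This Python sketch assumes that Authors and ShortTitle contain no underscores or spaces and that Year has four digits. The wiki does not spell out those details, so treat them as assumptions. The function name and the example file names are hypothetical.

```python
import re

# Assumption: three underscore-separated parts, no spaces, four-digit year, .pdf extension.
LITERATURE_NAME = re.compile(r"[^_\s]+_[^_\s]+_\d{4}\.pdf")


def is_valid_literature_name(filename: str) -> bool:
    """Return True if filename looks like Authors_ShortTitle_Year.pdf."""
    return LITERATURE_NAME.fullmatch(filename) is not None
```

For example, a hypothetical `SmithJones_MinimumWage_2015.pdf` passes, while `paper final.pdf` does not.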

Code structure

  • In general, the code in code/cleaning will operate on data from Data/Raw and place its output in Data/Intermediate, while the final constructed data will be placed in Data/Clean.
  • Then code in code/analysis will take data from Data/Clean and construct figures and tables to be placed in /exhibits.
  • To the extent that is reasonable, the cleaning code, analysis code, and data folders should have parallel structures. Naming should be self-explanatory whenever possible. For example, Data/Raw and Data/Intermediate might both have a subfolder called census_data. Code in code/cleaning/census_data would clean the raw census data in Data/Raw/census_data and place it in Data/Intermediate/census_data.
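
The parallel structure means the data folders can be derived from the location of the cleaning code. A minimal Python sketch of that mapping follows. The function name `data_folders_for` is hypothetical, and `~/ProjectName` stands in for the real Dropbox project folder.

```python
from pathlib import PurePosixPath


def data_folders_for(cleaning_dir: str, dropbox_root: str) -> dict[str, PurePosixPath]:
    """Map a folder under code/cleaning to its parallel Raw and Intermediate folders."""
    parts = PurePosixPath(cleaning_dir).parts
    if len(parts) < 3 or parts[:2] != ("code", "cleaning"):
        raise ValueError(f"expected a subfolder of code/cleaning, got {cleaning_dir!r}")
    theme = PurePosixPath(*parts[2:])  # e.g. census_data
    data = PurePosixPath(dropbox_root) / "Data"
    return {
        "raw": data / "Raw" / theme,
        "intermediate": data / "Intermediate" / theme,
    }
```

With this sketch, `data_folders_for("code/cleaning/census_data", "~/ProjectName")` gives the Raw and Intermediate census_data folders under `~/ProjectName/Data`.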

Organization

  • File names should be informative. For example, a file with code producing summary statistic tables can be called summary_stats.do, and a separate file containing diff-in-diff analysis can be called dind_exhibits.do.
  • Don't use file names like analysis_v2.NEW.do, as it is easy to get confused.
    • We also want to be able to trace dependencies and keep folders clean of unused files.
  • It is good practice to regularly review code and delete unused files. If a deleted file turns out to be important later, it can always be recovered from GitHub.
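
Version-style file names can be flagged with a simple heuristic. The Python sketch below is illustrative only. The list of markers (v2, new, old, final, copy, backup) is an assumption rather than a project rule, the function name is hypothetical, and the check may occasionally flag a legitimate name.

```python
import re

# Assumed markers of ad hoc versioning; each must follow a separator such as "_" or ".".
VERSION_MARKER = re.compile(
    r"[._\-\s](?:v\d+|new|old|final|copy|backup)(?=$|[._\-\s])",
    re.IGNORECASE,
)


def has_version_marker(filename: str) -> bool:
    """Return True if the name (without its extension) contains a version-style suffix."""
    stem = filename.rsplit(".", 1)[0]  # drop the final extension, e.g. ".do"
    return VERSION_MARKER.search(stem) is not None
```

Here `analysis_v2.NEW.do` is flagged, while `summary_stats.do` and `dind_exhibits.do` are not.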