Folder Organization - norrissam/ra-manual GitHub Wiki
We store project files in both a GitHub repo and a Dropbox folder. All code and code-like inputs (see below), as well as exhibit outputs, should live in the GitHub repo, where they can be version controlled. All data should live in Dropbox, which can accommodate large data files.
GitHub repository
The GitHub repo has two main folders: code and exhibits.
- /code contains master.do and pathnames.do. master.do should run the entire project, including data cleaning, exhibit creation, and compilation.
- /code/cleaning contains all data prep code.
- /code/analysis contains all code to construct tables and figures.
- /code/ado contains project-specific ado files.
- /exhibits contains a .tex file of the currently relevant exhibits and the compiled PDF (do not save auxiliary LaTeX files such as .aux to GitHub).
- /exhibits contains /tab, /fig, and /facts subfolders for tables, figures, and data-generated facts (e.g., statistics or coefficients). Each can contain subfolders by theme (e.g., placebo for placebo checks).
Each project will have a GitHub wiki to record major decisions (information to be referenced in the paper) and minor decisions (unlikely to be referenced in the paper). We will also record there any references to material outside the project that may eventually be useful for the paper.
The repo may also include a notes folder for plain text notes that we would like to be versioned.
GitHub likes to keep total storage low and will not allow users to push files larger than 100 MB. If a code or output file is larger than about 10 MB, we should store it on Dropbox instead.
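The size guideline above can be checked before committing. The following is a minimal sketch, assuming a local clone of the repo; the function name `find_large_files` and the exact byte cutoff used for "about 10 MB" are illustrative choices, not part of the manual's tooling.

```python
from pathlib import Path

# "About 10 MB" from the guideline above; the exact cutoff is an assumption.
SIZE_LIMIT_BYTES = 10 * 1000 * 1000


def find_large_files(repo_root, limit=SIZE_LIMIT_BYTES):
    """Return (relative path, size in bytes) for files larger than `limit`.

    Anything inside the .git folder is skipped.
    """
    root = Path(repo_root)
    large = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue
        size = path.stat().st_size
        if size > limit:
            large.append((rel.as_posix(), size))
    return large
```

Calling `find_large_files(".")` from the repo root lists candidates to move to Dropbox; an empty list means nothing exceeds the threshold.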
Dropbox
Dropbox stores all data files as well as other project documents.
- ~/ProjectName/Data/Raw stores all original data files that are used in the analysis.
  - The only manipulations that we will generally make within the /Data/Raw folder are: (1) unzipping; and (2) importing into formats readable by Stata, R, etc. When such manipulations are necessary, we should use the following subfolder structure
  - All folders containing raw data should have a plain text README file that details the data source and any additional information necessary for replication.
- ~/ProjectName/Data/Intermediate stores all intermediate data files, i.e., files created by the code in the GitHub code folder.
- ~/ProjectName/Data/Clean stores all data files used to create exhibits.
- ~/ProjectName/Admin stores time sheets, data use agreements, etc.
- ~/ProjectName/Literature stores all relevant literature, typically in PDF format. Please use the following format to name files: Authors_ShortTitle_Year.pdf, and arrange them in subfolders when useful.
- ~/ProjectName/Notes stores any comments or notes that we want to share with each other that are too long or complicated for a post on Slack (e.g., a LaTeX file and associated PDF that works through a proof). Very long notes in plain text format (e.g., long LaTeX proofs) could instead go in a notes folder within the GitHub repo.
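The Literature naming format above (Authors_ShortTitle_Year.pdf) can be checked mechanically. This is a minimal sketch under stated assumptions: the name has exactly three underscore-separated parts, the year has four digits, and neither the author nor the title part contains an underscore. The function name `parse_literature_name` and the example file names in the usage note are hypothetical.

```python
import re

# Assumed reading of "Authors_ShortTitle_Year.pdf":
# three parts separated by underscores, ending in a four-digit year.
LIT_NAME = re.compile(
    r"^(?P<authors>[^_]+)_(?P<title>[^_]+)_(?P<year>\d{4})\.pdf$"
)


def parse_literature_name(filename):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = LIT_NAME.match(filename)
    return match.groupdict() if match else None
```

For example, `parse_literature_name("SmithJones_MinimumWage_2019.pdf")` returns the three parts as a dictionary, while a name like `paper_final.pdf` returns `None` and can be flagged for renaming.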
Code structure
- In general, the code in code/cleaning will operate on data from Data/Raw and place it in Data/Intermediate, while constructed final data will be placed in Data/Clean.
- Then code in code/analysis will take data from Data/Clean and construct figures and tables to be placed in /exhibits.
- To the extent that is reasonable, the cleaning code, analysis code, and data folders should have parallel structures. Naming should be self-explanatory whenever possible. For example, Data/Raw and Data/Intermediate might both have a subfolder called censusData. Code in code/cleaning/censusData would clean the raw census data in Data/Raw/censusData and place it in Data/Intermediate/censusData.
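The parallel structure described above can be written down as a simple mapping from a theme name to its code, input, and output folders. This is an illustrative sketch only: the function name `cleaning_paths` and the default `project_root` value are assumptions, and in practice the paths would be set in pathnames.do.

```python
from pathlib import PurePosixPath


def cleaning_paths(theme, project_root="~/ProjectName"):
    """Map a theme subfolder (e.g. "censusData") to its parallel folders.

    `project_root` is the Dropbox project folder; the default is a placeholder.
    """
    data = PurePosixPath(project_root) / "Data"
    return {
        "code": str(PurePosixPath("code/cleaning") / theme),
        "raw": str(data / "Raw" / theme),
        "intermediate": str(data / "Intermediate" / theme),
    }
```

With this convention, knowing the theme name is enough to locate the cleaning code, its raw inputs, and its intermediate outputs.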
Organization
- File names should be informative. For example, a file with code producing summary statistic tables can be called summary_stats.do, and a separate file containing diff-in-diff analysis can be called dind_exhibits.do.
- Don't name files analysis_v2.NEW.do, as it is easy to get confused.
- We also want to be able to trace dependencies and keep folders clean of unused files.
 
- It is good practice to regularly review code and delete unused files. If a deleted file turns out to be important later, it can always be recovered from GitHub.
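One rough way to spot candidates for deletion is to list .do files whose names never appear in any other .do file. The sketch below is a heuristic under stated assumptions: scripts are called by file name (with or without the .do extension) from another .do file, and a plain substring match is good enough. The function name `unreferenced_do_files` is hypothetical, and any file it flags should be reviewed by hand before it is removed.

```python
from pathlib import Path


def unreferenced_do_files(code_dir, entry_point="master.do"):
    """List .do files under `code_dir` that no other .do file mentions by name.

    Matching is a crude substring test on the file stem, so short or
    overlapping names can hide an unused file.
    """
    root = Path(code_dir)
    do_files = sorted(root.rglob("*.do"))
    texts = {p: p.read_text(errors="ignore") for p in do_files}
    unused = []
    for path in do_files:
        if path.name == entry_point:
            continue  # the entry point is not expected to be called by others
        mentioned = any(
            path.stem in text for other, text in texts.items() if other != path
        )
        if not mentioned:
            unused.append(path.relative_to(root).as_posix())
    return unused
```

Running `unreferenced_do_files("code")` from the repo root gives a starting list for the periodic review.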