Coding Principles

Fully automated code

  • The entire project, from cleaning the raw data to compiling the paper PDF, should run from a single command, typically code/master.do. This master script calls the other project files, e.g. Stata, R, Matlab, and Python scripts.
  • Do not manually pre-process data, e.g. manipulate Excel sheets, before importing into R or Stata. All data processing, beginning with the original file, should be automated and called from master.do.
    • All programs required for analysis should be installed in master.do
    • To keep code/master.do from getting too long and unwieldy, master.do may instead call sub-batch files that group together calls to functionally related Stata, R, etc. code (e.g., master_analysis.do, master_cleaning.do; see the sketch after this list).
  • This includes formatting tables and figures. You should never manually edit .tex files to make output display correctly. Our in-house table-creation ado-files (balanceTable, the MultiPartTab suite, and tabnoteWidth) can handle the vast majority of tasks. Extend those programs where necessary, or write new ones.
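For concreteness, here is a minimal sketch of what such a master.do could look like; the file locations and the package list are illustrative placeholders, not the project's actual ones.

* master.do: runs the entire project, from raw data to final tables and figures
clear all

* set user-specific paths ($REPO, $PROJ, etc.)
do "code/pathnames.do"

* install user-written commands the project relies on (illustrative list)
foreach pkg in reghdfe estout {
  cap which `pkg'
  if _rc ssc install `pkg'
}

* sub-batch files grouping together functionally related code
do "code/master_cleaning.do"    // clean raw data into intermediate datasets
do "code/master_analysis.do"    // run analyses, export tables and figures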

Interoperability

  • Code should be system agnostic, in that once user-specific directories are identified in pathnames.do, the code can be run from any team member's machine without having to manually change the program's working directory.
    • File paths should always use forward slashes (/) rather than backslashes (\) to avoid problems on non-Windows OS's.
  • Code should also be version agnostic, so that team members (and future reproducers) get the same results no matter what version of software they are using.
  • Whenever using an algorithm that requires random number generation (e.g., Monte Carlo simulation or bootstrapping), set the seed so that the results do not change every time the code is run.
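The last two points translate into a line of Stata each; the version number and seed below are arbitrary placeholders, and wages is just an illustrative variable.

version 17          // interpret the code under a fixed Stata version
set seed 20240101   // fix the random-number seed before any simulation or bootstrap

bootstrap r(mean), reps(500): summarize wages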

Globals vs. Locals

All pathnames should be specified as global macros.

if c(username)=="me" { 
  global REPO "/Users/me/Documents/GitHub/cahoots"
  global PROJ "/Users/me/Dropbox/cahoots" 
}

global INTDATA "$PROJ/data/intermediate"
use "$INTDATA/education.dta", clear

Options that are common across analyses, such as the set of fixed effects or the clustering level for standard errors, can also be set as globals.

global se_dind = "cluster(inci_id)"
global fe_dind = "call_month call_city other_changes"

reghdfe y inst, absorb(${fe_dind}) ${se_dind}	

Everything else that is only used within a single do-file (e.g., variable lists) should be specified as a local.

local outcomes_pre "pre_wages pre_days pre_conv"
local outcomes_post "post_wages post_any_wages"

Programs vs. ado-files

We encourage the use of helper programs. For example, suppose that you are converting dates stored as strings into Stata date format. You might begin by looping over a list of date variables like this:

local datelist "date1 date2 date3"
foreach d of local datelist {
  g junk = date(`d',"MDY")
  drop `d'
  rename junk `d'
  format `d' %d
}

Later on in the same file you might want to convert another list of dates. Instead of copying the same code, it is better to define a program at the first point you need it, then call it wherever else it comes up:

cap prog drop replaceDate
prog def replaceDate
  syntax varlist, datetype(string)
	
  foreach var of varlist `varlist' {
    tempvar junk 
    g `junk' = date(`var',"`datetype'")
    drop `var'
    rename `junk' `var'
    format `var' %d
  }
end

replaceDate date1 date2 date3, datetype("MDY")
...
replaceDate date4 date5 date6, datetype("DMY")

This approach has several advantages:

  • First, it makes the code much easier to read by reducing the number of lines in which we are actually manipulating the data.
  • Second, it is more robust. It uses the tempvar junk rather than a hardcoded variable name, so the code would not fail even if a variable called junk somehow ended up in the dataset.
  • Third, it is more flexible. The first set of dates was in "MDY" format, while the second was in "DMY" format; because we can build options into programs, one program handles both, which gives cleaner, easier-to-debug code.

Important: It's ok to create inline helper programs like the one above. But if a program might be used in multiple do-files, save it as an .ado file in the ado folder rather than duplicating it.
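As a sketch of how that works (assuming the project's ado folder sits at $REPO/ado, which is a guess at the layout): save the program definition above, without the cap prog drop line, as replaceDate.ado in that folder, and point Stata to it once in master.do:

* in master.do, after pathnames.do has defined $REPO
adopath + "$REPO/ado"    // Stata will now find replaceDate.ado automatically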

Sorts

Most mistakes come from incorrect merges or sorts. The most important principle in trying to avoid these mistakes is to build your code so that it breaks as soon as the data doesn't look like you thought it did when you wrote the code. Specifically:

  • Often, we sort the data and then conduct an operation that depends on the order of observations. Always assert that the sort is unique before doing so.

For example, suppose that we had a dataset in which each row contains a defendant id pid, a crime id crime_id, and the date of the alleged crime date_crime. We would like to calculate, for each date, the number of crimes the defendant has been accused of so far. One way to calculate this would be

sort pid date_crime
by pid: g n_prev_crime = _n - 1

However, this would be wrong if pid date_crime were not unique: crimes occurring on the same day would get different measures of previous criminality, depending on the arbitrary order within the tie. Inserting an isid pid date_crime before the sort and variable creation would have averted this mistake.
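Putting the pieces together, the safer version breaks immediately if the key is not unique:

* verify that the data are unique at the defendant-date level before relying on the sort
isid pid date_crime
sort pid date_crime
by pid: g n_prev_crime = _n - 1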

Merges

  • Always specify the type of merge (1:1, m:1, or 1:m). Failing to specify the merge type calls the old, non-robust version of merge. With the type specified, merge alerts you when you are not merging on a unique id where you expected one, allowing you to fix the issue.

  • Whenever possible, use the assert() option of merge. For example, suppose that you are merging defendant age into a file of defendants. We might expect to have a measure of age for every defendant. If so, we can specify assert(2 3), indicating that no defendant should be left unmatched (i.e., missing an age measure).

  • Use the nogen option except when you plan to explicitly use the _merge variable later. You should never save a dataset that has _merge in it; if you need this variable later, give it a more informative name.

  • For some merges, you may want to fill in 0s for observations that do not match, when failing to match genuinely means zero (e.g., no employment record means no wages), as in the last merge below.

* every defendant must have a sex record
merge 1:1 pid using "$INTDATA/sex", assert(3) nogen

* some defendants may be missing an age record; flag them
merge 1:1 pid using "$INTDATA/ages", assert(1 3)
g missing_age = _merge==1
drop _merge

* defendants with no employment record earned no wages
merge 1:1 pid using "$INTDATA/employ", assert(1 3)
replace any_wages = 0 if _merge==1
drop _merge
  • Always include the keep() option to indicate which observations are to be kept from the merged data set.

  • Whenever possible, include the keepusing() option and enumerate explicitly which variables you want to add to the dataset.

merge m:1 year using "$AUXIL/inflation", assert(2 3) keep(3) keepusing(cpi2010) nogen

Assert

  • Use assert liberally to ensure that the data looks the way you expect it to.
  • assert is particularly useful after merges.
  • To pick up from the last example, suppose that we knew we had a measure of age for all defendants except those from court 1, which did not record age. Then, instead of using assert(2 3) during the merge, we could instead assert the following statement after the merge.
assert _merge!=1 | court==1

Keys

  • Each dataset should have a valid (unique, non-missing) key. For example, you might have a dataset of US county characteristics, e.g. square miles and 1969 population, with one row for each county and stateFIPS + countyFIPS as the key.
  • Keep datasets normalized (meaning that they contain only variables at the same logical level as the key) for as long as possible in the data preparation process. Once you merge a state-level dataset with a county-level dataset, each state-level variable is recorded many times (once for each county in the state). This takes a lot of space and can also confuse other aspects of data preparation.
  • Similar to assert, use isid liberally to check that a dataset is unique at the level you want it to be.
  • isid is especially useful before saving datasets and before joining datasets. For example, if you wanted to join an individual-level dataset with a convictions dataset, you would check isid pid before the join (as sketched below).
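A sketch of that check, using the convictions file that appears later on this page:

* the master dataset must be unique on pid before joining in convictions
isid pid
joinby pid using "$INTDATA/convictions"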

Loops

  • Whenever you are repeating an operation often, use a loop. You should become familiar with Stata's loop syntax.
    • foreach x in ...
    • foreach var of varlist ...
    • forvalues y = 2(2)20 ...
  • If you want to loop over the values of a variable, you can use levelsof to generate a local.
levelsof location, local(cities)
foreach c of local cities {
  twoway scatter wages age if location=="`c'"
}

Tempfiles

Tempfiles allow you to manipulate data without overwriting saved files and creating clutter in the data directory. They are a better alternative to preserve/restore when you need to work with more than one dataset.

  • You often need to import auxiliary helper files. To work with them (e.g., to do merges or joins), they have to be saved as Stata datasets.
  • Sometimes you only want to work with a subsample of the data (say, only individuals under a certain age) without creating a new dataset on disk, so you impose the restriction and save the result as a tempfile.
  • For large datasets or computationally intensive operations (e.g., fuzzy matching), it is better to work in chunks (e.g., year by year). You can write a loop that iterates the operation over each chunk, saves the output as a tempfile, and then appends the tempfiles together.
  • Here is an illustrative example:
// load yearly records
forv i = 2001(1)2008 {
  import delimited "$RAWDATA/records_`i'.csv", clear
  keep if city=="Vancouver" & !mi(block)
  g year = `i'
  tempfile y`i'
  save `y`i''
}

// append the yearly tempfiles into a single dataset
clear
forv i = 2001(1)2008 {
  append using `y`i''
}
tempfile allyears
save `allyears'
...
use "$INTDATA/convictions.dta", clear
merge m:1 block year using `allyears', keep(1 3) nogen

Creating new variables

  • Always check for missing values.
  • When generating dummy variables, be careful about how you treat missingness. If the underlying variable has missing values, decide whether the dummy should be 0 when the variable is missing, or missing for that row as well (see the sketch below).
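A minimal sketch of the two choices, using a hypothetical income variable:

* dummy is missing whenever income is missing
g high_income = income > 50000 if !mi(income)

* dummy is 0 whenever income is missing
g high_income0 = (income > 50000) & !mi(income)

* note: a bare g high_income = income > 50000 would code missing incomes as 1,
* because Stata treats missing values as larger than any number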