Coding Style - norrissam/ra-manual GitHub Wiki

Structure

Do-files should be organized into blocks by the purpose of the code.

*****************************
* Prepare data
*****************************
* Format X variables
... 
* Format Y variables
... 
*****************************
* Run regressions
*****************************
... 
*****************************
* Output tables
*****************************
...

If you include a comment as a header like this for one major block of code, you should include a similar header for every block of code at the same logical place in the hierarchy. This is a case where redundant comments are allowed. The comments are not there to provide information, but to make the code easy to scan.

Commenting

Use comments to describe what you are doing, either on the line right above the code block or at the end of the line using //.
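
As a minimal sketch of the two comment placements, the example below uses Stata's built-in auto dataset; the variable price_k is a hypothetical name chosen for illustration. Note that // comments only work in do-files and ado-files, and an end-of-line // must be preceded by at least one space.

```stata
sysuse auto, clear

* Keep only domestic cars (comment placed right above the code)
keep if foreign==0

g price_k = price/1000  // price in thousands of dollars (hypothetical variable)
```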

Indenting

We indent with tabs, and every member of the team must set their tab width to 2 spaces (this can usually be changed in the preferences menu of your text editor). Use indenting to make the structure of the code visible and the code easier to read. For example,

keep if (state==5 & year>1980) | ///
        (state==6 & year>1990)

is much easier to read than if the entire expression were on one line. Indent the contents of a preserve/restore block in the same way:

preserve
  keep if state==5
  collapse gdp, by(year)
  tempfile gdp5
  save `gdp5'
restore

Equalities and logical expressions

Put a space on each side of the & and | operators, but no spaces around equality and inequality operators. For example:

keep if state==5 & age>=18

Loops

The body of a loop should be indented by one tab (i.e. two spaces). The closing brace should sit at the same indentation level as the line that opened the loop, like this:

forv j = 1(1)10 {
  ...
}
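
The same rule extends to nested loops, where each level is indented one further tab. The sketch below assumes Stata's built-in auto dataset, and the generated variable names are hypothetical:

```stata
sysuse auto, clear

* Outer loop over variables, inner loop over powers
foreach v of varlist price mpg weight {
  forv j = 1(1)2 {
    g `v'_pow`j' = `v'^`j'
  }
}
```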

Programs

We encourage the use of helper programs. For example, suppose that you were converting dates from strings into Stata format dates. You might begin by looping over a list of dates like this:

local datelist "date1 date2 date3"
foreach d of local datelist {
  g junk = date(`d',"MDY")
  drop `d'
  rename junk `d'
  format `d' %d
}

Later on in the same file you might want to convert another list of dates. Instead of copying the same code, it is better to define a helper program at the first point it is needed, then call it whenever necessary:

cap prog drop replaceDate
prog def replaceDate
  syntax varlist, datetype(string)
	
  foreach var of varlist `varlist' {
    tempvar junk 
    g `junk' = date(`var',"`datetype'")
    drop `var'
    rename `junk' `var'
    format `var' %d
  }
end

replaceDate date1 date2 date3, datetype("MDY")

...

replaceDate date4 date5 date6, datetype("DMY")

This approach has two advantages. First, it makes the code much easier to read by reducing the number of lines of code where we are actually manipulating the data. Second, it is more robust. It uses the tempvar junk rather than a hardcoded variable, which means that if a variable called junk somehow ended up in the dataset, the code would not fail.

In addition, the first set of dates was in "MDY" format, while the second was in "DMY" format. Because the program takes the date format as an option, the same code handles both cases, which keeps the do-file cleaner and easier to debug.

Important: It's ok to create inline helper programs like the above. But if a program might be used in multiple do-files, save it as an .ado file in the ado folder rather than duplicating it.

Merging

Always specify the type of merge (1:1, m:1, or 1:m). Failing to specify the merge type calls the old, non-robust version of merge.
Never do many to many (m:m) merges, or at least, only do them when you have a very good reason.
Always include the assert() option to indicate what pattern of matched observations you expect.
Always include the keep() option to indicate which observations are to be kept from the merged data set.
Whenever possible, include the keepusing() option and enumerate explicitly what variables you intend to be adding to the dataset; you can include this option even when you are keeping all the variables from the using data.
Use the nogen option except when you plan to explicitly use the _merge variable later. You should never save a dataset that has _merge in it; if you need this variable later, give it a more informative name.
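
Taken together, these rules might look like the following sketch. It assumes Stata's built-in auto dataset as the master data and a small hypothetical using dataset (with the made-up variable origin_name) keyed on foreign; it illustrates the options and is not code from any project.

```stata
* Hypothetical using data: one row per value of foreign
clear
input byte foreign str8 origin_name
0 "Domestic"
1 "Foreign"
end
tempfile origin
save `origin'

* Many cars per origin, so this is an m:1 merge.
* assert(3): every observation must match
* keep(3): keep only matched observations
* keepusing(): name the variables being added
* nogen: do not create _merge, since it is not used later
sysuse auto, clear
merge m:1 foreign using `origin', assert(3) keep(3) keepusing(origin_name) nogen
```

If an observation fails to match, assert(3) stops the do-file with an error instead of silently dropping or keeping unmatched rows.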

Auxiliary files for concordances

It is sometimes better not to hardcode concordances. For example, suppose that we wanted to use FIPS codes to refer to states, but only had string state names in the data. One approach would be to write

g fips = .
replace fips = 1 if state=="Alabama" 
...
replace fips = 56 if state=="Wyoming"

but this takes about 50 lines of hard-to-read, hard-to-debug code. A much better approach is to make an auxiliary file that stores this information in an easy-to-read table format. We store these files in the git repo in /auxil, where they are version controlled. For example, we might create /auxil/concord_fips_state.csv, which would have two columns, fips and state. Then a solution in a few lines is:

insheet using /auxil/concord_fips_state.csv
keep state fips
tempfile fips
save `fips'

use data.dta
merge m:1 state using `fips', assert(2 3) keep(3) nogen

Note that this approach accommodates multiple state name keys in the concord_fips_state.csv, for example "Alabama" and "ALABAMA" could both point to fips code 1.
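
A minimal sketch of the multiple-key case follows. The concordance and master data are typed in with input purely for illustration (in practice the concordance would come from the CSV file), and the year variable is hypothetical. The isid check confirms that each spelling appears only once, which is what makes an m:1 merge on state valid.

```stata
* Hypothetical concordance with two spellings of the same state
clear
input str10 state byte fips
"Alabama" 1
"ALABAMA" 1
"Wyoming" 56
end
isid state  // each spelling appears once, so m:1 on state is valid
tempfile fips
save `fips'

* Hypothetical master data with inconsistent spellings
clear
input str10 state int year
"Alabama" 2001
"ALABAMA" 2002
"Wyoming" 2001
end
merge m:1 state using `fips', assert(2 3) keep(3) nogen
```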

Breaking up analysis code into sections

In long files that contain multiple parts that do not all need to be run (primarily analysis files), it is important to break up the code into sections based on function. For example, one section of code might create the tables for one outcome, while another section creates the figures for a second outcome. Whether a particular section is run is governed by a local variable. All of these locals are defined together at the top of the do-file, so that one block of definitions determines which sections are run.

The locals should be named such that it is clear what section they govern. If that section is not to be run, the value of the local should be set to 11. If it is to be run, the value should be set to 1. There should always be a local called "RunAll" that governs whether all the sections of code are run, regardless of how their corresponding local is defined. This makes it much easier to run through the full set of analysis code when needed, without changing each individual local.

For example, the top of the do-file might have a section that looks like the following:

     local ChildTableNiceOLSIV             11
     local PregnancyEffectsNiceOLSIV       1
     local JuviCrimeEffectsCuya            11
     local RunAll                          11

where the section of the code labelled "PregnancyEffectsNiceOLSIV" will be run when the code file runs, while the other sections will not.

Then the start of each section of code will look like the following:

  if `JuviCrimeEffectsCuya'==1 | `RunAll'==1 {
          ....
  }

This condition makes the section run only if its own local, or RunAll, was set to 1 at the top of the do-file.
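
Putting the pieces together, a complete do-file skeleton might look like the sketch below. The section names SummaryStats and PriceRegs are hypothetical, and Stata's built-in auto dataset stands in for real analysis data.

```stata
* Section switches: 1 = run, 11 = do not run
local SummaryStats   11
local PriceRegs      1
local RunAll         11

sysuse auto, clear

*****************************
* Summary statistics
*****************************
if `SummaryStats'==1 | `RunAll'==1 {
  summarize price mpg
}

*****************************
* Price regressions
*****************************
if `PriceRegs'==1 | `RunAll'==1 {
  regress price mpg weight
}
```

With these settings only the regression section runs. Setting RunAll to 1 would run both sections without touching the individual switches.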

Miscellaneous

Always use forward slashes (/) in file paths. Stata allows backslashes (\) in file paths on Windows, but these cause trouble on other operating systems.
Use preserve/restore sparingly. Excessive use can make code hard to follow; explicitly saving and opening data files is often better. Using Stata's tempfile command for this is encouraged.
When the number of variables needed from a dataset is not too large, list the variables explicitly in the use command.
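
The last two points can be sketched as follows, assuming Stata's built-in auto dataset in place of real project data; the variable mean_price is a hypothetical name. Instead of wrapping the collapse in preserve/restore, the intermediate result is saved to a tempfile, and the full data is then reloaded with an explicit variable list.

```stata
sysuse auto, clear
tempfile full
save `full'

* Build the collapsed data and save it explicitly
collapse (mean) price, by(foreign)
rename price mean_price
tempfile means
save `means'

* Reload only the variables needed from the full data
use make price foreign using `full', clear
merge m:1 foreign using `means', assert(3) keep(3) keepusing(mean_price) nogen
```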