Stata Style Guide - norrissam/ra-manual GitHub Wiki
Do-file structure
- In long do-files that contain multiple parts that do not all need to be run (primarily analysis files), it is important to break up the code into sections based on function. One section of code might create tables that produce output for one of the outcomes, while another section creates figures for a second outcome.
- Whether a particular section is run should be governed by a local variable that is set at the top of the do-file, so that all of the switches sit together in one place.
- The locals should be named such that it is clear what section they govern.
- If that section is not to be run, the value of the local should be set to 0. If it is to be run, the value should be set to 1.
- There should always be a local called "RunAll" that governs whether all the sections of code are run, regardless of how their corresponding local is defined. This makes it much easier to run through the full set of analysis code when needed, without changing each individual local.
For example, the top of the do-file might have a section that looks like the following:
local ChildTableNiceOLSIV 0
local PregnancyEffectsNiceOLSIV 1
local JuviCrimeEffectsCuya 0
local RunAll 0
where the section of the code labelled PregnancyEffectsNiceOLSIV will be run when the code file runs, while the other sections will not.
Then the start of each section of code will look like the following, and it will be run or not depending on how the local was defined at the top:
if `JuviCrimeEffectsCuya'==1 | `RunAll'==1 {
  ...
}
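Putting the pieces above together, a minimal sketch of a complete do-file might look like the following. The section names (SummaryStats, PriceRegression) and the use of Stata's bundled auto dataset are illustrative assumptions, not names from any project.

```stata
* Section switches (1 = run, 0 = skip); names are hypothetical
local SummaryStats    1
local PriceRegression 0
local RunAll          0

sysuse auto, clear

* Runs because SummaryStats is set to 1
if `SummaryStats'==1 | `RunAll'==1 {
  summarize price mpg weight
}

* Skipped unless PriceRegression or RunAll is set to 1
if `PriceRegression'==1 | `RunAll'==1 {
  regress price mpg weight
}
```

Because locals only exist within a single run, the switches and the sections they govern need to be executed together (for example by running the whole do-file) rather than by highlighting one section on its own.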
Sometimes do-files will not have parts that need to be turned on or off. In that case, they can also be organized into blocks by the purpose of the code.
*****************************
* Prepare data
*****************************
* Format X variables
...
* Format Y variables
...
*****************************
* Run regressions
*****************************
...
*****************************
* Output tables
*****************************
...
If you include a comment as a header like this for one major block of code, you should include a similar header for every block of code at the same logical place in the hierarchy. This is a case where redundant comments are allowed. The comments are not there to provide information, but to make the code easy to scan.
Commenting
Comments should succinctly describe what the code that follows is doing. Place them either right above the code block or at the end of the line using //.
Spacing can make it clear which code a comment pertains to:
// calculate time to next case
bys id (date_filed): g time_to_next = date_filed[_n+1] - date_filed if id==id[_n+1]
label var time_to_next "Time to next case"

// make indicator for next case within x days
foreach j in 180 365 {
  g next_case_within`j' = time_to_next<=`j'
  label var next_case_within`j' "Next case within `j' days"
}
Indentation
- Stata should be set up to have two spaces per tab. This ensures that do-files will align in the same way when loaded by different authors.
- Indent one tab after an opening brace and after preserve. This improves readability by making it clear at what level an operation is happening. For example:
eststo clear
foreach y in wages employment {
  forv t = 1(1)4 {
    eststo: reg `y'_y`t' incar
  }
}
Spacing
Spacing should be used to improve readability when you are doing similar operations. For example:
label var n_prev_cases "Number of previous cases"
label var fel_case     "Case includes felony charges"
is much clearer than
label var n_prev_cases "Number of previous cases"
label var fel_case "Case includes felony charges"
Variable names and labels
- Variable names should be short but informative. Use underscores rather than camelCase.
- Use variable labels to clarify variable names if necessary. To the extent possible, variables should be labelled in cleaning files rather than analysis files.
Assigning values and logical operators
Use " = " (with spaces) for defining variables, and "==" (without spaces) for logical comparisons. This improves readability by making it immediately clear whether an expression is an assignment or part of a logical condition.
For example:
g black = race=="B" if race!=""
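The if race!="" qualifier in the line above matters because a logical expression evaluates to 0, not missing, when race is empty. A small sketch on made-up data (the three observations and the variable black_all are hypothetical) shows the difference:

```stata
clear
set obs 3
g race = "B" in 1
replace race = "W" in 2
* observation 3 is left as the empty string

* With the qualifier: 1, 0, and missing for the empty race
g black = race=="B" if race!=""

* Without the qualifier: 1, 0, and 0, so unknown race is coded as not black
g black_all = race=="B"

list
```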
Logical operations
- Put a space around the & and | symbols, but not around equalities and inequalities. For example:
keep if state==5 & age>=18
- Long logical conditions (three or more conditions in a single if) should be split across multiple lines.
- Operators should go before the line break, and the logical conditions should be aligned with spaces. For example,
keep if reason_desc=="SHOCK INCARC/YOA BOOT" | ///
        reason_desc=="YOA TERM SATISFIED" | ///
        status_desc=="YOA PAROLE - CONDITIONAL" | ///
        status_desc=="INTENSIVE SUPERVISION"
is much easier to read than if it were all on one line. Even better is:
keep if inlist(reason_desc,"SHOCK INCARC/YOA BOOT","YOA TERM SATISFIED") | ///
        inlist(status_desc,"YOA PAROLE - CONDITIONAL","INTENSIVE SUPERVISION")
which simplifies and streamlines the logical operation. In general, inlist should be used instead of multiple or conditions chained together when possible.
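As a quick check that the two forms select the same observations, the sketch below builds both versions of an indicator and asserts that they agree. It uses Stata's bundled auto dataset and its rep78 variable purely as a stand-in; the variable names chained and listed are hypothetical.

```stata
sysuse auto, clear

* Chained | conditions
g byte chained = rep78==3 | rep78==4 | rep78==5

* Equivalent inlist() call
g byte listed = inlist(rep78,3,4,5)

* Stops with an error if the two definitions ever differ
assert chained==listed
```

Note that inlist() caps the number of arguments it accepts, with a lower cap for strings than for numbers (see help inlist), so very long string lists may need to be split across several inlist() calls joined with |.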
Loops
The body of a loop should be indented by one tab (i.e. two spaces). Indentation is especially important for nested loops. The closing brace should sit at the same indentation level as the line that opened the loop, like this:
foreach y in wages taxes {
  forv q = 1(1)12 {
    g `y'_`q' = `y' if quarter==`q'
  }
}
Miscellaneous
- Always use forward slashes (/) in file paths. (Stata allows backslashes (\) in file paths on Windows, but these cause trouble on non-Windows OS’s.)
- Use preserve/restore sparingly. Excessive use can make code hard to follow; explicitly saving and opening data files is often better. Stata's command tempfile is encouraged.
- When the number of variables needed from a dataset is not too large, list the variables explicitly in the use command. This makes the code work faster and use less memory, and makes it easier to read because it's clear exactly what variables are present.
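The last two points can be sketched together as follows. This is only an illustration: it uses Stata's bundled auto dataset, and the tempfile names (full, means) and the variable mean_price are hypothetical.

```stata
sysuse auto, clear
tempfile full means
save "`full'"

* Build group means and hold them in a temporary file
* instead of wrapping the collapse in preserve/restore
collapse (mean) mean_price = price, by(foreign)
save "`means'"

* Reload only the variables that are needed, then merge the means back on
use make price foreign using "`full'", clear
merge m:1 foreign using "`means'", assert(match) nogen
```

Temporary files are deleted automatically when the do-file finishes, so nothing needs to be cleaned up afterwards. As with any local, the tempfile names only exist within a single run, so these lines should be executed together.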