Coding Principles

We generally want to follow the Gentzkow and Shapiro code structure and data storage protocols. Also see chapters 6 and 7 of the old Gentzkow and Shapiro guide that cover abstraction and self-documentation. Several basic things to re-emphasize:

Fully automated code

  • The entire project, from initial data to compiling the paper PDF, can be run from one command, typically code/master.do. This master script calls other routines written in, e.g., Stata, R, Matlab, and Python.
    • Do not manually pre-process data, e.g. manipulate Excel sheets, before importing into R or Stata. All data processing, beginning with the original file, should be automated and called from master.do.
    • To keep code/master.do from getting too long and unwieldy, master.do may instead call sub-batch files that group together calls to functionally related Stata, R, etc. code (e.g., master_analysis.do, master_cleaning.do; see the sketch after this list).
  • This includes tables and figures: never hand-edit .tex files to get output to display correctly. Our in-house table-creation programs (balanceTable.ado, the MultiPartTab suite, and tabnoteWidth.ado) can handle the vast majority of tasks. Extend those programs where necessary, or write new ones.
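
A minimal sketch of what code/master.do might look like (the sub-file and script names here are illustrative, not prescriptive):

* code/master.do: one command runs the entire project
clear all
do master_cleaning.do          // data preparation, starting from the raw files
do master_analysis.do          // analysis, tables, and figures
shell pdflatex paper.tex       // compile the paper PDF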

Sorts

Most mistakes come from incorrect merges or sorts. The most important principle in trying to avoid these mistakes is to build your code so that it breaks as soon as the data doesn't look like you thought it did when you wrote the code. Specifically:

  • Often, we sort the data and then conduct an operation that depends on the order of observations. Always assert that the sort is unique before doing so.

For example, suppose that we had a dataset where each row contains a defendant id (pid), a crime id (crime_id), and the date of an alleged crime (date_crime). We would like to calculate a measure of the number of crimes the defendant has been accused of as of each date. One way to calculate this would be

sort pid date_crime
by pid: g n_prev_crime = _n - 1

However, this would be wrong if pid date_crime were non-unique: crimes occurring on the same day would receive different measures of previous criminality. Inserting an isid pid date_crime before the sort and variable creation would have averted this mistake, as in the sketch below.
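
Concretely, the safer version (assuming pid and date_crime are meant to uniquely identify rows):

isid pid date_crime             // breaks immediately if the key is not unique
sort pid date_crime
by pid: g n_prev_crime = _n - 1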

Merges

  • Always use the modern merge syntax (e.g., merge 1:1 id using ...) rather than the old syntax (merge id using ...). The modern command alerts you when what you thought was a 1:1 merge is not, allowing you to fix the issue.
    • Whenever possible, use merge's assert() option. For example, suppose that you are merging defendant age into a file of defendants. We might expect to have a measure of age for every defendant. If so, we can specify assert(2 3), indicating that no defendant should be missing an age measure, i.e., no observation should come from the master file only (see the example after this list).
    • Unless the _merge variable is needed post-merge, use the nogen option.
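
For example, the defendant-age merge above might look like this (the file name is illustrative):

merge 1:1 pid using "defendant_age.dta", assert(2 3) nogen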

Assert

Use assert liberally to ensure that the data looks the way you expect it to. assert is particularly useful after merges. To pick up from the last example, suppose that we knew we had a measure of age for all defendants except those from court 1, which did not record age. Then, instead of using assert(2 3) during the merge, we could instead

assert _merge!=1 | court==1

after the merge.

Keys

  • Each dataset has a valid (unique, non-missing) key. For example, you might have a dataset of US county characteristics, e.g. square miles and 1969 population, with one row for each county and the key being stateFIPS + countyFIPS (see the sketch after this list).
  • Keep datasets normalized (meaning that they contain only variables at the same logical level as the key) until as late in the data preparation process as possible. Once you merge a state-level dataset with a county-level dataset, each state-level variable is recorded many times (once for each county in the state). This wastes space and can also confuse other aspects of data preparation.
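
A minimal sketch of verifying such a key before saving (the file name is illustrative); isid errors out if the key variables are non-unique or contain missing values:

isid stateFIPS countyFIPS       // verify the key is unique and non-missing
save "county_chars.dta", replace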

Interoperability

  • Code should be system agnostic: once user-specific directories are identified at the top of the code, along with a query for the identity of the user, the code can be run from any team member's machine without manually changing the program's working directory.
    • File paths should always use forward slashes (/) rather than backslashes (\) to avoid problems on non-Windows OS's.
    • Best practice is to include a short piece of code in the /code directory that other routines can call to obtain directory paths. In Stata, this can be accomplished with the include command. Each routine then need only point to each user's /code directory rather than the full set of sub-directories, which avoids the need to modify every routine should we re-organize the directory structure later (see the sketch after this list).
  • Code should also be version agnostic, so that team members (and future reproducers) get the same results no matter what version of software they are using.
  • Whenever using an algorithm that requires random number generation (e.g., Monte Carlo simulation or bootstrapping), set the seed so that the results do not change every time the code is run.
  • Every two months, one person will be assigned to delete old and unnecessary code/exhibits in order to keep our materials manageable.
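
A minimal sketch of the directory-path pattern in Stata; the usernames and paths here are hypothetical:

* code/set_paths.do: define directory globals once, per user
if "`c(username)'" == "sam" {
    global root "C:/Users/sam/project"
}
else if "`c(username)'" == "ra1" {
    global root "/home/ra1/project"
}
global data "$root/data"
global results "$root/results"

Each routine then starts with include set_paths.do (a path relative to /code), so re-organizing the directory structure later only requires editing set_paths.do.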