Code style guide

When does my code need to meet these standards?

  • Code which is merged to master needs to meet these standards. It is OK to be sloppier for exploratory work. However, it is useful to think about sloppy code in terms of creating technical debt.

Cross language style

CodeAndData.pdf from gslab is a good starting place

Clarity and parsimony

  • Your first priority should be for your code to be clearly written.
  • Good names are an essential part of clear code. Use under_scores and not camelCase nor hyp-hens. See Chapter 7 of CodeAndData.
  • Use comments when, after having chosen the best possible names, you have additional explanation to share with a future reader or user of the code. This will happen a lot.
    • The main cost of comments is when they accidentally become out-of-date. Comments are therefore more useful when they explain what a function or code block does or why it is there, rather than discuss results which are likely to change in the future when the underlying data change.
    • Comments are required when your code has an equation. Above the line where you compute the equation, write out its algebraic counterpart. If the equation is from a book, a paper, or a website, provide a reference so a reviewer can compare the equation you implemented to the equation in prior work (an example appears after this list).
  • Conditional on being clear, parsimony is better. If you have copy-pasted twice, it is time to rewrite your code.
    • The functions and iteration chapters of the R4DS textbook are helpful for learning how to write concise code.
  • When filtering a data frame, put each filter condition on its own line rather than listing all conditions on one line (see the sketch after this list).
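
For example, a minimal sketch of the filter convention, using a hypothetical data frame df with columns x and y:

library(dplyr)

df %>%
  filter(x > 0,
         !is.na(y))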
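
And a sketch of the equation-comment rule, using a standard loan-amortization formula as a stand-in (the variable names are hypothetical):

# Monthly payment on an amortizing loan (see Wikipedia, "Amortization calculator"):
# payment = principal * r / (1 - (1 + r)^(-n))
payment <- principal * r / (1 - (1 + r)^(-n))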

Unit tests, parameters, clean execution

  • Write unit tests of functions before you start writing code on a problem. See Chapter 8 of CodeAndData.
  • Setting parameters
    • For code which is merged to main, all parameters must be set in a single CSV or script (either is fine); see the parameters sketch after this list.
    • When developing, it is OK to temporarily hard code a parameter at the top of your script.
  • Monitoring output
    • In collaboration with PIs, identify 5-10 crucial estimates.
    • Write unit tests so that when a crucial estimate changes, we catch it right away.
  • Clean code shouldn’t throw warnings. If code does throw a warning, add a comment explaining why.
  • Review output carefully. View the data frame at every stage.
    • PG addendum: the bullet above is too brief to be useful. Needs elaboration.
  • When using summarise() or mutate() you will often be tempted to use na.rm = TRUE to drop values that are NA. This is allowed but risky. When you do, add a unit test directly above that clarifies which columns have NA values and which do not, ideally with a description of why those columns have NA values (see the sketch after this list).
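
A sketch of the na.rm rule (the data frame survey and its columns are hypothetical):

library(dplyr)
library(testthat)

# income is NA for respondents who declined to answer; age is never NA
test_that("NA values appear only where expected", {
  expect_true(any(is.na(survey$income)))
  expect_false(any(is.na(survey$age)))
})

survey %>%
  summarise(mean_income = mean(income, na.rm = TRUE))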
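
And a sketch of setting parameters in a single CSV (the file name and contents are hypothetical):

# analysis/input/params.csv contains, e.g.:
# name,value
# winsor_cutoff,0.99
params <- readr::read_csv("analysis/input/params.csv")
winsor_cutoff <- params$value[params$name == "winsor_cutoff"]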

Organization

  • Separate slow code from fast code.
    • In some cases you may want to put the slow code in a separate script
    • Plotting code is always fast. If the underlying data frame is generated by slow code, save it; that way you can edit the plot later without re-running the slow code (see the sketch after this list).
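
A sketch of this split (the file names and the slow step are hypothetical):

# 01_build.R: run the slow step once and save the result
plot_df <- run_slow_model()  # hypothetical expensive computation
saveRDS(plot_df, "analysis/release/plot_df.rds")

# 02_plot.R: edit the plot freely without re-running the slow code
library(ggplot2)
plot_df <- readRDS("analysis/release/plot_df.rds")
ggplot(plot_df, aes(x = week, y = state_ic)) +
  geom_line()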

Joins

  • Explicitly specify join keys
  • Count the number of unique values of the join key in both source tables using test_that functions
  • Use anti_join() to see which values will be dropped in each table from the join. Delete this once you are satisfied that you understand why these rows will be dropped.
  • Count the number of unique values of the join key in the joined table using test_that (see the sketch after this list).
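
A sketch of this workflow (the tables df_a and df_b, the key id, and the expected count of 51 are hypothetical):

library(dplyr)
library(testthat)

# count unique join-key values in both source tables
test_that("join key counts match expectations", {
  expect_equal(n_distinct(df_a$id), 51)  # hypothetical expected count
  expect_equal(n_distinct(df_b$id), 51)
})

# see which rows each table would lose in the join; delete once understood
df_a %>% anti_join(df_b, by = "id")
df_b %>% anti_join(df_a, by = "id")

joined <- df_a %>% inner_join(df_b, by = "id")

# count unique join-key values in the joined table
test_that("no unexpected keys lost in the join", {
  expect_equal(n_distinct(joined$id), 51)
})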

Plots

  • Writing the data underlying a plot to a csv usually increases efficiency because then you can make cosmetic changes without re-running all the code.
  • By "the data" we mean the dots or bars that are plotted, not the microdata
  • We recommend that you write a csv for any plot which is merged to main, used in the main stack of a slide deck, or when it takes a long time to re-run the underlying code
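
A sketch (the paths and column names are hypothetical):

library(readr)
library(ggplot2)

# plot_df holds the dots or bars to be plotted, not the microdata
write_csv(plot_df, "analysis/release/plot_points.csv")

# later: reload and tweak cosmetics without re-running the pipeline
plot_df <- read_csv("analysis/release/plot_points.csv")
ggplot(plot_df, aes(x = week, y = share)) +
  geom_point()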

File names

File names should be meaningful and end in .R. Avoid using special characters in file names - stick with numbers, letters, -, and _.

# Good
fit_models.R
utility_functions.R
# Bad
fit models.R
foo.r
stuff.r

If files should be run in a particular order, prefix them with numbers. If it seems likely you’ll have more than 10 files, left pad with zero:

00_download.R
01_explore.R
...
09_model.R
10_visualize.R
If you later realise that you’ve missed some steps, it’s tempting to use 02a, 02b, etc. However, I think it’s generally better to bite the bullet and rename all files.

Pay attention to capitalization, since you, or some of your collaborators, might be using an operating system with a case-insensitive file system (e.g., Microsoft Windows or OS X) which can lead to problems with (case-sensitive) revision control systems. Prefer file names that are all lower case, and never have names that differ only in their capitalization.

Source: the tidyverse style guide.

Language-specific

R

General

Follow the tidyverse style guide

See this internal guide for (among other things) some tips on using purrr and on using dplyr joins.

Helpful resources

Common functions shared in gnlab

The script prelim.R, maintained in gnlab's template_repo (link here), provides a number of useful functions shared across our lab:

  • fte_theme
  • coef_label
  • test_equal_stat
  • winsor
  • save_animation

Stable paths in R

  • Code should be easy to move between computers, and between Rmd files and R scripts.
  • Use one directory/folder to store all gnlab repos. Peter usually calls this folder repo.
  • JS: I think the bullet below is out-of-date and faulty. An easier, equally stable solution is to create the repo folder in your Home directory if you're on Mac and in your Documents directory if you're on PC. Then, for example, if you're working in the rdfo repo and use the file path "~/repo/rdfo/", it will reference the correct folder on everyone's machine.
  • Use rprojroot to determine where the base directory of your repo is.
    • Your R working directory needs to be "inside" the repo to find the proper root file. In a script that means you will have a code block such as:
      if (Sys.getenv()["USER"](/ganong-noel/lab_manual/wiki/"USER") == "peterganong") {
        setwd("~/repo/strategic/")
      } else {
        setwd("~/gnlab/strategic/")
      }
      
    • Relative paths can be built off the repo root using a helper function such as:
      make_path <- rprojroot::is_git_root$make_fix_file()
      out_path <- make_path("analysis/build/structural_model_fortran/")
      
    • Throughout your code, use the variables storing these paths to refer to files.
  • Use a config.yml file to manage paths outside of the repo (more details); see the sketch below.
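
A sketch using the config package (the path and key names are hypothetical):

# config.yml at the repo root might contain:
# default:
#   data_path: "~/Dropbox/gnlab_data"

data_path <- config::get("data_path")
raw <- readr::read_csv(file.path(data_path, "raw_claims.csv"))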

Positional indexing to copy over a base rate:

  • A clear way to do this is to use the which() function with a condition that identifies the observation you're looking for.
  • An example using a csv file that is in the uieip repo:
library(tidyverse)
library(lubridate)

state_ui <- read_csv("analysis/input/ar539.csv") %>%
  transmute(state_abb = st,
            week = mdy(c2),
            state_ic = c3 + c4 + c5 + c6 + c7,
            state_cc = c8 + c9 + c10 + c11 + c12 + c13) 

state_ui %>%
  filter(year(week) >= 2020) %>%
  group_by(state_abb) %>%
  mutate(start_covid_ic = state_ic[which(week == ymd("2020-03-14"))],
         inc_ic_since_covid = state_ic - state_ic[which(week == ymd("2020-03-14"))])
  • start_covid_ic is an example of copying over the base rate to a new column.
  • inc_ic_since_covid is an example of how you can directly incorporate the base rate in your calculation and not generate an extra column.

Python

Follow the Google style guide.

  • You might also find this Hitchhiker's guide helpful.
  • HARK is written for Python 2.7. However, its developers anticipate migrating to Python 3, so where possible write your code to be Python 3 compatible.