Code style guide

When does my code need to meet these standards?

  • Code which is merged to master needs to meet these standards. It is OK to be sloppier for exploratory work. However, it is useful to think about sloppy code in terms of creating technical debt.

Cross language style

CodeAndData.pdf from gslab is a good starting place

Clarity and parsimony

  • Your first priority should be for your code to be clearly written.
  • Good names are an essential part of clear code. Use under_scores and not camelCase nor hyp-hens. See Chapter 7 of CodeAndData.
  • Use comments when, after having chosen the best possible names, you have additional explanation to share with a future reader or user of the code. This will happen a lot.
    • The main cost of comments is when they accidentally become out-of-date. Comments are therefore more useful when they explain what a function or code block does or why it is there, rather than discuss results which are likely to change in the future when the underlying data change.
    • Comments are required when your code has an equation. Above the line where you compute the equation, write out its algebraic counterpart. If the equation is from a book, a paper, or a website, provide a reference so a reviewer can compare the equation you implemented to the equation in prior work (an example appears after this list).
  • Conditional on being clear, parsimony is better. If you have copy-pasted twice, it is time to rewrite your code.
    • The functions and iteration chapters of the R4DS textbook are helpful for learning how to write concise code.
  • When filtering a data frame, put each filter condition on its own line rather than listing all conditions on one line (see the sketch after this list).
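
For example, a minimal sketch of the filter convention, using a hypothetical data frame df with columns x and y:

library(dplyr)

df %>%
  filter(x > 0,
         !is.na(y))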
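
And a sketch of the equation-comment rule, using a standard loan-amortization formula as a stand-in (the variable names are hypothetical):

# Monthly payment on an amortizing loan (see Wikipedia, "Amortization calculator"):
# payment = principal * r / (1 - (1 + r)^(-n))
payment <- principal * r / (1 - (1 + r)^(-n))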

Unit tests, parameters, clean execution

  • Write unit tests of functions before you start writing code on a problem. See Chapter 8 of CodeAndData.
  • Setting parameters
    • For code which is merged to main, all parameters must be set in a single CSV or script (either is fine); see the parameters sketch after this list.
    • When developing, it is OK to temporarily hard code a parameter at the top of your script.
  • Monitoring output
    • In collaboration with PIs, identify 5-10 crucial estimates.
    • Write unit tests so that when a crucial estimate changes, we catch it right away.
  • Clean code shouldn’t throw warnings. If code does throw a warning, add a comment explaining why.
  • Review output carefully. View the data frame at every stage.
    • PG addendum: the bullet above is too brief to be useful. Needs elaboration.
  • When using summarise() or mutate() you will often be tempted to use na.rm = TRUE to drop values that are NA. This is allowed but risky. When you do, add a unit test directly above that clarifies which columns have NA values and which do not, ideally with a description of why those columns have NA values (see the sketch after this list).
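
A sketch of the na.rm rule (the data frame survey and its columns are hypothetical):

library(dplyr)
library(testthat)

# income is NA for respondents who declined to answer; age is never NA
test_that("NA values appear only where expected", {
  expect_true(any(is.na(survey$income)))
  expect_false(any(is.na(survey$age)))
})

survey %>%
  summarise(mean_income = mean(income, na.rm = TRUE))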
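
And a sketch of setting parameters in a single CSV (the file name and contents are hypothetical):

# analysis/input/params.csv contains, e.g.:
# name,value
# winsor_cutoff,0.99
params <- readr::read_csv("analysis/input/params.csv")
winsor_cutoff <- params$value[params$name == "winsor_cutoff"]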

Organization

  • Separate slow code from fast code.
    • In some cases you may want to put the slow code in a separate script
    • Plotting code is always fast. If the underlying data frame is generated by slow code, save it; that way you can edit the plot later without re-running the slow code (see the sketch after this list).
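
A sketch of this split (the file names and the slow step are hypothetical):

# 01_build.R: run the slow step once and save the result
plot_df <- run_slow_model()  # hypothetical expensive computation
saveRDS(plot_df, "analysis/release/plot_df.rds")

# 02_plot.R: edit the plot freely without re-running the slow code
library(ggplot2)
plot_df <- readRDS("analysis/release/plot_df.rds")
ggplot(plot_df, aes(x = week, y = state_ic)) +
  geom_line()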

Joins

  • Explicitly specify join keys
  • Count the number of unique values of the join key in both source tables using test_that functions
  • Use anti_join() to see which values will be dropped in each table from the join. Delete this once you are satisfied that you understand why these rows will be dropped.
  • Count the number of unique values of the join key in the joined table using test_that (see the sketch after this list).
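
A sketch of this workflow (the tables df_a and df_b, the key id, and the expected count of 51 are hypothetical):

library(dplyr)
library(testthat)

# count unique join-key values in both source tables
test_that("join key counts match expectations", {
  expect_equal(n_distinct(df_a$id), 51)  # hypothetical expected count
  expect_equal(n_distinct(df_b$id), 51)
})

# see which rows each table would lose in the join; delete once understood
df_a %>% anti_join(df_b, by = "id")
df_b %>% anti_join(df_a, by = "id")

joined <- df_a %>% inner_join(df_b, by = "id")

# count unique join-key values in the joined table
test_that("no unexpected keys lost in the join", {
  expect_equal(n_distinct(joined$id), 51)
})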

Plots

  • Writing the data underlying a plot to a csv usually increases efficiency because then you can make cosmetic changes without re-running all the code.
  • By "the data" we mean the dots or bars that are plotted, not the microdata
  • We recommend that you write a csv for any plot which is merged to main, used in the main stack of a slide deck, or when it takes a long time to re-run the underlying code
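
A sketch (the paths and column names are hypothetical):

library(readr)
library(ggplot2)

# plot_df holds the dots or bars to be plotted, not the microdata
write_csv(plot_df, "analysis/release/plot_points.csv")

# later: reload and tweak cosmetics without re-running the pipeline
plot_df <- read_csv("analysis/release/plot_points.csv")
ggplot(plot_df, aes(x = week, y = share)) +
  geom_point()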

File names

File names should be meaningful and end in .R. Avoid using special characters in file names - stick with numbers, letters, -, and _.

# Good
fit_models.R
utility_functions.R
# Bad
fit models.R
foo.r
stuff.r

If files should be run in a particular order, prefix them with numbers. If it seems likely you’ll have more than 10 files, left pad with zero:

00_download.R
01_explore.R
...
09_model.R
10_visualize.R
If you later realise that you’ve missed some steps, it’s tempting to use 02a, 02b, etc. However, I think it’s generally better to bite the bullet and rename all files.

Pay attention to capitalization, since you, or some of your collaborators, might be using an operating system with a case-insensitive file system (e.g., Microsoft Windows or OS X) which can lead to problems with (case-sensitive) revision control systems. Prefer file names that are all lower case, and never have names that differ only in their capitalization.

Source: the tidyverse style guide.

Language-specific

R

General

Follow the tidyverse style guide

See this internal guide for (among other things) some tips on using purrr and on using dplyr joins.

Helpful resources

Common functions shared in gnlab

The script prelim.R, maintained in gnlab's template_repo (link here), provides a number of useful functions shared across our lab:

  • fte_theme
  • coef_label
  • test_equal_stat
  • winsor
  • save_animation

Stable paths in R

  • Code should be easy to move between computers, and between Rmd files and R scripts.
  • Use one directory/folder to store all gnlab repos. Peter usually calls this folder repo.
  • JS: I think the bullet below is out-of-date and faulty. An easier, equally stable solution is to create the repo folder in your Home directory if you're on Mac and in your Documents directory if you're on PC. Then, for example, if you're working in the rdfo repo and use the file path "~/repo/rdfo/", it will reference the correct folder on everyone's machine.
  • Use rprojroot to determine where the base directory of your repo is.
    • Your R working directory needs to be "inside" the repo to find the proper root file. In a script that means you will have a code block such as:
      if (Sys.getenv()["USER"](/ganong-noel/lab_manual/wiki/"USER") == "peterganong") {
        setwd("~/repo/strategic/")
      } else {
        setwd("~/gnlab/strategic/")
      }
      
    • Relative paths can be built off the repo root using a helper function such as:
      make_path <- rprojroot::is_git_root$make_fix_file()
      out_path <- make_path("analysis/build/structural_model_fortran/")
      
    • Throughout your code, use the variables storing these paths to refer to files.
  • Use a config.yml file to manage paths outside of the repo (more details); see the sketch below.
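
A sketch using the config package (the path and key names are hypothetical):

# config.yml at the repo root might contain:
# default:
#   data_path: "~/Dropbox/gnlab_data"

data_path <- config::get("data_path")
raw <- readr::read_csv(file.path(data_path, "raw_claims.csv"))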

Positional indexing to copy over a base rate:

  • A clear way to do this is to use the which() function with a condition that identifies the observation you're looking for.
  • An example using a csv file that is in the uieip repo:
library(tidyverse)
library(lubridate)

state_ui <- read_csv("analysis/input/ar539.csv") %>%
  transmute(state_abb = st,
            week = mdy(c2),
            state_ic = c3 + c4 + c5 + c6 + c7,
            state_cc = c8 + c9 + c10 + c11 + c12 + c13) 

state_ui %>%
  filter(year(week) >= 2020) %>%
  group_by(state_abb) %>%
  mutate(start_covid_ic = state_ic[which(week == ymd("2020-03-14"))],
         inc_ic_since_covid = state_ic - state_ic[which(week == ymd("2020-03-14"))])
  • start_covid_ic is an example of copying over the base rate to a new column.
  • inc_ic_since_covid is an example of how you can directly incorporate the base rate in your calculation and not generate an extra column.

Python

Follow the Google style guide.

  • You might also find this Hitchhiker's guide helpful.
  • HARK is written for Python 2.7. However, its developers anticipate migrating to Python 3, so where possible write your code to be Python 3 compatible.