# Code style guide
## When does my code need to meet these standards?
- Code which is merged to `master` needs to meet these standards. It is OK to be sloppier for exploratory work. However, it is useful to think of sloppy code as creating technical debt.
## Cross-language style

*CodeAndData.pdf* from gslab is a good starting place.
### Clarity and parsimony
- Your first priority should be for your code to be clearly written.
- Good names are an essential part of clear code. Use `under_scores`, not `camelCase` or `hyp-hens`. See Chapter 7 of *CodeAndData*.
- Use comments when, after having chosen the best possible names, you have additional explanation to share with a future reader or user of the code. This will happen a lot.
- The main cost of comments is that they can accidentally become out-of-date. Comments are therefore more useful when they explain what a function or code block does, or why it is there, rather than discuss results which are likely to change in the future when the underlying data change.
- Comments are required when your code implements an equation. Above the line where you compute the equation, write out its algebraic counterpart. If the equation is from a book, a paper, or a website, provide a reference so a reviewer can compare the equation you implemented to the equation in prior work (see the first sketch after this list).
- Conditional on being clear, parsimony is better. If you have copy-pasted twice, it is time to rewrite your code.
- When filtering a data frame, use a new line for each filter condition (e.g., `filter(x, \n y)` and not `filter(x, y)`; see the second sketch after this list).
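For the equation rule, a minimal sketch (the quantity and the formula's source are hypothetical):

```r
payment <- 100  # hypothetical annual payment
rate <- 0.05    # hypothetical discount rate

# Present value of a perpetuity: pv = payment / rate
# (standard result; in real code, cite the specific book/paper/website)
pv <- payment / rate
```

For the filter rule, a sketch assuming a hypothetical data frame with columns `age` and `income`:

```r
library(dplyr)

df <- tibble(age = c(17, 25, 40), income = c(NA, 50000, 62000))

# Good: one filter condition per line
df %>%
  filter(age >= 18,
         !is.na(income))

# Bad: all conditions crammed onto one line
df %>% filter(age >= 18, !is.na(income))
```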
### Unit tests, parameters, and clean execution
- Write unit tests of functions before you start writing code on a problem. See Chapter 8 of *CodeAndData*.
- Setting parameters
  - For code which is merged to `main`, all parameters must be set in a single CSV or script (either is fine). A sketch of the CSV approach appears after this list.
  - When developing, it is OK to temporarily hard-code a parameter at the top of your script.
- Monitoring output
  - In collaboration with PIs, identify 5-10 crucial estimates.
  - Write unit tests so that when a crucial estimate changes, we catch it right away.
- Clean code shouldn’t throw warnings. If code does throw a warning, have a comment explaining why.
- Review output carefully. View the data frame at every stage.
  - PG addendum: the bullet above is too brief to be useful. Needs elaboration.
- When using `summarise()` or `mutate()` you will often be tempted to use `na.rm = TRUE` to drop values that are `NA`. This is allowed but risky. When you do, add a unit test directly above that clarifies which columns have NA values and which do not, ideally with a description of why those columns have NA values (see the sketch after this list).
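A minimal sketch of the single-parameter-file pattern, assuming a hypothetical `parameters.csv` with `name` and `value` columns:

```r
library(readr)

# Hypothetical analysis/input/parameters.csv:
#   name,value
#   discount_rate,0.05
#   start_year,2018
params <- read_csv("analysis/input/parameters.csv")
discount_rate <- params$value[params$name == "discount_rate"]
```

And a sketch of the `na.rm = TRUE` guard test, with a hypothetical data frame and hypothetical reasons for missingness:

```r
library(dplyr)
library(testthat)

# Hypothetical data: `income` is NA for survey non-respondents;
# `state` should never be missing.
df <- tibble(state = c("IL", "WI", "IL"),
             income = c(50000, NA, 62000))

test_that("NAs appear only where expected", {
  expect_false(any(is.na(df$state)))  # state is always observed
  expect_true(any(is.na(df$income)))  # non-respondents lack income
})

df %>%
  summarise(mean_income = mean(income, na.rm = TRUE))
```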
### Organization
- Separate slow code from fast code.
  - In some cases you may want to put the slow code in a separate script.
  - Plotting code is always fast. If the underlying data frame is generated by slow code, save it. That way you can edit the plot later without re-running the slow code (see the sketch below).
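A sketch of the save-the-slow-output pattern (the file path and the "slow" step are hypothetical):

```r
library(readr)
library(ggplot2)

# Slow script: pretend this computation takes hours, and save its output once
plot_df <- data.frame(x = 1:10, y = cumsum(rnorm(10)))
write_csv(plot_df, "analysis/release/plot_df.csv")

# Fast plotting script: reads the saved output, so cosmetic edits are cheap
plot_df <- read_csv("analysis/release/plot_df.csv")
ggplot(plot_df, aes(x, y)) +
  geom_line()
```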
### Joins
- Explicitly specify join keys.
- Count the number of unique values of the join key in both source tables using `test_that` functions.
- Use `anti_join()` to see which values will be dropped from each table by the join. Delete this once you are satisfied that you understand why these rows will be dropped.
- Count the number of unique values of the join key in the joined table using `test_that` (see the sketch after this list).
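A sketch of this workflow with hypothetical tables keyed on `state_abb`:

```r
library(dplyr)
library(testthat)

claims <- tibble(state_abb = c("IL", "WI", "PR"), n_claims = c(10, 5, 2))
benefits <- tibble(state_abb = c("IL", "WI", "NY"), max_benefit = c(484, 370, 504))

# Count unique join keys in both source tables
test_that("join key is unique in each source table", {
  expect_equal(n_distinct(claims$state_abb), nrow(claims))
  expect_equal(n_distinct(benefits$state_abb), nrow(benefits))
})

# Which rows will the join drop? Delete once you understand why.
claims %>% anti_join(benefits, by = "state_abb")    # PR has no benefits row
benefits %>% anti_join(claims, by = "state_abb")    # NY has no claims row

# Count unique join keys in the joined table
joined <- inner_join(claims, benefits, by = "state_abb")
test_that("joined table keeps the expected keys", {
  expect_equal(n_distinct(joined$state_abb), 2)
})
```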
### Plots
- Writing the data underlying a plot to a `csv` usually increases efficiency because then you can make cosmetic changes without re-running all the code.
  - By "the data" we mean the dots or bars that are plotted, not the microdata.
- We recommend that you write a `csv` for any plot which is merged to `main`, used in the main stack of a slide deck, or when it takes a long time to re-run the underlying code (see the sketch below).
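A sketch, with hypothetical microdata, of writing the plotted points rather than the microdata:

```r
library(dplyr)
library(readr)

microdata <- tibble(month = rep(1:3, each = 100),
                    spend = rnorm(300, mean = 100))

# One row per plotted dot or bar; this, not `microdata`, goes in the csv
plot_df <- microdata %>%
  group_by(month) %>%
  summarise(mean_spend = mean(spend))

write_csv(plot_df, "analysis/release/fig_mean_spend_by_month.csv")
```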
### File names

File names should be meaningful and end in `.R`. Avoid using special characters in file names: stick with numbers, letters, `-`, and `_`.
```
# Good
fit_models.R
utility_functions.R

# Bad
fit models.R
foo.r
stuff.r
```
If files should be run in a particular order, prefix them with numbers. If it seems likely you’ll have more than 10 files, left pad with zero:
```
00_download.R
01_explore.R
...
09_model.R
10_visualize.R
```
If you later realise that you’ve missed some steps, it’s tempting to use 02a, 02b, etc. However, I think it’s generally better to bite the bullet and rename all files.
Pay attention to capitalization, since you, or some of your collaborators, might be using an operating system with a case-insensitive file system (e.g., Microsoft Windows or OS X) which can lead to problems with (case-sensitive) revision control systems. Prefer file names that are all lower case, and never have names that differ only in their capitalization.
## Language-specific

### R

#### General

- Follow the tidyverse style guide.
- lintr checks for adherence to the style guide. More info on automating lintr. A minimal usage sketch appears after this list.
- See this internal guide for (among other things) some tips on using `purrr` and on using `dplyr` joins.
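A minimal lintr sketch (assumes the lintr package is installed; the file name is hypothetical):

```r
library(lintr)

# Lint a single file against the default linters
lint("fit_models.R")

# Or lint every R file in the current directory tree
lint_dir(".")
```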
#### Helpful resources
- R4DS textbook
- `ViewPipeSteps()` for double-checking intermediate output.
#### Common functions shared in gnlab
The script `prelim.R`, maintained in gnlab's template repo `template_repo` (link here), provides a number of useful functions shared in our lab:
- `fte_theme`
- `coef_label`
- `test_equal_stat`
- `winsor`
- `save_animation`
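We don't reproduce `prelim.R` here. As an illustration only, a winsorization helper in the spirit of `winsor` might look like the sketch below; the actual function's signature and behavior may differ.

```r
# Hypothetical sketch only; see prelim.R for the lab's actual `winsor`.
# Caps x at its lower and upper quantiles.
winsor_sketch <- function(x, lower = 0.01, upper = 0.99) {
  cutoffs <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, cutoffs[[1]]), cutoffs[[2]])
}

winsor_sketch(c(1, 2, 3, 1000))
```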
#### Stable paths in R
- Code should be easily movable between computers, and between Rmd and R scripts.
- Use one directory/folder to store all gnlab repos. Peter usually calls this folder `repo`.
  - JS: I think the bullet below is out-of-date and faulty. An easier, equally stable solution is to create the `repo` folder in your Home directory if you're on Mac and in your Documents directory if you're on PC. Then, for example, if you're working in the `rdfo` repo and use the file path `"~/repo/rdfo/"`, it will reference the correct folder on everyone's machine.
- Use `rprojroot` to determine where the base directory of your repo is.
  - Your R working directory needs to be "inside" the repo to find the proper root file. In a script that means you will have a code block such as:

    ```r
    if (Sys.getenv("USER") == "peterganong") {
      setwd("~/repo/strategic/")
    } else {
      setwd("~/gnlab/strategic/")
    }
    ```
  - Relative paths can be built off the repo root using a helper function such as:

    ```r
    make_path <- rprojroot::is_git_root$make_fix_file()
    out_path <- make_path("analysis/build/structural_model_fortran/")
    ```
  - Throughout your code, use the variables storing the path to refer to files.
- Use a `config.yml` file to manage paths outside of the repo (more details). A sketch using the `config` package appears below.
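A minimal sketch of the `config.yml` pattern, assuming the `config` package is installed and a hypothetical `data_path` entry:

```r
# Hypothetical config.yml at the repo root:
#   default:
#     data_path: "/project/gnlab/external_data"

# Reads the `default` configuration from config.yml in the working directory
data_path <- config::get("data_path")
input_file <- file.path(data_path, "raw_claims.csv")
```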
#### Positional indexing to copy over a base rate

- A clear way to do this is to use the `which()` function with a condition that identifies the observation you're looking for.
- An example using a csv file that is in the uieip repo:
```r
library(readr)
library(dplyr)
library(lubridate)

# Build weekly state-level initial (ic) and continued (cc) claims
state_ui <- read_csv("analysis/input/ar539.csv") %>%
  transmute(state_abb = st,
            week = mdy(c2),
            state_ic = c3 + c4 + c5 + c6 + c7,
            state_cc = c8 + c9 + c10 + c11 + c12 + c13)

state_ui %>%
  filter(year(week) >= 2020) %>%
  group_by(state_abb) %>%
  # Copy each state's claims in the week of 2020-03-14 over as a base rate
  mutate(start_covid_ic = state_ic[which(week == ymd("2020-03-14"))],
         inc_ic_since_covid = state_ic - state_ic[which(week == ymd("2020-03-14"))])
```
- `start_covid_ic` is an example of copying over the base rate to a new column.
- `inc_ic_since_covid` is an example of how you can directly incorporate the base rate into your calculation without generating an extra column.
### Python

Follow the Google style guide.
- You might also find this Hitchhiker's guide helpful.
- HARK is written for Python 2.7. However, they anticipate migrating to Python 3 in the future, so where possible please write your code to be Python 3 compatible.