Developer guide - CancerRegistryOfNorway/NORDCAN GitHub Wiki

Making releases

See https://github.com/CancerRegistryOfNorway/NORDCAN/blob/master/collect_nordcan_participant_instructions.R.

R package guidelines

R package development follows the guidelines presented in https://r-pkgs.org/ unless otherwise specified in this document.

nordcancore

  • all metadata concerning all nordcan (input) datasets such as cancer_record_dataset and their columns
  • general utilities not specific to some usage (not specific e.g. to computing statistics)

nordcanpreprocessing

  • applies metadata and utilities from nordcancore to verify and enrich nordcan (input) datasets such as cancer_record_dataset
  • contains solutions specific to preprocessing, and not e.g. statistics computation
  • verification functions may be used in other packages that this package does not depend on

nordcansurvival

  • functions specific to survival statistics computation only
  • may depend on other nordcan packages above but does not necessarily need to

basicepistats

  • basic epidemiological statistics functions in general form, i.e. they can be applied to any epidemiological dataset
  • must not depend on any nordcan package

nordcanepistats

  • applies metadata and utilities from above nordcan packages to compute statistics
  • functions specific to statistics computation

New packages can be made if a new "theme" of functionalities is needed. Pre-existing packages may also be split if a package becomes too diverse in its functionalities; e.g. perhaps if we end up with a large number of verification functions, they might live in their own package.
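The dependency rules above (for example, that basicepistats must not depend on any nordcan package) could be checked mechanically. The following is only a minimal sketch in base R: the helper name find_nordcan_dependencies, the made-up DESCRIPTION contents, and the idea of recognising nordcan packages by their name prefix are all assumptions made for illustration, not part of the NORDCAN code base.

```r
# Hypothetical helper: return the nordcan packages listed as dependencies
# in a DESCRIPTION file. Matching on the "nordcan" name prefix is an
# assumption made for this sketch.
find_nordcan_dependencies <- function(description_path) {
  fields <- c("Depends", "Imports", "LinkingTo")
  desc <- read.dcf(description_path, fields = fields)
  dep_strings <- unname(desc[1L, ])
  dep_strings <- dep_strings[!is.na(dep_strings)]
  pkg_names <- unlist(strsplit(dep_strings, ",", fixed = TRUE))
  # drop version requirements such as "(>= 1.12.0)" and surrounding whitespace
  pkg_names <- trimws(gsub("\\(.*\\)", "", pkg_names))
  pkg_names[grepl("^nordcan", pkg_names)]
}

# made-up DESCRIPTION contents that break the basicepistats rule
description_path <- tempfile()
writeLines(
  c(
    "Package: basicepistats",
    "Version: 0.0.1",
    "Imports: data.table (>= 1.12.0), nordcancore"
  ),
  description_path
)
offending_pkgs <- find_nordcan_dependencies(description_path)
```

A check like this could live in a package test, failing when offending_pkgs is non-empty.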

Documentation guidelines

  • all requirements must be documented (this is done primarily where datasets supplied by the user are described)
  • all objects, functions, commands, etc. made available to the user must be documented
    • e.g. internal R package functions do not need to be documented, but can be if there's a special need for it; exported functions should be documented using the R-specific documentation system
  • all non-obvious statistical methods should be documented (or have a reference)
    • e.g. instead of explaining how relative survival is computed, just refer to the publication where the method is presented
    • e.g. simply counting rows does not really need explaining, whereas counting prevalent cases is more complicated and does need to be explained
  • because statistical methods need to be explained, new datasets and columns derived from the user-supplied datasets must also be documented
    • e.g. how the various exclusion rules in preprocessing were defined in code must be documented
    • however, naturally any temporary, only internally used, short-term columns / datasets / other objects do not need to be documented
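As an illustration of documenting an exported function, the sketch below assumes roxygen2-style comments as described in https://r-pkgs.org/. The function count_rows_by_stratum and its arguments are hypothetical and exist only to show where requirements on user-supplied data and the description of the output would be written.

```r
#' Count rows by stratum
#'
#' Hypothetical example function. Simply counts the rows of `x` within
#' each level of the column named in `by`.
#'
#' @param x `[data.frame]` (mandatory, no default)
#'   dataset supplied by the user; must contain the column named in `by`
#' @param by `[character]` (mandatory, no default)
#'   name of exactly one column in `x` that defines the strata
#' @return A `data.frame` with columns `stratum` (character) and
#'   `n` (integer, number of rows in the stratum).
#' @export
count_rows_by_stratum <- function(x, by) {
  stopifnot(
    is.data.frame(x),
    is.character(by),
    length(by) == 1L,
    by %in% names(x)
  )
  counts <- table(x[[by]])
  data.frame(
    stratum = names(counts),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

my_counts <- count_rows_by_stratum(data.frame(sex = c("a", "b", "a")), by = "sex")
```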

General code style guidelines

We name nothing using CamelCase or snail.case. Instead, everything should be named using snake_case.

In some languages different cases are used to identify different kinds of objects, e.g. classes may be written in CamelCase. However, neither R nor Stata relies on writing lots of classes, so we don't need that distinction.

Good: my_var, my_fun

Bad: myvar, myfun, MyVar, MyFun, myVar, myFun, My_Var, My_Fun, etc.

R style guidelines

We follow Google's R style guide (https://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html) with the exception that we name nothing using CamelCase or snail.case but always use snake_case.

General

Object names

Good: my_var, my_fun

Bad: myvar, myfun, MyVar, MyFun, myVar, myFun, My_Var, My_Fun, etc.

Object element names

Good:

my_list <- list(my_elem_1 = something, my_elem_2 = something_else)

Bad:

my_list <- list(MyElem1 = something, MyElem2 = something_else)
my_list <- list(Elem1 = something, Elem2 = something_else)
my_list <- list(My_Elem1 = something, My_Elem2 = something_else)

Assignment

Let's avoid = for assignment. If <- is used consistently for assignment then assignments are easy to identify in code (and easy to separate from e.g. usage of function arguments such as my_fun(arg = obj)).

Good:

my_var <- 1

Bad:

my_var = 1

If-else

Be mindful of appropriate spacing and write if-else conditions on multiple lines.

Good:

if (a == 1) {
  my_var <- "something"
} else {
  my_var <- "something else"
}

Bad:

if (a == 1) {my_var = "something"} else {my_var = "something else"}
if(a == 1){
my_var = "something"
}else{
my_var = "something else"
}

Functions

Refer to functions from other packages explicitly using pkg::fun instead of just fun. There are exceptions. You don't need to do that with e.g. data.table's :=, although it is actually a function.

Good:

my_fun <- function(x) {
  # ... some code ...
  dt <- data.table::data.table(a = something, b = something_else)
  dt[, "c" := something_else_again]
  # ... some code ...
  something <- basicepistats::stat_count(x)
  # ... some code ...
  return(result)
}

Bad:

my_fun <- function(x) {
  # ... some code ...
  dt <- data.table(a = something, b = something_else)
  dt[, "c" := something_else_again]
  # ... some code ...
  something <- stat_count(x)
  # ... some code ...
  return(result)
}

Function arguments can have default values. A default value in an argument implies that it is optional (i.e. the function should do at least something with the default value). If an argument does not have a default, this implies that the argument is mandatory, and the user must consciously supply some value to it to use the function.

my_fun <- function(arg1, arg2 = 2) {
  arg1 + arg2
}

# default value should be easy to read. for more complex defaults, do this:
my_fun <- function(arg1, arg2 = NULL) {
  if (is.null(arg2)) {
    # do something complex
    arg2 <- 10^2
  }
  arg1 + arg2
}
# NULL default can also be used to communicate that something "extra"
# can be done if it is non-NULL:
my_fun <- function(arg1, arg2, arg3 = NULL) {
  result <- arg1 + arg2
  if (!is.null(arg3)) {
    result <- result / arg3
  }
  result
}

User-available functions can use the ellipsis ... if needed. However, internal functions should avoid this, and instead additional arguments passed to other functions should be made explicit:

user_available_fun <- function(x, ...) {
  arg_list <- list(...)
  arg_list[["x"]] <- x
  result <- do.call(internal_fun_1, arg_list)
  return(result)
}

internal_fun_2 <- function(x, internal_fun_1_arg_list = list(arg_2 = 1)) {
  arg_list <- internal_fun_1_arg_list
  arg_list[["x"]] <- x
  do.call(internal_fun_1, arg_list)
}

User-available functions must make at least some basic assertions on their inputs. However, assertions that require heavier computation are not necessary (such as inspecting the validity of all values in a dataset); it is usually enough to assert that the dataset has the correct column names and types. Internal functions can also have basic assertions. It is particularly important to assert the class of input arguments since R does not fix them in advance when the function is defined. The functions dbc::assert_user_input_is_data.table_with_required_names and dbc::assert_prod_input_is_data.table_with_required_names are strongly recommended: they produce different error messages depending on whether the user has made a mistake or whether there is a programming mistake.

# user-available function: a failed assertion is reported as a user mistake
user_available_fun <- function(x, ...) {
  dbc::assert_user_input_is_data.table_with_required_names(x, required_names = some_names)
  arg_list <- list(...)
  arg_list[["x"]] <- x
  result <- do.call(internal_fun_1, arg_list)
  return(result)
}
# internal function: a failed assertion is reported as a programming mistake
internal_fun_2 <- function(x, internal_fun_1_arg_list = list(arg_2 = 1)) {
  dbc::assert_prod_input_is_data.table_with_required_names(x, required_names = some_names)
  arg_list <- internal_fun_1_arg_list
  arg_list[["x"]] <- x
  result <- do.call(internal_fun_1, arg_list)
  return(result)
}
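
To make concrete what a cheap assertion on column names and types involves, here is a minimal base-R sketch. It is not the dbc implementation; the function assert_has_columns_of_class and the column names year and sex are hypothetical and chosen only for illustration.

```r
# Hypothetical stand-in for a basic assertion: checks the class of x, the
# presence of required columns, and the class of each required column.
# It does not inspect individual values, so it stays cheap on large data.
assert_has_columns_of_class <- function(x, required_classes) {
  if (!is.data.frame(x)) {
    stop("x must be a data.frame (a data.table also qualifies); got class: ",
         paste(class(x), collapse = ", "))
  }
  missing_names <- setdiff(names(required_classes), names(x))
  if (length(missing_names) > 0L) {
    stop("x is missing required columns: ",
         paste(missing_names, collapse = ", "))
  }
  for (col_nm in names(required_classes)) {
    if (!inherits(x[[col_nm]], required_classes[[col_nm]])) {
      stop("column ", col_nm, " must be of class ",
           required_classes[[col_nm]], "; got: ",
           paste(class(x[[col_nm]]), collapse = ", "))
    }
  }
  invisible(NULL)
}

# made-up dataset and requirements
my_dataset <- data.frame(year = 2000L, sex = 1L)
assert_has_columns_of_class(my_dataset, c(year = "integer", sex = "integer"))
```

Calling the function with a requirement that the dataset does not meet, e.g. c(region = "integer"), stops with an error that names the missing column.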

Avoid using variables external to the function as much as humanly possible. This improves the readability of the function and makes it more general. Of course, you can use other functions inside your own without worry.

my_var <- 1
my_fun <- function(x) x + my_var # bad
my_fun <- function(x, y = my_var) x + y # good

# - if you really need to use an external variable, consider writing a tiny
#   retrieval function for fetching it. or at the very least name the external
#   variable so clearly that there can be no mistake as to where it's from.
.__MY_SECRET_EXTERNAL_VARIABLE <- 1
my_fun <- function(x) .__MY_SECRET_EXTERNAL_VARIABLE + x # OK-ish

get_my_var <- function() 1
my_fun <- function(x) get_my_var() + x # better

Specific to using data.tables

Assignment

data.table has its own syntax for assignment to columns. Using the data.frame syntax for assignment on a data.table can cause problems (it can raise warnings)!

Good:

dt[, "my_col" := my_values]
dt[my_other_col == 1, "my_col" := conditional_value]
dt[, "my_col" := NULL]

Bad:

dt[, "my_col"] <- my_values
dt[["my_col"]] <- my_values
dt$my_col <- my_values
dt$my_col[dt$my_other_col == 1] <- conditional_value
dt[, "my_col"] <- NULL
dt[["my_col"]] <- NULL

When function output is a data.table

data.table has a quirk where a data.table output by a function should be returned as dt[] instead of dt to ensure printing will work. The object is returned correctly in either case, only the printing will not work immediately when output by a function using just dt.

Good:

my_fun <- function(x) {
  # .. some code
  return(dt[])
}

Bad:

my_fun <- function(x) {
  # .. some code
  return(dt)
}

git guidelines version 0.1

Committing is an important skill. Commit early, commit often; every bugfix, feature, refactoring or reformatting should go into its own commit. Commit when you have finished a "logical part".

Don't

  • Don't commit code that doesn't work - test before committing.
  • Don't commit code you don't understand
  • Don't commit if other developers disagree

git commit message

Explain what and why (NOT how).

example subject line: [#32] bug: Correct infile argument to R survival_statistics()

Format: [tag of issue] label: subject line in the IMPERATIVE mood, preferably <= 72 chars; it can be longer if it is one sentence only.

Imperative test: the subject line should complete the sentence "If applied, this commit will ...", e.g. "Update getting started documentation".

Use the body for a longer explanation. Explain what and why (NOT how).

tags (to be reviewed and harmonized with labels used in github/gitlab):

  • feat: The new feature you're adding to a particular application
  • bug: A bug fix
  • chore: Regular code maintenance.
  • refactor: Refactoring a specific section of the codebase
  • style: Features and updates related to code styling
  • test: Everything related to testing
  • docs: Everything related to documentation
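
The subject line format could be checked with a small helper such as the one below. This is only a sketch: the function name check_commit_subject is hypothetical, and treating the issue tag as mandatory and as always having the form [#number] is an assumption based on the single example subject line above.

```r
# Hypothetical checker for a commit subject line of the form
# "[#32] bug: Correct infile argument to R survival_statistics()".
check_commit_subject <- function(subject) {
  labels <- c("feat", "bug", "chore", "refactor", "style", "test", "docs")
  # assumption: "[#<issue number>] <label>: <subject text>"
  pattern <- paste0(
    "^\\[#[0-9]+\\] (", paste(labels, collapse = "|"), "): .+$"
  )
  c(
    has_expected_format = grepl(pattern, subject),
    is_at_most_72_chars = nchar(subject) <= 72L
  )
}

subject_check <- check_commit_subject(
  "[#32] bug: Correct infile argument to R survival_statistics()"
)
```

The length check is reported separately because the 72-character limit is a preference rather than a strict rule. Whether the subject line is in the imperative mood still needs a human reader.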