Coding Good Practices and Some Tips - EpiModel/EpiModeling GitHub Wiki

Introduction

Before going into specifics, here are a few common rules one should try to follow when doing any kind of programming.

  • Fail Faster: It's better to crash early than to find out far in the future
  • Fail Clearer: When your program fails, it should be explicit and clear why
  • Code is read way more than it is written: Make the code clear for an external user (you in a week)
  • Make it work first, optimize second: when in doubt make the code safe
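
As an illustration of the first two rules, here is a minimal sketch of a check that stops as soon as a parameter is invalid and says exactly why. The helper `check_prob()` is hypothetical (it is not part of EpiModel) and the parameter name is only an example.

```r
# Hypothetical helper: validate a probability before the simulation starts.
check_prob <- function(value, name) {
  # Fail faster: reject anything that is not a single, non-missing number
  if (!is.numeric(value) || length(value) != 1 || is.na(value)) {
    stop("`", name, "` must be a single non-missing number, got: ",
         paste(format(value), collapse = ", "), call. = FALSE)
  }
  # Fail clearer: name the parameter and show the offending value
  if (value < 0 || value > 1) {
    stop("`", name, "` must be in [0, 1], got: ", value, call. = FALSE)
  }
  invisible(value)
}

check_prob(0.25, "syph.prob")  # valid: returns silently

# An invalid value stops immediately; here the message is captured for display
err_msg <- tryCatch(
  check_prob(1.5, "syph.prob"),
  error = function(e) conditionMessage(e)
)
err_msg
```

Running such checks once, when parameters are set up, is usually cheaper than tracking down a strange result many timesteps later.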

Coding Base Resources

It is highly recommended to read the R for Data Science book. It will give you a good foundation for your R coding journey.

Naming conventions

Attributes and parameters

Format

  1. always.use.dot.case
  2. never use "underscore" _
  3. only lower case

These are more than good practices: breaking these rules (1 and 2 in particular) can lead to actual errors because of what EpiModel expects.
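
The format rules can be checked mechanically. The sketch below uses a hypothetical helper, `is_dot_case()`, which is not part of EpiModel. It assumes that digits are allowed in names, something the rules above do not address.

```r
# Hypothetical helper: TRUE when a name is lower case, dot separated,
# and contains no underscore. Allowing digits is an assumption.
is_dot_case <- function(x) {
  grepl("^[a-z][a-z0-9]*(\\.[a-z0-9]+)*$", x)
}

is_dot_case(c("syph.inf.last", "syph_inf_last", "Syph.Inf", "prep"))
# TRUE FALSE FALSE TRUE
```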

Suffixes

Attributes and Parameters are often of the same type. We try to indicate what to expect from an attribute and parameter with a set of common suffixes:

  • Attributes:
    • .last: a timestep where something happened for the last time
    • .count: the number of times something happened
  • Parameters:
    • .int: an interval as a number of timesteps
    • .or: an odds-ratio
    • .prob: a probability [0, 1]
    • .rate: a rate, probability of something happening per timestep [0, 1]

Prefixes

To improve model clarity, try to use a common prefix for the attributes and parameters referring to the same thing.

  • gono.: things related to gonorrhea
  • syph.: things related to syphilis
  • ...
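
One practical benefit of a shared prefix is that related elements can be selected by name. The following is a minimal sketch using a plain named list. The values are made up, and the `gono.` parameter names are illustrative, built by analogy with the syphilis ones.

```r
# Illustrative parameter list: names follow the conventions, values are made up
param <- list(
  gono.prob = 0.20,
  gono.sympt.tx.prob = 0.90,
  syph.prob = 0.10,
  syph.sympt.tx.prob = 0.80
)

# The common prefix makes it easy to pull out everything related to syphilis
syph_names <- names(param)[startsWith(names(param), "syph.")]
syph_names
# "syph.prob" "syph.sympt.tx.prob"
```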

Common Elements

Finally, we often use the same components over and over. Try to use the commonly used terms to refer to them:

  • dx: diagnosis / diagnosed
  • ndx: not diagnosed
  • tx: treatment / treated
  • ntx: untreated
  • inf: infection / infected
  • test: test (diagnostic test / screening)
  • sympt: symptom / symptomatic
  • asympt: asymptomatic
  • ...

Examples

Here are some syphilis attributes and parameters:

  • Attributes:
    • syph.inf: is the node infected by syphilis? (0 or 1)
    • syph.inf.last: when did the last syphilis infection occur?
    • syph.inf.count: number of syphilis infections
    • syph.dx: is the node diagnosed with syphilis? (0 or 1)
    • syph.tx: is the node treated for syphilis? (0 or 1)
  • Parameters:
    • syph.prob: probability of getting infected by syphilis per sex act
    • syph.sympt.tx.prob: probability of getting treated for syphilis if symptomatic
    • syph.screen.hivneg.rate: per timestep probability of getting screened for syphilis if HIV negative

When in doubt, try to mimic the conventions used in the project.

Variables in Modules

dat sub-elements

When assigning a variable using get_attr or get_param, keep the original name of the attribute or parameter.

Inner Variables

All other variables should follow these rules:

  • snake_case
  • only lower case
  • never use "dot" .

This naming distinction makes it easy to tell what comes from dat apart from what has been defined elsewhere.

Similar to attributes and parameters, a set of common suffixes is often used:

  • _ids: positional IDs
  • _acts: positions in the act list
  • _name: name of something (not the thing)
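
The two conventions can be seen side by side in the sketch below. So that it runs on its own, it defines toy stand-ins for `get_attr()` and `get_param()` and a toy `dat` list. In a real module the accessors come from EpiModel and `dat` is the simulation object; the list layout used here is an assumption for illustration only, not EpiModel's internal structure.

```r
# Toy stand-ins so the sketch is self-contained (NOT the EpiModel versions)
get_attr <- function(dat, item) dat$toy_attr[[item]]
get_param <- function(dat, item) dat$toy_param[[item]]

# Toy `dat` with made-up values
dat <- list(
  toy_attr = list(syph.inf = c(1, 0, 1, 1), syph.dx = c(0, 0, 1, 0)),
  toy_param = list(syph.sympt.tx.prob = 0.8)
)

# Values read from `dat` keep their original dot.case names
syph.inf <- get_attr(dat, "syph.inf")
syph.dx <- get_attr(dat, "syph.dx")
syph.sympt.tx.prob <- get_param(dat, "syph.sympt.tx.prob")

# Variables created inside the module use snake_case and the common suffixes
elig_ids <- which(syph.inf == 1 & syph.dx == 0)  # positional IDs
n_elig <- length(elig_ids)
treated_ids <- elig_ids[runif(n_elig) < syph.sympt.tx.prob]
```

Reading the last lines, it is immediately visible that `syph.sympt.tx.prob` comes from `dat` while `elig_ids`, `n_elig` and `treated_ids` are local to the module.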

Attributes Default Values

Theoretically, an attribute can take any scalar value. However, things are easier when these rules are followed for the defaults:

  • avoid NA as much as possible
    • someone HIV negative should have the value 0 for hiv.dx and not NA
    • this limits the need to always check for NA edge cases
  • flags should be 0 or 1; it's rare to have a case where NA is useful
  • timestep attributes like .last should be -Inf by default: it signals that the event never occurred and is still a valid number for computations
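
A small sketch of why -Inf is a convenient default for a .last attribute. All values, the current timestep `at`, and the 5-timestep window are made up for illustration.

```r
at <- 10  # current timestep (illustrative)

# Node 2 has never been infected
syph.inf.last <- c(3, -Inf, 9)    # recommended default: -Inf
syph_inf_last_na <- c(3, NA, 9)   # same data using NA instead, for comparison

# Was the node infected within the last 5 timesteps?
recent_inf <- (at - syph.inf.last) <= 5
recent_inf_na <- (at - syph_inf_last_na) <= 5

recent_inf     # FALSE FALSE  TRUE
recent_inf_na  # FALSE    NA  TRUE

sum(recent_inf)     # 1
sum(recent_inf_na)  # NA, unless every call remembers na.rm = TRUE
```

With -Inf the "never happened" case falls out of the arithmetic, so no special handling is needed downstream.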

Subsetting the Population

Goal: get the positional IDs of nodes that match a given set of conditions

Example:

  • HIV positive nodes
  • Diagnosed for their HIV
  • Not on PrEP
  • Circumcised
  • Infected with syphilis
  • Treated for their syphilis

Simplest: explicitly list all conditions

elig_ids <- which(
  status == 1 &
    diag.status == 1 &
    prep == 0 &
    circ == 1 &
    syph.inf == 1 &
    syph.tx == 1
)

Optimized: remove redundant conditions

  • Being diagnosed with HIV implies HIV infection (in this model)
  • PrEP is not possible when diagnosed with HIV (in this model)
  • Syphilis treatment implies syphilis infection (in this model)

elig_ids <- which(
  diag.status == 1 &
    circ == 1 &
    syph.tx == 1
)

Subset id optimization:

Because syph.tx == 1 is very rare, it's faster to first get these nodes, then keep only the circumcised ones, and finally keep only the ones with an HIV diagnosis.

elig_ids <- which(syph.tx == 1)
elig_ids <- elig_ids[circ[elig_ids] == 1]
elig_ids <- elig_ids[diag.status[elig_ids] == 1]

This last optimization is only useful when one of the conditions is rare (~10% of the population).
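
To see that the three versions select the same nodes, here is a self-contained sketch on a toy population. The population size and all prevalences are made up, and the attributes are generated so that the three model-specific implications listed above hold by construction.

```r
set.seed(42)
n <- 10000  # toy population size

status <- rbinom(n, 1, 0.3)
diag.status <- status * rbinom(n, 1, 0.8)      # diagnosed implies infected
prep <- (1 - diag.status) * rbinom(n, 1, 0.2)  # no PrEP once diagnosed
circ <- rbinom(n, 1, 0.5)
syph.inf <- rbinom(n, 1, 0.2)
syph.tx <- syph.inf * rbinom(n, 1, 0.3)        # treated implies infected

# 1. Simplest: all conditions listed explicitly
full_ids <- which(
  status == 1 & diag.status == 1 & prep == 0 &
    circ == 1 & syph.inf == 1 & syph.tx == 1
)

# 2. Redundant conditions removed
reduced_ids <- which(diag.status == 1 & circ == 1 & syph.tx == 1)

# 3. Step-by-step subsetting, starting from the rarest condition
step_ids <- which(syph.tx == 1)
step_ids <- step_ids[circ[step_ids] == 1]
step_ids <- step_ids[diag.status[step_ids] == 1]

# All three give the same positional IDs
stopifnot(identical(full_ids, reduced_ids), identical(reduced_ids, step_ids))
```

The equality only holds because the toy data respects the implications. If one of them were violated, versions 2 and 3 would silently return extra nodes.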

Important Conclusion

As a general rule, always use the simplest version first and optimize later if relevant.

When in doubt about the redundancy of two conditions, keep both.

Never rely on "common sense": only act if you are sure of what is happening within the model.
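
One way to make such an assumption explicit is to check it in code before relying on it. The sketch below uses a hypothetical helper, `check_dx_implies_inf()`, and made-up attribute values. A check that passes on the current data is not a proof that the model guarantees the implication, but it makes a violation fail fast and clearly.

```r
# Hypothetical helper: stop if any diagnosed node is not HIV positive
check_dx_implies_inf <- function(status, diag.status) {
  bad_ids <- which(diag.status == 1 & status != 1)
  if (length(bad_ids) > 0) {
    stop("diag.status == 1 but status != 1 for positional IDs: ",
         paste(bad_ids, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Toy attributes where the implication holds: the check passes silently
check_dx_implies_inf(
  status = c(1, 1, 0, 1, 0),
  diag.status = c(1, 0, 0, 1, 0)
)

# Toy attributes where node 3 violates it: the error message is captured
bad_msg <- tryCatch(
  check_dx_implies_inf(status = c(1, 0, 0), diag.status = c(1, 0, 1)),
  error = function(e) conditionMessage(e)
)
bad_msg
```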