Pipelines: Best Practices - QuantGen/HPCC GitHub Wiki

  • Modularize: use one component per task (this is especially important when multiple people are working on the project)
    • How to organize?
    • How to implement?
      • A pipeline will likely contain different types of scripts (e.g., written in bash or R, or using external software such as PLINK)
      • I like to make sure that they have a similar interface, i.e., command name + arguments (in R, this can be done with the optparse package)
      • This approach also works well with Slurm, because each component can be submitted as a job with its own arguments
      • Each component should be minimal
        • Minimize output messages
        • Put functions that do the heavy lifting into separate files, or even better (especially with C/C++ functions), an R package
      • The usual software development best practices apply (see Code Complete by Steve McConnell)
        • Use a shared coding standard
        • Don't be too clever
        • Document your code
        • Use descriptive variable names (avoid 'x', 'tmp', ...)
  • Differentiate between data (what you start with), code, and output
    • Mirror the code directory structure in the output directory so that you can tell which script produced which output
  • Use version control (e.g., git)
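
As a rough illustration of the points above, here is a minimal sketch of a shell component with a "command name + arguments" interface that stays silent on success and derives its output directory from its location in the code tree. It assumes a whitespace-delimited input file with a header line and a numeric second column. The names `filter_samples`, `output_dir_for`, `--input`, `--output`, `--min-value`, and the `code/` and `output/` directories are hypothetical and not taken from any existing pipeline.

```shell
# Sketch of one pipeline component (all names here are hypothetical).

usage() {
    echo "Usage: filter_samples --input FILE --output FILE [--min-value N]" >&2
}

# Map a script path under code/ to the matching directory under output/,
# e.g. code/01_qc/filter_samples.sh -> output/01_qc
output_dir_for() {
    local script_path rel_dir
    script_path="$1"
    rel_dir="$(dirname "${script_path#code/}")"
    echo "output/${rel_dir}"
}

# Keep the header line plus every row whose second column is >= --min-value.
# Prints nothing on success; returns 2 on a usage error.
filter_samples() {
    local input output min_value
    input=""
    output=""
    min_value=0

    while [ "$#" -gt 0 ]; do
        # Every option takes exactly one value.
        if [ "$#" -lt 2 ]; then usage; return 2; fi
        case "$1" in
            --input) input="$2" ;;
            --output) output="$2" ;;
            --min-value) min_value="$2" ;;
            *) usage; return 2 ;;
        esac
        shift 2
    done

    if [ -z "$input" ] || [ -z "$output" ]; then usage; return 2; fi

    # Create the output directory if it does not exist yet.
    mkdir -p "$(dirname "$output")"
    awk -v min="$min_value" 'NR == 1 || $2 >= min' "$input" > "$output"
}
```

If these functions were saved in a script ending with `filter_samples "$@"`, the component could be called like any other command, for example `code/01_qc/filter_samples.sh --input data/samples.txt --output output/01_qc/samples.txt --min-value 5`. The same command line could then be handed to Slurm, for instance through `sbatch --wrap`.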