Instructions for developers - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki

This page is written for members of the developer team who are updating the pipeline or adding new analyses. At present only the developer group is able to contribute to the pipeline - if you are interested in joining this group please contact either Josine Min ([email protected]) or Eilis Hannon ([email protected]). To implement new analyses in the pipeline you will need an approved proposal. You can find proposal forms here.

This page contains two levels of information. First, it outlines how edits are to be made. Second, it describes the architecture of the code and provides some orientation to what is already included and where it can be found.

Process for proposing changes to the pipeline

To preserve code that has been tested and validated for release and is currently being run by cohort analysts, you are no longer able to make edits directly to the main branch. Instead we have implemented an approvals process. Please follow the process below to ensure that your changes are incorporated into the pipeline in a timely manner with minimal risk to the wider project.

  1. Log an issue. Issues are how we keep track of what is broken and needs attention, and who is fixing it. The first step, before any changes are made, is to log them as an issue on the GitHub repository. Please use the appropriate template and provide as much detail as possible. If you are fixing something that someone else has already reported, you can skip this step.

  2. Assign the issue. Let everyone else know you are working on the issue by assigning it to yourself. This can be done on the issue web page. On the right-hand side at the top is a label Assignees. There is a link to assign yourself, or if you click the cogwheel you can assign someone else.

  3. Create a branch. Create a feature branch off main to do your development on. This takes a copy of the scripts on main and separates them so that you can make changes without affecting the functionality of the main branch. Give your branch a meaningful name. You can fast-track this process by using the Create a branch link further down the right-hand side menu on the issue page, under the label Development. This is advisable because it gives the branch a name that links to the issue, and creates a link between the issue and the code development for others to find easily.

  4. Checkout the branch locally. On your local system, you are probably viewing the code base for the main branch. You can confirm this by running

git status

The first line of this output will be a statement like On branch main. Before you make any changes to the code (i.e. try to fix the bug), you need to switch to the development branch you created. You can do this as follows:

git checkout <insert branch name>

Rerunning git status will confirm successful switching. If you use the Create a branch functionality described in step 3, GitHub will provide you with the correct commands to run on your local server.

  5. Make and test your changes. You can now edit your local scripts. Commit changes as you usually would with a meaningful commit message. So long as you stay on the feature branch, the local copy of the scripts will have any edits made on this branch, so you can test them by running the pipeline as you typically would. Others can also contribute to this branch by checking it out themselves. Just remember to git push and git pull regularly to ensure that your changes are uploaded to GitHub so others can mirror them on their systems, and so you can download changes others make to the branch.

  6. Submit a pull request. Once you are happy that you have fixed the issue, you can submit your changes to be merged into the main pipeline. You can inform us that a change is ready to be merged by submitting a pull request. Through this form please indicate which branch contains your fix, and detail what changes are included.

  7. Wait for your changes to be approved. To protect the pipeline, we have implemented a review of suggested changes prior to merging. You can assign someone to review your proposed changes.

  8. Merge changes. Once someone has approved your changes you will be able to merge them into the pipeline. Note that while you were working on your changes, others might have pushed changes to the main branch that conflict with yours. These will need to be resolved before you can merge successfully. If this occurs, please reach out for further support if you need it. Once your branch is in a position to be merged, there will be a green merge button to click.

  9. Clean up. Make sure the issue has been closed (this might happen automatically if you link the pull request to the issue(s) you are dealing with), and delete the branch.

  10. Inform analysts to run git pull. To get a local copy of these changes, which are now on the main branch, all analysts will need to run git pull to ensure they are working with the latest version.
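
The branch-and-commit cycle in steps 3-5 can be sketched end to end. Everything below is a throwaway example run in a temporary directory - the repository, file name and issue-linked branch name are hypothetical, not the real godmc_phase2 checkout:

```shell
set -e

# Work in a throwaway repository so the sketch is self-contained.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email "[email protected]"   # placeholder identity
git config user.name "Example Dev"

# Simulate the existing main branch with one pipeline script.
echo "echo checking data" > 01-check_data.sh
git add 01-check_data.sh
git commit -qm "initial pipeline scripts"

# Steps 3-4: create and switch to an issue-linked feature branch
# (the name is hypothetical; GitHub's "Create a branch" link suggests one).
git checkout -q -b 42-fix-upload-error

# Step 5: edit, then commit with a meaningful message.
echo "echo fixed upload path" >> 01-check_data.sh
git commit -qam "Fix upload path (closes #42)"

git branch --show-current   # confirms we are on 42-fix-upload-error
```

In the real workflow you would follow this with git push to share the branch, and git pull to pick up collaborators' commits before submitting your pull request.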

As there are lots of groups working on the pipeline, we encourage you to push and pull regularly to ensure you are up to date, even if you are not actively developing.

Releases of the pipeline

In order to ensure analysts are using the latest version of the code, we are implementing versioning. This means that if you make an update to the code that is critical to your module, you can enforce a check that raises an error if a cohort's scripts are behind where they should be.

There are two elements to get this to work. First, we tag various points in the code history with version numbers. The initial release was version 1.0.0. If there is a critical upgrade to the pipeline that you need all cohorts to use, please contact either Eilis or Josine to tag a new release. The second element is to record, in the resources/logs/versions.txt file, the tag of the release that you require cohorts to use. We can advise you what to set this to if you need to update it. Note that this is the minimum release version that the module requires.
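Only the maintainers tag real releases, but the mechanics look like this. The sketch below runs in a throwaway repository, and the one-line versions.txt shown here is a simplification - check the existing file in the repository for its real layout:

```shell
set -e

# Throwaway repository so the sketch is self-contained.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email "[email protected]"   # placeholder identity
git config user.name "Example Dev"
git commit -q --allow-empty -m "initial release"

# Tagging a release (in godmc_phase2 this is done by Eilis or Josine).
git tag v1.0.0

# Record the minimum release a module requires (simplified format).
mkdir -p resources/logs
echo "v1.0.0" > resources/logs/versions.txt

git describe --tags   # reports which release this checkout corresponds to
```

A version check in a stage script can then compare the output of git describe --tags against the value recorded in versions.txt and stop with an error if the checkout is too old.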

A schematic of the process can be seen below. Each circle represents a commit, and each line represents a branch.

```mermaid
---
title: Workflow for subsequent pipeline development
---
gitGraph
    commit tag: "v1.0.0"
    branch cell-type-qtls
    commit
    commit
    commit
    checkout main
    branch sex-chr-qtls
    commit
    checkout main
    merge cell-type-qtls tag: "v1.0.1" id: "PR1"
    branch fix-upload-error
    commit
    checkout main
    merge fix-upload-error tag: "v1.0.2" id: "PR2"
    checkout sex-chr-qtls
    commit
    commit
    checkout main
    merge sex-chr-qtls tag: "v1.0.3" id: "PR3"
```


Design of the pipeline

Background to the pipeline:

We chose to set up a new GitHub repository for the next phase of GoDMC. However, many of the original scripts, instructions etc. were copied over from the original repository. It is worth looking there first to see if there is a previous script you can adapt. The pipeline is hosted on git as it is a fantastic tool for collaborative software development. If you have never used git before, you can look at some training pages here: git course and introducing git. Please use the issues functionality to report bugs, to document what you are working on and to keep up to date with the development status of the pipeline.

Pipeline structure and logistics

The pipeline works because we require a consistent data format and directory structure. It is designed to be run on the command line by executing a series of sequential bash scripts. Each of these scripts is numbered, where the number refers to the stage of the pipeline. Where there are multiple bash scripts for a stage, these are then lettered to indicate the order of the scripts within that stage (e.g. 02axxx.sh, 02bxxx.sh). These stages are defined and referred to throughout the wiki. The bash scripts co-ordinate the pre-processing steps and analyses by calling on other scripts, software programmes, input data and reference data which are located in the following folders:

  • input_data this is where the analysts will put the methylation, genotype and covariate data for their cohorts.

  • processed_data this is where outputs of the pre-processing steps are stored. There are a number of sub-folders within this folder. This is the location for any temporary/intermediate files that need to be used later on in the pipeline but won't be uploaded as results.

  • resources this is where associated data/binaries/scripts that the pipeline needs to run are to be stored (e.g. allele frequencies of SNPs, weights for calculating smoking scores, R scripts that are called in the shell scripts). If you use software other than R then you need to add your binaries here: /resources/bin. Other resources are organised into folders.

  • results this is where all results files are to be stored. It is the files in this folder that will be uploaded and shared across cohorts. Please note you are not allowed to add any individual-level data here; you can only add summary statistics. This can include plots.

Note that analysts should only be expected to deposit data in the input_data folder. Anything you want to share with analysts should go in the resources folder. Git doesn't work well with large files; in this case you may need to host them somewhere else and have the analysts download them. Commands for this should be included in 01-check_data.sh.
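A common pattern for a stage script such as 01-check_data.sh is to fetch a large externally hosted file only if it is missing. Everything in this sketch - the URL, filename and variable names - is a hypothetical placeholder, not a real project resource:

```shell
set -e
cd "$(mktemp -d)"   # throwaway directory for the sketch

# Hypothetical reference file hosted outside git because of its size.
REF_URL="https://example.org/godmc/snp_frequencies.txt.gz"   # placeholder URL
REF_FILE="resources/reference/snp_frequencies.txt.gz"        # placeholder path

mkdir -p "$(dirname "$REF_FILE")"

if [ ! -f "$REF_FILE" ]; then
    # In a real script this line would be an actual download, e.g.:
    # curl -fsSL -o "$REF_FILE" "$REF_URL"
    echo "Reference file missing - would download from $REF_URL"
fi
```

Guarding the download this way means analysts who re-run the stage do not repeatedly fetch the same large file.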

As well as the consistent data format and directory structure there are two key files that make the pipeline run.

  1. config file. This is where the cohort analysts can provide further details on their cohort that are needed for the pipeline. It is a series of variables that they need to complete. We provide an example of what this should look like. It is worth looking in here to see what variables are already defined that you may need to make use of.

  2. ./resources/parameters file. This is not edited by the analyst, but is used by the developers to specify the location of log files, results files, software, binaries etc as bash variables. If you see a variable referred to in a script it is likely defined in this file.
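
The two files combine at runtime by simply being sourced as bash. The sketch below mocks both files in a temporary directory; the variable names are illustrative only - check the real config template and resources/parameters for the actual names:

```shell
set -e
cd "$(mktemp -d)"   # throwaway directory for the sketch

# Mock cohort config (in reality the analyst fills in a provided template).
cat > config <<'EOF'
study_name="example_cohort"
EOF

# Mock developer-maintained parameters file.
mkdir -p resources
cat > resources/parameters <<'EOF'
section_01_logfile="resources/logs/log_01.txt"
EOF

# How a stage script typically picks both up:
source ./config
source ./resources/parameters
echo "Running stage for ${study_name}; logging to ${section_01_logfile}"
```

Because everything is plain bash variables, any variable defined in either file is available to every stage script that sources them.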

Major changes from phase 1

We have moved the generation of covariate and DNAm scores (including cell counts) from script 04a to script 03a. This was originally here: https://github.com/MRCIEU/godmc.

Other tips

  1. Familiarise yourself with what is already in the pipeline - it is probable that many of the covariates have already been derived, for example. Please reuse as much as possible and minimise replication. If you have an objection to the way something has been implemented, please raise an issue for discussion.
  2. You need to write a wiki page with instructions. This needs to explain to the analyst what this stage of the pipeline is doing. Please include here your contact details for analysts to use if they run into trouble.
  3. We suggest you set up notifications for issues so you can be kept informed of changes to the pipeline.
  4. You need to add installation instructions for your R packages [here](System-requirements)
  5. In resources/datacheck/requirements.R you need to add the R packages that analysts need for their analysis
  6. In resources/datacheck/covariates.R we check covariates. Genetic PCs will be generated in script 02.
  7. In resources/logs/versions.txt, specify the required version of your scripts. This should be equal to or earlier than the latest tag of the pipeline. If the pipeline gains new tagged versions, make sure to update your script's required version accordingly.
  8. Bear in mind we have cohorts with related samples, so your analyses need to consider both related and unrelated data.
  9. We also have cohorts with a single sex (e.g. females only) or a single age, so your scripts need to account for this.
  10. Each stage should implement a check to see whether all results files have been generated. There is a generic bash script for this purpose which needs to be edited for each stage. This runs with the command check_upload.sh 04 check.
  11. Each stage needs to implement its own upload command to upload all files from the results folder to the desired upload location. It is your responsibility to ensure that all the necessary files are uploaded and uploaded to the correct location. For some modules we use check_upload.sh 04 upload.
  12. Please note that for upload you need to encrypt the results files. You can use gpg -v -o cohortname_01.tgz.aes -c --cipher-algo AES256 cohortname_01.tgz. For decryption you can use gpg --pinentry-mode loopback --decrypt --output "cohortname_01_decrypted.tgz" "cohortname_01.tgz.aes".
  13. We already have a script to generate a table with cohort descriptives. This is in script 01 here: https://github.com/genetics-of-dna-methylation-consortium/goDMC_phase2/blob/main/resources/datacheck/collect_descriptives.R. Please add to this if needed.
  14. We advise that each script is tested by other developers. Please get in touch when you are ready for this.
  15. We advise that all code is reviewed by other developers. Please get in touch when you are ready for this.

Data preprocessing

Most analyses will require running modules 01-03. After running module 03 (03a-03f), you will have:

1/ Methylation data of autosomal CpGs

transformed_methylation_adjusted.RData

untransformed_methylation_adjusted.RData

transformed_methylation_adjusted_pcs.RData

untransformed_methylation_adjusted_pcs.RData

2/ Methylation data of sex chromosome CpGs

transformed_methylation_adjusted.[Female.chrX / Male.chrX / Male.chrY].RData

untransformed_methylation_adjusted.[Female.chrX / Male.chrX / Male.chrY].RData

transformed_methylation_adjusted_pcs.[Female.chrX / Male.chrX / Male.chrY].RData

untransformed_methylation_adjusted_pcs.[Female.chrX / Male.chrX / Male.chrY].RData

Details about these RData files

RData files starting with transformed_methylation_adjusted and untransformed_methylation_adjusted are the output of 03b (adjusting methylation for age, sex, smoking and cell counts). These covariates are in ./processed_data/methylation_data/all_covariates.txt. RData files starting with transformed_methylation_adjusted_pcs and untransformed_methylation_adjusted_pcs are the output of 03d (adjusting methylation for the covariates in ./processed_data/methylation_data/all_covariates.txt plus methylation PCs).

The differences between transformed RData and untransformed RData are:

transformed RData: RINT(CpG) -> adjust covariates -> RINT(residuals)

untransformed RData: CpG -> adjust covariates and retain residuals

Please choose the appropriate data for your downstream development.