# Deployment
The workflow is based on the workflow management system Roddy. In order to run the workflow you need a working installation of Roddy. If you have never worked with Roddy before, please read about Roddy and its installation in the Roddy documentation, in particular about how to resolve plugin dependencies.
## Roddy Version and Dependent Plugin Versions
The specific Roddy and COWorkflowsBasePlugin versions needed for the workflow are listed in the `buildinfo.txt` file associated with the workflow version that you want to install. You should use a tagged version of the workflow to ensure that the information in that file is up to date.
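For illustration, such a file typically names the plugin dependency and the required Roddy API version; the version numbers below are placeholders, not a recommendation for any particular release:

```
dependson=COWorkflowsBasePlugin:1.0.1
RoddyAPIVersion=3.0
```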
## Conda
The workflow contains a description of a Conda environment as a Conda YAML file. A number of Conda packages from BioConda are required. You should set up the Conda environment at a centralized position available from all compute hosts.
First, add the required channels (including BioConda) to your Conda configuration:

```bash
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels bioconda-legacy
```
Then create the environment with something like:

```bash
conda env create -n AlignmentAndQCWorkflows -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/qcAnalysis/environments/conda.yml
```
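If the environment has to reside at a shared path reachable from all compute hosts (as suggested above), `conda env create` also accepts a prefix instead of a name; the shared path below is a placeholder:

```bash
# Create the environment under a shared prefix visible to all compute hosts
# (/shared/conda/envs is a placeholder path).
conda env create -p /shared/conda/envs/AlignmentAndQCWorkflows \
    -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/qcAnalysis/environments/conda.yml
```

Keep in mind that a prefix-based environment is activated by its path rather than by a name, so whatever you choose must remain consistent with the `condaEnvironmentName` variable described below.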
The name of the Conda environment is arbitrary but needs to be consistent with the `condaEnvironmentName` variable in the configuration. The default for that variable is set in `resources/configurationFiles/qcAnalysis.xml`.
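As a sketch, assuming the standard Roddy `<cvalue>` syntax, overriding that variable in a project configuration could look like this (surrounding XML abridged):

```xml
<configurationvalues>
    <!-- Must match the name given to `conda env create -n ...` above. -->
    <cvalue name="condaEnvironmentName" value="AlignmentAndQCWorkflows" type="string"/>
</configurationvalues>
```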
We successfully tested the Conda environment imported as described above on WGS data, using the parameters `useBioBamBamSort=false`, `markDuplicatesVariant=sambamba`, `workflowEnvironmentScript=workflowEnvironment_conda`, and `condaEnvironmentName=AlignmentAndQCWorkflows`.
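For illustration only, assuming a Roddy CLI that supports the `--cvalues` option (project name, analysis ID, dataset ID, and configuration path are placeholders), these parameters could be passed on the command line like this:

```bash
# Placeholder project/analysis/dataset identifiers; adjust to your setup.
roddy.sh run MyProject@WGS PID_001 \
    --useconfig=~/.roddy/applicationProperties.ini \
    --cvalues="useBioBamBamSort:false,markDuplicatesVariant:sambamba,workflowEnvironmentScript:workflowEnvironment_conda,condaEnvironmentName:AlignmentAndQCWorkflows"
```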
## A Note on the Conda Environment
For the AlignmentAndQCWorkflows plugin, there are the following differences between the DKFZ-ODCF software stack, which is reflected in the `resources/analysisTools/qcPipeline/environments/tbi-lsf-cluster.sh` environment file and in the XML configurations via the `_VERSION` variables, and the Conda environment:
| Package | DKFZ version | Conda version | Comment |
|---|---|---|---|
| biobambam | 0.0.148 | 2.0.79 | As long as you do not select `markDuplicatesVariant=biobambam`, this won't be a problem, as biobambam is only used for sorting BAMs. Note further that we did not manage to get bamsort 2 from Conda to run on a CentOS 7 VM. You can also use `useBioBamBamSort=false` to sort with samtools. |
| picard | 1.125 | 1.126 | Probably no big deal. |
| bwa | patched 0.7.8 | 0.7.8 | For the WGBS workflow we currently use a patched version of BWA that does not check for the "/1" and "/2" first- and second-read marks. This version is not available in BioConda, and thus the WGBS workflow won't work with the Conda environment. |
| R | 3.4.0 | 3.4.1 | Probably no big deal. |
| trimmomatic | 0.30 | 0.33 | |
Note further that the Conda environment is probably outdated, and packages may not be compatible with recent Conda versions or may even have been removed from the referenced channels. It might be possible to fix this by including the `bioconda-legacy` channel. Because of these and other problems of Conda that render this tool (alone) almost unusable for achieving reproducibility, we cannot provide you with much support for the environments.
## WGBS Data Processing and methylCtools
The current implementation of the WGBS workflow uses methylCtools and requires a patched BWA version (see the table above). Note that the methylCtools version shipped with this repository differs from the one in the official GitHub repository.
## Recompiling the D-based Components
Two programs in this repository, `genomeCoverage.d` and `coverageQc.d`, were written in the programming language D and are provided as binaries and as source code. If the need arises to recompile them, you can find the build instructions in `resources/analysisTools`. For the compilation you will need the following (a compilation sketch follows the list):

- the D compiler LDC 0.12.1
- BioD (master branch, commit 8b633de)
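A minimal sketch of such a recompilation, assuming LDC's `ldc2` driver is on the `PATH` and BioD has been checked out locally (paths and flags are illustrative; the authoritative instructions are in `resources/analysisTools`):

```bash
# Illustrative only: compile genomeCoverage.d against a local BioD checkout.
# /path/to/BioD is a placeholder for the BioD working copy (commit 8b633de).
ldc2 -O3 -release -I/path/to/BioD -of=genomeCoverage genomeCoverage.d
```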
## mbuffer
The program `mbuffer` is used as a more powerful alternative to `tee` and to buffer large amounts of data against temporary I/O slowdowns. Unfortunately, in particular in old versions of the workflow, some `mbuffer` calls are in chains of piped commands, and errors associated with `mbuffer` are not caught correctly. You may encounter the following situation:
Overall, the "alignAndPairSlim" or "mergeAndMarkDuplicatesSlim" job does not finish, but the running processes in the job are blocked (no I/O, no CPU). The actual alignment with `bwa` has finished without errors, and `bwa` and `flags_isizes_PEaberrations.pl` have finished without errors. Other processes are still running, in particular `genomeCoverage`, while `filter_readbins.pl` has ended with an error message saying that there was no input ("from 0 lines, kept 0 with selected chromosomes"). The problem may be that the `mbuffer` between the latter two processes did not even start to connect `genomeCoverage` and `filter_readbins.pl`, because it could not allocate memory due to a too large "blocksize". Newer versions of `mbuffer` choose this value dynamically.
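Schematically, the affected part of the pipe chain has the following shape (not the exact workflow command; the tool arguments are elided, and the buffer size is illustrative):

```bash
# Schematic of the affected pattern: mbuffer decouples the producer
# (genomeCoverage) from the consumer (filter_readbins.pl) in the pipe.
genomeCoverage ... | mbuffer -q -m 2G | filter_readbins.pl ...
```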
To fix this problem, configure `~/.mbuffer.rc` for the user that executes the workflow on the cluster:

```
blocksize = 4096
```