Example Pipeline - VUmcCGP/wisecondor GitHub Wiki
Pipeline - Example
At first I planned to release WISECONDOR with a little bash script that would function as an example pipeline. As this turned out a little bit more annoying and confusing for people than I anticipated I decided to leave it out, yet I forgot to remove a reference to this non-existing script in the readme. As some people have asked to release this example anyway I decided to just provide the steps we use in our pipeline on this page.
UPDATE: I released a little pipeline to fill this gap, you can find it here: WISECONDOR on autopilot: run.sh
We don't apply these steps separately, nor do we use a simple bash script to run them. All these steps are integrated in my pipeline system (RoStr, also on my github) but that is not really documented well enough for third party use. I suggest you turn these steps into a bash script that fits your system instead.
CAUTION: I wrote this page in a single day based on what I have made previously, it might still contain errors. If you find any, please let me know so I can fix it.
Variables used
First a little overview of all the variables used in this pipeline. I put down suggestions and example values, you should replace paths and options as you see fit. I doubt I could make a one pipeline fits all solution here.
Paths to binaries
SCRIPT_PYTHON=python
SCRIPT_BWA=bwa
SCRIPT_JAVA=java
SCRIPT_PT_SORTSAM=picardtools/SortSam.jar
SCRIPT_SAMTOOLS=samtools
Systems specific options to optimize BWA
ARG_BWA_THREADS=4 # Multithreading in BWA
Path to WISECONDOR
DIR_WC=./
SCRIPT_WC_COUNTGC=$DIR_WC/countgc.py
SCRIPT_WC_CONSAM=$DIR_WC/consam.py
SCRIPT_WC_GCC=$DIR_WC/gcc.py
SCRIPT_WC_NEWREF=$DIR_WC/newref.py
SCRIPT_WC_PLOT=$DIR_WC/plot.py
Reference files used and produced
FILE_REFERENCE=./hg19.fasta # As downloaded from your source
FILE_WC_GCCOUNT=./gccount.pickle # GC-Count per bin based on fasta reference
DIR_REFSAMPLES=./refsamples # Directory containing GC-Corrected sampels to use as reference data
FILE_WC_REFERENCE=./reference.pickle # Reference file based on refsamples
Sample specific variables
FILE_INPUT=$1 # Argument 1
FILE_OUTPUT=$2 # Argument 2
NAME_SAMPLE="NIPT" # Probably not needed name in the bam file
GC-Correction preparation
For any sample put through WISECONDOR, we apply a GC-Correction. To do this we first need to prepare a GC-count table based on the reference used for mapping reads in other steps:
$SCRIPT_PYTHON $SCRIPT_WC_COUNTGC \
$FILE_REFERENCE \
$FILE_WC_GCCOUNT
Sample preparation
This is how we prepare data using BWA:
We map our reads:
$SCRIPT_BWA \
aln \
-t $ARG_BWA_THREADS \
$FILE_REFERENCE \
$FILE_INPUT \
-f $FILE_OUTPUT.sai \
-n 0 \
-k 0
Followed by:
$SCRIPT_BWA \
samse \
-r "@RG\tID:idname\tLB:libname\tSM:$NAME_SAMPLE\tPL:ILLUMINA" \
$FILE_REFERENCE \
$FILE_OUTPUT.sai \
-n -1 \
$FILE_INPUT \
> $FILE_OUTPUT.sam
Then we sort the data using PicardTools:
$SCRIPT_JAVA \
-Xmx4g -Djava.io.tmpdir=/tmp \
-jar $SCRIPT_PT_SORTSAM \
SO=coordinate \
INPUT=$FILE_OUTPUT.sam \
OUTPUT=$FILE_OUTPUT.sort.bam \
VALIDATION_STRINGENCY=LENIENT \
TMP_DIR=$DIR_OUTPUT \
CREATE_INDEX=true
At this point we have a sorted SAM formatted file that is almost ready to be put into WISECONDOR.
Conversion
We just need to remove duplicate reads and send the reads that are left into the conversion script:
$SCRIPT_SAMTOOLS rmdup -s $FILE_OUTPUT.sort.bam - | \
$SCRIPT_SAMTOOLS view - -q 1 | \
$SCRIPT_PYTHON $SCRIPT_WC_CONSAM \
$FILE_OUTPUT.pickle`
The pickle file contains a python dictionary with a series of values per chromosome. Every value is simply the amount of reads found in that bin, for that chromosome, ordered in sequence of the bins occurrence (i.e. bin 0 contains reads found in bp 0m to 1m, bin 1 the amount of reads in 1m to 2m, and so on).
GC-Correction
This information is basically what could be put into WISECONDOR right away but it turned out some additional pre-processing gets rid of some of the worst biases in the data, so we apply a simple GC-correction first:
$SCRIPT_PYTHON $SCRIPT_WC_GCC \
$FILE_OUTPUT.pickle \
$REF_GCCOUNT \
$FILE_OUTPUT.gcc \
>> $FILE_OUTPUT.log
Now we have GC-corrected information per sample. We can either use it to train WISECONDOR and create a reference, or to test the sample at hand using a previously created reference.
Creating a reference
Without training on data first, WISECONDOR cannot run. If you already have created a reference, you can skip this step. You only need to apply it the first time, or to update your reference.
$SCRIPT_PYTHON $SCRIPT_WC_NEWREF \
$DIR_REFSAMPLES \
$FILE_WC_REFERENCE
Testing a sample
If we want to test a sample using a previously created reference, appoint the test script to the target samples gcc file and reference file to use. It will provide textual information which is saved in the log file in this example, as well as a file used to plot the analysis later on.
$SCRIPT_PYTHON $SCRIPT_WC_TEST \
$FILE_OUTPUT.gcc \
$FILE_WC_REFERENCE \
$FILE_OUTPUT.plot \
>> $FILE_OUTPUT.log
To plot these test results:
$SCRIPT_PYTHON $SCRIPT_WC_PLOT \
$FILE_OUTPUT.plot \
$FILE_OUTPUT \
>> $FILE_OUTPUT.log