Example Pipeline - VUmcCGP/wisecondor GitHub Wiki

Pipeline - Example

At first I planned to release WISECONDOR with a little bash script that would function as an example pipeline. As this turned out a little bit more annoying and confusing for people than I anticipated I decided to leave it out, yet I forgot to remove a reference to this non-existing script in the readme. As some people have asked to release this example anyway I decided to just provide the steps we use in our pipeline on this page.

UPDATE: I released a little pipeline to fill this gap, you can find it here: WISECONDOR on autopilot: run.sh

We don't apply these steps separately, nor do we use a simple bash script to run them. All these steps are integrated in my pipeline system (RoStr, also on my github) but that is not really documented well enough for third party use. I suggest you turn these steps into a bash script that fits your system instead.

CAUTION: I wrote this page in a single day based on what I have made previously, it might still contain errors. If you find any, please let me know so I can fix it.

Variables used

First a little overview of all the variables used in this pipeline. I put down suggestions and example values, you should replace paths and options as you see fit. I doubt I could make a one pipeline fits all solution here.

Paths to binaries

SCRIPT_PYTHON=python
SCRIPT_BWA=bwa
SCRIPT_JAVA=java
SCRIPT_PT_SORTSAM=picardtools/SortSam.jar
SCRIPT_SAMTOOLS=samtools

Systems specific options to optimize BWA

ARG_BWA_THREADS=4 # Multithreading in BWA

Path to WISECONDOR

DIR_WC=./
SCRIPT_WC_COUNTGC=$DIR_WC/countgc.py
SCRIPT_WC_CONSAM=$DIR_WC/consam.py
SCRIPT_WC_GCC=$DIR_WC/gcc.py
SCRIPT_WC_NEWREF=$DIR_WC/newref.py
SCRIPT_WC_PLOT=$DIR_WC/plot.py

Reference files used and produced

FILE_REFERENCE=./hg19.fasta # As downloaded from your source
FILE_WC_GCCOUNT=./gccount.pickle # GC-Count per bin based on fasta reference

DIR_REFSAMPLES=./refsamples # Directory containing GC-Corrected sampels to use as reference data
FILE_WC_REFERENCE=./reference.pickle # Reference file based on refsamples

Sample specific variables

FILE_INPUT=$1 # Argument 1
FILE_OUTPUT=$2 # Argument 2
NAME_SAMPLE="NIPT" # Probably not needed name in the bam file

GC-Correction preparation

For any sample put through WISECONDOR, we apply a GC-Correction. To do this we first need to prepare a GC-count table based on the reference used for mapping reads in other steps:

$SCRIPT_PYTHON $SCRIPT_WC_COUNTGC \
	$FILE_REFERENCE \
	$FILE_WC_GCCOUNT

Sample preparation

This is how we prepare data using BWA:

We map our reads:

$SCRIPT_BWA \
	aln \
	-t $ARG_BWA_THREADS \
	$FILE_REFERENCE \
	$FILE_INPUT \
	-f $FILE_OUTPUT.sai \
	-n 0 \
	-k 0

Followed by:

$SCRIPT_BWA \
	samse \
	-r "@RG\tID:idname\tLB:libname\tSM:$NAME_SAMPLE\tPL:ILLUMINA" \
	$FILE_REFERENCE \
	$FILE_OUTPUT.sai \
	-n -1 \
	$FILE_INPUT \
	> $FILE_OUTPUT.sam

Then we sort the data using PicardTools:

$SCRIPT_JAVA \
	-Xmx4g -Djava.io.tmpdir=/tmp \
	-jar $SCRIPT_PT_SORTSAM \
	SO=coordinate \
	INPUT=$FILE_OUTPUT.sam \
	OUTPUT=$FILE_OUTPUT.sort.bam \
	VALIDATION_STRINGENCY=LENIENT \
	TMP_DIR=$DIR_OUTPUT \
	CREATE_INDEX=true

At this point we have a sorted SAM formatted file that is almost ready to be put into WISECONDOR.

Conversion

We just need to remove duplicate reads and send the reads that are left into the conversion script:

$SCRIPT_SAMTOOLS rmdup -s $FILE_OUTPUT.sort.bam - | \
	$SCRIPT_SAMTOOLS view - -q 1 | \
	$SCRIPT_PYTHON $SCRIPT_WC_CONSAM \
		$FILE_OUTPUT.pickle`

The pickle file contains a python dictionary with a series of values per chromosome. Every value is simply the amount of reads found in that bin, for that chromosome, ordered in sequence of the bins occurrence (i.e. bin 0 contains reads found in bp 0m to 1m, bin 1 the amount of reads in 1m to 2m, and so on).

GC-Correction

This information is basically what could be put into WISECONDOR right away but it turned out some additional pre-processing gets rid of some of the worst biases in the data, so we apply a simple GC-correction first:

$SCRIPT_PYTHON $SCRIPT_WC_GCC \
	$FILE_OUTPUT.pickle \
	$REF_GCCOUNT \
	$FILE_OUTPUT.gcc \
	>> $FILE_OUTPUT.log

Now we have GC-corrected information per sample. We can either use it to train WISECONDOR and create a reference, or to test the sample at hand using a previously created reference.

Creating a reference

Without training on data first, WISECONDOR cannot run. If you already have created a reference, you can skip this step. You only need to apply it the first time, or to update your reference.

$SCRIPT_PYTHON $SCRIPT_WC_NEWREF \
	$DIR_REFSAMPLES \
	$FILE_WC_REFERENCE

Testing a sample

If we want to test a sample using a previously created reference, appoint the test script to the target samples gcc file and reference file to use. It will provide textual information which is saved in the log file in this example, as well as a file used to plot the analysis later on.

$SCRIPT_PYTHON $SCRIPT_WC_TEST \
	$FILE_OUTPUT.gcc \
	$FILE_WC_REFERENCE \
	$FILE_OUTPUT.plot \
	>> $FILE_OUTPUT.log

To plot these test results:

$SCRIPT_PYTHON $SCRIPT_WC_PLOT \
	$FILE_OUTPUT.plot \
	$FILE_OUTPUT \
	>> $FILE_OUTPUT.log