Pisces Paper Analysis - Illumina/Pisces GitHub Wiki
This page describes how to reproduce the Pisces results from "Pisces: An Accurate and Versatile Variant Caller for Somatic and Germline Next-Generation Sequencing Data". The instructions use Docker, but they can be adapted to a local install. The host machine needs about 1 TB of free disk space. (To skip Docker and run locally, use the analysis script run_analysis.sh as a guide and change the directories as needed.)
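Since the download is large, it can help to confirm the free space before starting. The following is a minimal sketch, assuming a POSIX `df` and a bash-like shell; the function name `has_free_space` and its arguments are hypothetical and not part of the Pisces tooling.

```shell
# has_free_space DIR MIN_GIB
# Succeeds if the filesystem holding DIR has at least MIN_GIB GiB available.
# Hypothetical helper, not part of Pisces or Hap.py.
has_free_space() {
    local dir="$1" min_gib="$2"
    local avail_kb avail_gib
    # df -P gives stable POSIX columns; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ -z "$avail_kb" ]; then
        echo "could not read free space for $dir" >&2
        return 2
    fi
    avail_gib=$(( avail_kb / 1024 / 1024 ))
    echo "available: ${avail_gib} GiB, wanted: ${min_gib} GiB"
    [ "$avail_gib" -ge "$min_gib" ]
}
```

For example, `has_free_space [YourHostDataDir] 1000` checks for roughly 1 TB; the exact threshold is a judgment call.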
1) Download the data
Download the required data (genome, BAM files, and truth data) to the host machine and arrange it in the following directory structure. Unzip all .gz files.
[YourHostDataDir]/
- AppResults/
  The contents of the AppResults folder from the BaseSpace project Pisces_Supplementary_Data_v1.0.1 go here. This includes the BAM and BED files needed.
- genome/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/
  The genome.fa and GenomeSize.xml from the hg19 build of the human genome go here.
- platinum/
  The Platinum Genomes truth data for NA12877 and NA12878, downloaded from the following links:
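Once the data is in place, the layout described above can be sanity-checked before building anything. This is a sketch only: the function name `check_layout` is hypothetical, and it checks just the paths named in this section (it does not know which files belong inside AppResults/ or platinum/).

```shell
# check_layout DATA_DIR
# Reports paths from the layout above that are missing under DATA_DIR,
# and warns if any .gz files are still present (they should be unzipped).
# Hypothetical helper, not part of Pisces or Hap.py.
check_layout() {
    local data_dir="$1"
    local genome_dir="genome/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta"
    local failed=0
    local path
    for path in \
        "AppResults" \
        "$genome_dir/genome.fa" \
        "$genome_dir/GenomeSize.xml" \
        "platinum"; do
        if [ ! -e "$data_dir/$path" ]; then
            echo "missing: $data_dir/$path"
            failed=1
        fi
    done
    # Look for at least one leftover compressed file anywhere below DATA_DIR.
    if [ -d "$data_dir" ] && [ -n "$(find "$data_dir" -name '*.gz' | head -n 1)" ]; then
        echo "still compressed: .gz files found under $data_dir"
        failed=1
    fi
    return "$failed"
}
```

Calling `check_layout [YourHostDataDir]` prints nothing and returns success when every listed path exists and no .gz files remain.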
2) Install Docker and build the Docker image with both Pisces and Hap.py
(a) Install Docker, if you do not have it.
(b) Download Hap.py v0.3.10
(c) Replace the Hap.py Dockerfile with the one located here: Docker image to rerun Pisces Paper analysis. This new Dockerfile fixes some issues with the standard Hap.py install and installs Pisces on top of Hap.py. Copy run_analysis.sh into the same folder.
(d) Build the Docker image:
docker build -t pisces .
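Because `docker build -t pisces .` reads the Dockerfile from the current folder, and step (c) places run_analysis.sh next to it, a quick pre-build check can catch a missing file early. A minimal sketch follows; the function name `check_build_context` is hypothetical, and it assumes the replacement file keeps the default name `Dockerfile`.

```shell
# check_build_context [DIR]
# Succeeds if DIR (default: current folder) holds both files the build needs.
# Hypothetical helper, not part of Pisces or Hap.py.
check_build_context() {
    local dir="${1:-.}"
    local failed=0
    local f
    for f in Dockerfile run_analysis.sh; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing from build folder: $f"
            failed=1
        fi
    done
    return "$failed"
}
```

It could be chained with the build, for example `check_build_context . && docker build -t pisces .`.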
3) Run Docker with the host data folder mounted at /data
docker run -it --mount type=bind,source=[YourHostDataDir]/,target=/data pisces /bin/bash
Running the Docker image sets up the analysis software and copies over the analysis script run_analysis.sh.
4) Run the analysis
Inside the Docker container, run the run_analysis.sh script. It generates the Pisces VCFs and the Hap.py / Som.py .csv files that give the precision and recall for each sample. The numbers reported in the paper are the precision and recall averaged over all samples in a given dataset.
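The averaging step can be sketched with awk. This is an illustration under stated assumptions, not the script used for the paper: the function name `average_metric` is hypothetical, and the metric column, row-selector column, and row value are passed as arguments because the exact headers must be read from the generated .csv files. It assumes plain comma-separated files with a header row and no quoted commas.

```shell
# average_metric METRIC_COL KEY_COL KEY_VALUE FILE...
# Prints the mean of METRIC_COL over all rows, across all FILEs, where
# KEY_COL equals KEY_VALUE. Columns are located by header name per file.
# Hypothetical helper; column names are supplied by the caller.
average_metric() {
    local metric="$1" key_col="$2" key_val="$3"
    shift 3
    awk -F',' -v metric="$metric" -v key_col="$key_col" -v key_val="$key_val" '
        FNR == 1 {
            # Header row of each file: find the metric and selector columns.
            m = 0; k = 0
            for (i = 1; i <= NF; i++) {
                if ($i == metric)  m = i
                if ($i == key_col) k = i
            }
            next
        }
        m > 0 && k > 0 && $k == key_val && $m != "" {
            sum += $m
            n++
        }
        END {
            if (n == 0) { print "no matching rows" > "/dev/stderr"; exit 1 }
            printf "%.4f\n", sum / n
        }
    ' "$@"
}
```

With one .csv per sample, a call such as `average_metric recall type SNVs sample1.csv sample2.csv` would give the dataset-level recall; the names `recall`, `type`, and `SNVs` here are placeholders to be replaced with the headers and row labels actually present in the output.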