Whole genome simulation HG002 experiment - vgteam/toil-vg GitHub Wiki

Setup

The commands below will rely on the following being set to specify output locations, AWS user info, etc.

# Set your aws region
export TOIL_AWS_ZONE="us-west-2a"
# This will be used below for the Toil Jobstore.  "jobstore" can be replaced with any name
export TOIL_JOBSTORE="aws:us-west-2:jobstore"
# All the output will be put in this S3 bucket.  "outstore" can be replaced with any name
export TOIL_OUTSTORE="outstore"
# This must be a valid AWS keypair, with keys appropriately set up on your computer
export KEYPAIR_NAME=my_keypair_name
# This will be the name of the cluster leader node used, it can be any name
export LEADER=leader
# This is a Toil version that seems to work with toil-vg
export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.16.0a1.dev2281-c7d77b028064a739e897f7b1eb158c902b530475

Create a leader node (with current toil-vg master)

scripts/create-ec2-leader.sh $LEADER $KEYPAIR_NAME

Construction

Make a pair of HG002 haplotype threads (for chromosomes 1-22) to simulate from

scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --sample HG002 --haplo_graph --xg --chroms $(for i in $(seq 1 22); do echo $i; done) --out_name giab

Make a primary reference and pangenome graph to test on (can be done in parallel to above on different leader/jobstore)

scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --xg --gcsa --out_name snp1kg --primary

Optionally make an HG002 sample graph as positive control (can be done in parallel to above on different leader/jobstore)

scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --xg --gcsa --sample HG002 --sample_graph --out_name giab --chroms $(for i in $(seq 1 22); do echo $i; done)

Simulation

Simulate 10M read pairs from HG002

scripts/sim-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE s3://${TOIL_OUTSTORE}/giab_HG002_haplo 10000000 --leader $LEADER

Mapping Evaluation

Make ROC-plots using the graphs and simulated reads

scripts/mapeval-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE s3://{TOIL_OUTSTORE}/giab_HG002_haplo_sim_10.0M_trained-p570-v65-i0.002-I.fq.gz --leader $LEADER --outname mapeval --fasta ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz --index_bases s3://${TOIL_OUTSTORE}/snp1kg s3://${TOIL_OUTSTORE}/primary --names snp1kg primary