Whole genome simulation HG002 experiment - vgteam/toil-vg GitHub Wiki
Setup
The commands below will rely on the following being set to specify output locations, AWS user info, etc.
# Set your aws region
export TOIL_AWS_ZONE="us-west-2a"
# This will be used below for the Toil Jobstore. "jobstore" can be replaced with any name
export TOIL_JOBSTORE="aws:us-west-2:jobstore"
# All the output will be put in this S3 bucket. "outstore" can be replaced with any name
export TOIL_OUTSTORE="outstore"
# This must be a valid AWS keypair, with keys appropriately set up on your computer
export KEYPAIR_NAME=my_keypair_name
# This will be the name of the cluster leader node used, it can be any name
export LEADER=leader
# This is a Toil version that seems to work with toil-vg
export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.16.0a1.dev2281-c7d77b028064a739e897f7b1eb158c902b530475
Create a leader node (with current toil-vg master)
scripts/create-ec2-leader.sh $LEADER $KEYPAIR_NAME
Construction
Make a pair of HG002 haplotype threads (for chromosomes 1-22) to simulate from
scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --sample HG002 --haplo_graph --xg --chroms $(for i in $(seq 1 22); do echo $i; done) --out_name giab
Make a primary reference and pangenome graph to test on (can be done in parallel to above on different leader/jobstore)
scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --xg --gcsa --out_name snp1kg --primary
Optionally make an HG002 sample graph as positive control (can be done in parallel to above on different leader/jobstore)
scripts/construct-hs37d5-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE --leader $LEADER --xg --gcsa --sample HG002 --sample_graph --out_name giab --chroms $(for i in $(seq 1 22); do echo $i; done)
Simulation
Simulate 10M read pairs from HG002
scripts/sim-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE s3://${TOIL_OUTSTORE}/giab_HG002_haplo 10000000 --leader $LEADER
Mapping Evaluation
Make ROC-plots using the graphs and simulated reads
scripts/mapeval-ec2.py $TOIL_JOBSTORE $TOIL_OUTSTORE s3://{TOIL_OUTSTORE}/giab_HG002_haplo_sim_10.0M_trained-p570-v65-i0.002-I.fq.gz --leader $LEADER --outname mapeval --fasta ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz --index_bases s3://${TOIL_OUTSTORE}/snp1kg s3://${TOIL_OUTSTORE}/primary --names snp1kg primary