GridssPurpleLinx - umccr/aws_parallel_cluster GitHub Wiki
An AWS parallel cluster walkthrough
- Overview
- Sources:
- Further reading
- Starting the cluster
- Create the logs directory
- Downloading the refdata
- Downloading the input data
- Initialising the input jsons
- Downloading the workflow from GitHub
- Running the workflow through toil
- Uploading outputs to s3
This is a rather complex example of running a CWL workflow through AWS ParallelCluster.
The following code will:
- Set up a parallel cluster through cloud formation.
- Download the reference data and input data through sbatch commands.
- These jobs run on compute nodes; make sure they have completed before running the CWL workflow.
- Download and configure the input-json for the CWL workflow
- Launch the CWL workflow through toil
- Patiently wait for the job to finish.
- Upload the data to your preferred S3 bucket.
The following workflow is based on the gridss-purple-linx workflow from this repo.
Much of the reference data is downloaded from the Hartwig nextcloud.
This assumes SSO login has been completed and the pcluster conda environment is activated.
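If you haven't done this yet, a minimal sketch (assuming an AWS SSO profile named dev; substitute your own profile name):
# Log in via SSO and make the profile the default for this shell
aws sso login --profile dev
export AWS_PROFILE=dev
# Activate the parallel-cluster helper environment
conda activate pcluster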
start_cluster.py \
--cluster-name "${USER}-Gridss-Purple-Linx" \
--file-system-type fsx
ssm i-XXXX
By running through screen, we are able to pop in-and-out of this workflow
without losing our environment variables that we collect along the way.
Please refer to the screen docs for more information.
screen -S "gridss-purple-linx-workflow"
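Detach from the session at any time with Ctrl-A then d; re-attach later with:
screen -r "gridss-purple-linx-workflow"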
LOG_DIR="${HOME}/logs"
mkdir -p "${LOG_DIR}"
Download the refdata from the Hartwig nextcloud along with our hg38 ref set. I would recommend mirroring this to your own S3 bucket if you plan to use it more than once; it will make downloading a lot faster and more reliable.
REF_DIR="${SHARED_DIR}/reference-data"
mkdir -p "${REF_DIR}"
These downloads are subject to timeouts, so please check that they complete correctly. If they have, there should not be any zip files left under ${REF_DIR}/hartwig-nextcloud.
You may also need to create the BWA index and GRIDSS reference-cache files in the same directory as the fasta reference.
REF_DIR_HARTWIG="${REF_DIR}/hartwig-nextcloud"
mkdir -p "${REF_DIR_HARTWIG}"
If you do NOT have access to s3://umccr-refdata-dev/gridss-purple-linx
you will need to download
the gzipped tarballs from the hartwig nextcloud repository.
The nextcloud repository does NOT contain the hg38_alt reference data.
s3_refdata_path="s3://umccr-refdata-dev/gridss-purple-linx"
sbatch --job-name="gridss-purple-linx-refdata-download" \
--output "${LOG_DIR}/s3_ref_data_download.%j.log" \
--error "${LOG_DIR}/s3_ref_data_download.%j.log" \
--wrap "aws s3 sync \"${s3_refdata_path}\" \
\"${REF_DIR_HARTWIG}/\""
This step can be skipped if you have downloaded the data from the s3 bucket listed above.
HG37 BWA
hg37_fasta_path="${SHARED_DIR}/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta"
sbatch --job-name "build-bwa-index" \
--output "${LOG_DIR}/build-bwa-index.%j.log" \
--error "${LOG_DIR}/build-bwa-index.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--user \"$(id -u):$(id -g)\" \
--volume \"$(dirname "${hg37_fasta_path}"):/ref-data\" \
\"quay.io/biocontainers/bwa:0.7.17--hed695b0_6\" \
bwa index \"/ref-data/$(basename "${hg37_fasta_path}")\""
HG38 BWA
hg38_fasta_path="${SHARED_DIR}/reference-data/hartwig-nextcloud/hg38/refgenomes/Homo_sapiens.GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
sbatch --job-name "build-bwa-index" \
--output "${LOG_DIR}/build-bwa-index.%j.log" \
--error "${LOG_DIR}/build-bwa-index.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--volume \"$(dirname "${hg38_fasta_path}"):/ref-data\" \
--user \"$(id -u):$(id -g)\" \
\"quay.io/biocontainers/bwa:0.7.17--hed695b0_6\" \
bwa index \"/ref-data/$(basename "${hg38_fasta_path}")\""
HG37 Gridss Cache
sbatch --job-name "build-gridss-reference-cache" \
--output "${LOG_DIR}/build-gridss-cache.%j.log" \
--error "${LOG_DIR}/build-gridss-cache.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--user \"$(id -u):$(id -g)\" \
--volume \"$(dirname "${hg37_fasta_path}"):/ref-data\" \
\"quay.io/biocontainers/gridss:2.9.4--0\" \
java -Xmx4g \
-Dsamjdk.reference_fasta=\"/ref-data/$(basename "${hg37_fasta_path}")\" \
-Dsamjdk.use_async_io_read_samtools=true \
-Dsamjdk.use_async_io_write_samtools=true \
-Dsamjdk.use_async_io_write_tribble=true \
-Dsamjdk.buffer_size=4194304 \
-Dsamjdk.async_io_read_threads=8 \
-cp \"/usr/local/share/gridss-2.9.4-0/gridss.jar\" \
\"gridss.PrepareReference\" \
REFERENCE_SEQUENCE=\"/ref-data/$(basename "${hg37_fasta_path}")\""
HG38 Gridss Cache
sbatch --job-name "build-gridss-reference-cache" \
--output "${LOG_DIR}/build-gridss-cache.%j.log" \
--error "${LOG_DIR}/build-gridss-cache.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--user \"$(id -u):$(id -g)\" \
--volume \"$(dirname "${hg38_fasta_path}"):/ref-data\" \
\"quay.io/biocontainers/gridss:2.9.4--0\" \
java -Xmx4g \
-Dsamjdk.reference_fasta=\"/ref-data/$(basename "${hg38_fasta_path}")\" \
-Dsamjdk.use_async_io_read_samtools=true \
-Dsamjdk.use_async_io_write_samtools=true \
-Dsamjdk.use_async_io_write_tribble=true \
-Dsamjdk.buffer_size=4194304 \
-Dsamjdk.async_io_read_threads=8 \
-cp \"/usr/local/share/gridss-2.9.4-0/gridss.jar\" \
\"gridss.PrepareReference\" \
REFERENCE_SEQUENCE=\"/ref-data/$(basename "${hg38_fasta_path}")\""
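Once the two cache jobs complete, gridss.PrepareReference should have left a .img reference cache next to each fasta; this is the file the reference_cache_gridss input below points at:
ls -lh "${hg37_fasta_path}.img" "${hg38_fasta_path}.img"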
INPUT_DIR="${SHARED_DIR}/input-data"
mkdir -p "${INPUT_DIR}"
First we use the ica command on our local machine to generate a presigned URL for each input file, and then use ssm_run to submit the corresponding download job remotely. If you do not have ssm_run (a wrapper around aws ssm send-command), you may wish to copy each generated download command to the clipboard and launch it from a shell on the master node instead.
INPUT_DIR_ICA="${INPUT_DIR}/ica/SBJ_seqcii_020/"
mkdir -p "${INPUT_DIR_ICA}"
The jq incantation below is borrowed from this stack overflow thread.
Submit a download job for each file from your local terminal:
# These variables need to be defined on your local terminal
instance_id="<instance_id>"
gds_input_data_path="gds://umccr-primary-data-dev/PD/SEQCII/hg38/SBJ_seqcii_020/"
ica_input_path="\${SHARED_DIR}/input-data/ica/SBJ_seqcii_020"
# Get name and presigned url for each file in the inputs
ica_files_list_with_access="$(ica files list "${gds_input_data_path}" \
--output-format json \
--max-items=0 \
--nonrecursive \
--with-access | {
# Capture outputs with jq
# Output is per line 'name,presigned_url'
jq --raw-output \
'.items | keys[] as $k | "\(.[$k] | .name),\(.[$k] | .presignedUrl)"'
})"
# Submit job remotely using the presigned url
while read p; do
# File name
name="$(echo "${p}" | cut -d',' -f1)"
# The presigned url name
presigned_url="$(echo "${p}" | cut -d',' -f2)"
# Submit the sbatch command via ssm_run
echo "sbatch --job-name=\"${name}\" \
--output \"logs/wget-${name}.%j.log\" \
--error \"logs/wget-${name}.%j.log\" \
--wrap \"wget \\\"${presigned_url}\\\" --output-document \\\"${ica_input_path}/${name}\\\"\"" | \
ssm_run --instance-id "${instance_id}"
done <<< "${ica_files_list_with_access}"
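Back on the master node (in your screen session), you can watch the wget jobs and confirm the inputs land where expected:
squeue -u "${USER}"
ls -lh "${INPUT_DIR_ICA}"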
Run the git checkout steps inside a sub-shell, so the cd does not affect the main shell.
gridss_purple_linx_remote_url="https://github.com/hartwigmedical/gridss-purple-linx"
INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR="${INPUT_DIR}/gridss-purple-linx"
WORKFLOW_VERSION="v1.3.2"
mkdir -p "${INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR}"
(
cd "${INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR}" && \
git init && \
git sparse-checkout init --cone && \
git sparse-checkout set smoke_test/ && \
git remote add -f origin "${gridss_purple_linx_remote_url}" && \
git checkout "${WORKFLOW_VERSION}"
)
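If the sparse checkout worked, the smoke_test/ directory should now be present in the checkout:
ls "${INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR}/smoke_test"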
The input json for the CWL workflow smoke-test will look something like this.
If using vim, you may wish to :set paste before inserting, so that auto-indentation does not mangle the pasted lines.
Write the following file to gridss-purple-linx.packed.input-smoketest.json in your home directory on the EC2 master instance.
{
"sample_name": "CPCT12345678",
"normal_sample": "CPCT12345678R",
"tumor_sample": "CPCT12345678T",
"tumor_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/gridss-purple-linx/smoke_test/CPCT12345678T.bam"
},
"normal_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/gridss-purple-linx/smoke_test/CPCT12345678R.bam"
},
"snvvcf": {
"class": "File",
"location": "__SHARED_DIR__/input-data/gridss-purple-linx/smoke_test/CPCT12345678T.somatic_caller_post_processed.vcf.gz"
},
"fasta_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta"
},
"fasta_reference_version": "37",
"bwa_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta.bwt"
},
"reference_cache_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta.img"
},
"human_virus_reference_fasta": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/human_virus/human_virus.fa"
},
"gc_profile": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gc/GC_profile.1000bp.cnp"
},
"blacklist_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/ENCFF001TDO.bed"
},
"breakend_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/pon3792v1/gridss_pon_single_breakend.bed"
},
"breakpoint_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/pon3792v1/gridss_pon_breakpoint.bedpe"
},
"breakpoint_hotspot": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/KnownFusionPairs.bedpe"
},
"bafsnps_amber": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/germline_het_pon/GermlineHetPon.vcf.gz"
},
"hotspots_purple": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/KnownHotspots.vcf.gz"
},
"known_fusion_data_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/known_fusion_data.csv"
},
"gene_transcripts_dir": {
"class": "Directory",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/ensembl_data_cache/"
},
"viral_hosts_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/viral_host_ref.csv"
},
"replication_origins_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/heli_rep_origins.bed"
},
"line_element_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/line_elements.csv"
},
"fragile_site_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/fragile_sites_hmf.csv"
},
"check_fusions_linx": true,
"driver_gene_panel": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/DriverGenePanel.tsv"
},
"configuration_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/gridss.properties"
}
}
Write the following file to gridss-purple-linx.packed.input-SBJ_seqcii_020.json.
Note that in our use case we've moved our reference data from
/reference-data/hartwig-nextcloud/hg38/
to /reference-data/hartwig-nextcloud/hg38_alt/,
because our input bams were aligned against the alt reference. All complementary files remain the same.
{
"sample_name": "SBJ_seqcii_020",
"normal_sample": "seqcii_N020",
"tumor_sample": "seqcii_T020",
"tumor_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/ica/SBJ_seqcii_020/SBJ_seqcii_020_tumor.bam"
},
"normal_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/ica/SBJ_seqcii_020/SBJ_seqcii_020.bam"
},
"snvvcf": {
"class": "File",
"location": "__SHARED_DIR__/input-data/ica/SBJ_seqcii_020/SBJ_seqcii_020.vcf.gz"
},
"fasta_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa"
},
"fasta_reference_version": "38",
"bwa_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa.bwt"
},
"reference_cache_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa.img"
},
"human_virus_reference_fasta": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/human_virus/human_virus.fa"
},
"gc_profile": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gc/GC_profile.1000bp.cnp"
},
"blacklist_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss/ENCFF001TDO.bed"
},
"breakend_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss_pon/gridss_pon_single_breakend.bed"
},
"breakpoint_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss_pon/gridss_pon_breakpoint.bedpe"
},
"breakpoint_hotspot": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/KnownFusionPairs.bedpe"
},
"bafsnps_amber": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/germline_het_pon/GermlineHetPon.vcf.gz"
},
"known_fusion_data_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/known_fusion_data.csv"
},
"hotspots_purple": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/KnownHotspots.vcf.gz"
},
"gene_transcripts_dir": {
"class": "Directory",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/ensembl_data_cache"
},
"viral_hosts_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/viral_host_ref.csv"
},
"replication_origins_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/heli_rep_origins.bed"
},
"line_element_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/line_elements.csv"
},
"fragile_site_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/fragile_sites_hmf.csv"
},
"check_fusions_linx": true,
"driver_gene_panel": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/DriverGenePanel.tsv"
},
"configuration_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss/gridss.properties"
}
}
Replace the __SHARED_DIR__ with the actual absolute path
sed -i "s%__SHARED_DIR__%${SHARED_DIR}%g" gridss-purple-linx.packed.input-smoketest.json
sed -i "s%__SHARED_DIR__%${SHARED_DIR}%g" gridss-purple-linx.packed.input-SBJ_seqcii_020.json
git clone https://github.com/umccr/gridss-purple-linx
# The branch 'cwl-workflow' contains the workflow
( \
cd gridss-purple-linx && \
git checkout cwl-workflow \
)
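The packed CWL workflow referenced below should now be present in the checkout:
ls gridss-purple-linx/cwl/workflows/gridss-purple-linx/latest/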
TOIL_ROOT="${SHARED_DIR}/toil"
mkdir -p "${TOIL_ROOT}"
# Set globals
TOIL_JOB_STORE="${TOIL_ROOT}/job-store"
TOIL_WORKDIR="${TOIL_ROOT}/workdir"
TOIL_TMPDIR="${TOIL_ROOT}/tmpdir"
TOIL_LOG_DIR="${TOIL_ROOT}/logs"
TOIL_OUTPUTS="${TOIL_ROOT}/outputs"
# Create directories
mkdir -p "${TOIL_JOB_STORE}"
mkdir -p "${TOIL_WORKDIR}"
mkdir -p "${TOIL_TMPDIR}"
mkdir -p "${TOIL_LOG_DIR}"
mkdir -p "${TOIL_OUTPUTS}"
# Activate environment
conda activate toil
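It is worth confirming the runner is available in this environment before submitting:
which toil-cwl-runner
toil-cwl-runner --version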
You may need to index the vcf file first, since toil does not yet fully support CWL v1.1 (see this cwltool bug).
vcf_file="${SHARED_DIR}/input-data/gridss-purple-linx/smoke_test/CPCT12345678T.somatic_caller_post_processed.vcf.gz"
sbatch --job-name "index-smoked-vcf-file" \
--output "${LOG_DIR}/index-smoked-vcf-file.%j.log" \
--error "${LOG_DIR}/index-smoked-vcf-file.%j.log" \
--wrap "docker run \
--volume \"$(dirname "${vcf_file}"):/data\" \
--user \"$(id -u):$(id -g)\" \
\"quay.io/biocontainers/tabix:0.2.6--ha92aebf_0\" \
tabix -p vcf \"/data/$(basename "${vcf_file}")\""
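When the job finishes, a .tbi index should sit alongside the vcf:
ls -lh "${vcf_file}.tbi"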
cleanworkdirtype="onSuccess" # Switch to 'never' for further debugging.
gridss_workflow_path="gridss-purple-linx/cwl/workflows/gridss-purple-linx/latest/gridss-purple-linx.latest.cwl"
gridss_purple_input_json="gridss-purple-linx.packed.input-smoketest.json"
partition="copy-long" # Long partition for workflows
sbatch --job-name "toil-gridss-purple-linx-runner" \
--output "${LOG_DIR}/toil.%j.log" \
--error "${LOG_DIR}/toil.%j.log" \
--partition "${partition}" \
--no-requeue \
--wrap "toil-cwl-runner \
--jobStore \"${TOIL_JOB_STORE}/job-\${SLURM_JOB_ID}\" \
--workDir \"${TOIL_WORKDIR}\" \
--outdir \"${TOIL_OUTPUTS}\" \
--batchSystem slurm \
--disableCaching true \
--cleanWorkDir \"${cleanworkdirtype}\" \
\"${gridss_workflow_path}\" \
\"${gridss_purple_input_json}\""
This job will likely fail because the gripss hard-filtering is too strict, until this GitHub issue is fixed.
cleanworkdirtype="onSuccess" # Switch to 'never' for further debugging.
gridss_workflow_path="gridss-purple-linx/cwl/workflows/gridss-purple-linx/latest/gridss-purple-linx.latest.cwl"
gridss_purple_input_json="gridss-purple-linx.packed.input-SBJ_seqcii_020.json"
partition="copy-long" # Long partition for workflows
sbatch --job-name "toil-gridss-purple-linx-runner" \
--output "${LOG_DIR}/toil.%j.log" \
--error "${LOG_DIR}/toil.%j.log" \
--partition "${partition}" \
--no-requeue \
--wrap "toil-cwl-runner \
--jobStore \"${TOIL_JOB_STORE}/job-\${SLURM_JOB_ID}\" \
--workDir \"${TOIL_WORKDIR}\" \
--outdir \"${TOIL_OUTPUTS}\" \
--batchSystem slurm \
--disableCaching true \
--cleanWorkDir \"${cleanworkdirtype}\" \
\"${gridss_workflow_path}\" \
\"${gridss_purple_input_json}\""
This workflow may take a full day to complete!
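Once it has finished, sacct gives a quick summary of elapsed time and final state (again, <job_id> is the id reported by sbatch):
sacct -j <job_id> --format=JobID,JobName%30,Elapsed,State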
By default, you do NOT have permissions to upload to S3.
You can generate temporary credentials using yawsso or aws2-wrap from your local device and then use ssm_run
to upload the data to your S3 bucket.
See Uploading data back to s3 for more information.
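With temporary credentials exported on the master node, the upload itself is just an s3 sync of the toil output directory. A minimal sketch, with a hypothetical destination bucket you should replace with your own:
aws s3 sync "${TOIL_OUTPUTS}/" "s3://<your-results-bucket>/gridss-purple-linx/"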