GridssPurpleLinx - umccr/aws_parallel_cluster GitHub Wiki
An AWS parallel cluster walkthrough
- Overview
- Sources:
- Further reading
- Starting the cluster
- Create the logs directory
- Downloading the refdata
- Downloading the input data
- Initialising the input jsons
- Downloading the workflow from GitHub
- Running the workflow through toil
- Uploading outputs to s3
This is a rather complex example of running a CWL workflow through AWS ParallelCluster.
The following code will:
- Set up a parallel cluster through cloud formation.
- Download the reference data and input data through sbatch commands.
- These jobs run on compute nodes; make sure they have completed before running the CWL workflow.
- Download and configure the input-json for the CWL workflow
- Launch the CWL workflow through toil
- Patiently wait for the job to finish.
- Upload the data to your preferred S3 bucket.
The following workflow is based on the gridss-purple-linx workflow from this repo.
Much of the reference data is downloaded from the Hartwig nextcloud.
This assumes SSO login has been completed and the pcluster conda environment is activated.
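If you haven't done this yet, a minimal sketch (assuming an AWS SSO profile named dev; substitute your own profile name):
# Log in via SSO and make the profile the default for this shell
aws sso login --profile dev
export AWS_PROFILE=dev
# Activate the parallel-cluster helper environment
conda activate pcluster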
start_cluster.py \
--cluster-name "${USER}-Gridss-Purple-Linx" \
--file-system-type fsx
ssm i-XXXX
By running through screen, we are able to pop in-and-out of this workflow
without losing our environment variables that we collect along the way.
Please refer to the screen docs for more information.
screen -S "gridss-purple-linx-workflow"
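Detach from the session at any time with Ctrl-A then d; re-attach later with:
screen -r "gridss-purple-linx-workflow"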
LOG_DIR="${HOME}/logs"
mkdir -p "${LOG_DIR}"
Download the refdata from the Hartwig nextcloud along with our hg38 ref set. I would recommend mirroring this to your own S3 bucket if you plan to use it more than once; it will make downloading a lot faster and more reliable.
REF_DIR="${SHARED_DIR}/reference-data"
mkdir -p "${REF_DIR}"
These downloads are subject to timeouts, so please check that they complete correctly. If they have, there should not be any zip files left under ${REF_DIR}/hartwig-nextcloud.
You may also need to create the BWA index and GRIDSS reference-cache files in the same directory as the fasta reference.
REF_DIR_HARTWIG="${REF_DIR}/hartwig-nextcloud"
mkdir -p "${REF_DIR_HARTWIG}"
If you do NOT have access to s3://umccr-refdata-dev/gridss-purple-linx
you will need to download
the gzipped tarballs from the hartwig nextcloud repository.
The nextcloud repository does NOT contain the hg38_alt reference data.
s3_refdata_path="s3://umccr-refdata-dev/gridss-purple-linx"
sbatch --job-name="gridss-purple-linx-refdata-download" \
--output "${LOG_DIR}/s3_ref_data_download.%j.log" \
--error "${LOG_DIR}/s3_ref_data_download.%j.log" \
--wrap "aws s3 sync \"${s3_refdata_path}\" \
\"${REF_DIR_HARTWIG}/\""
This step can be skipped if you have downloaded the data from the s3 bucket listed above.
HG37 BWA
hg37_fasta_path="${SHARED_DIR}/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta"
sbatch --job-name "build-bwa-index" \
--output "${LOG_DIR}/build-bwa-index.%j.log" \
--error "${LOG_DIR}/build-bwa-index.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--user \"$(id -u):$(id -g)\" \
--volume \"$(dirname "${hg37_fasta_path}"):/ref-data\" \
\"quay.io/biocontainers/bwa:0.7.17--hed695b0_6\" \
bwa index \"/ref-data/$(basename "${hg37_fasta_path}")\""
HG38 BWA
hg38_fasta_path="${SHARED_DIR}/reference-data/hartwig-nextcloud/hg38/refgenomes/Homo_sapiens.GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
sbatch --job-name "build-bwa-index" \
--output "${LOG_DIR}/build-bwa-index.%j.log" \
--error "${LOG_DIR}/build-bwa-index.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--volume \"$(dirname "${hg38_fasta_path}"):/ref-data\" \
--user \"$(id -u):$(id -g)\" \
\"quay.io/biocontainers/bwa:0.7.17--hed695b0_6\" \
bwa index \"/ref-data/$(basename "${hg38_fasta_path}")\""
HG37 Gridss Cache
sbatch --job-name "build-gridss-reference-cache" \
--output "${LOG_DIR}/build-gridss-cache.%j.log" \
--error "${LOG_DIR}/build-gridss-cache.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--user \"$(id -u):$(id -g)\" \
--volume \"$(dirname "${hg37_fasta_path}"):/ref-data\" \
\"quay.io/biocontainers/gridss:2.9.4--0\" \
java -Xmx4g \
-Dsamjdk.reference_fasta=\"/ref-data/$(basename "${hg37_fasta_path}")\" \
-Dsamjdk.use_async_io_read_samtools=true \
-Dsamjdk.use_async_io_write_samtools=true \
-Dsamjdk.use_async_io_write_tribble=true \
-Dsamjdk.buffer_size=4194304 \
-Dsamjdk.async_io_read_threads=8 \
-cp \"/usr/local/share/gridss-2.9.4-0/gridss.jar\" \
\"gridss.PrepareReference\" \
REFERENCE_SEQUENCE=\"/ref-data/$(basename "${hg37_fasta_path}")\""
HG38 Gridss Cache
sbatch --job-name "build-gridss-reference-cache" \
--output "${LOG_DIR}/build-gridss-cache.%j.log" \
--error "${LOG_DIR}/build-gridss-cache.%j.log" \
--mem-per-cpu=2G \
--cpus-per-task=4 \
--wrap "docker run \
--user \"$(id -u):$(id -g)\" \
--volume \"$(dirname "${hg38_fasta_path}"):/ref-data\" \
\"quay.io/biocontainers/gridss:2.9.4--0\" \
java -Xmx4g \
-Dsamjdk.reference_fasta=\"/ref-data/$(basename "${hg38_fasta_path}")\" \
-Dsamjdk.use_async_io_read_samtools=true \
-Dsamjdk.use_async_io_write_samtools=true \
-Dsamjdk.use_async_io_write_tribble=true \
-Dsamjdk.buffer_size=4194304 \
-Dsamjdk.async_io_read_threads=8 \
-cp \"/usr/local/share/gridss-2.9.4-0/gridss.jar\" \
\"gridss.PrepareReference\" \
REFERENCE_SEQUENCE=\"/ref-data/$(basename "${hg38_fasta_path}")\""
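Once the two cache jobs complete, gridss.PrepareReference should have left a .img reference cache next to each fasta; this is the file the reference_cache_gridss input below points at:
ls -lh "${hg37_fasta_path}.img" "${hg38_fasta_path}.img"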
INPUT_DIR="${SHARED_DIR}/input-data"
mkdir -p "${INPUT_DIR}"
First we use the ica command on our local machine to generate a presigned URL for each input file, and then use ssm_run to submit the corresponding download job remotely. If you do not have ssm_run (a wrapper around aws ssm send-command), you may wish to copy each generated download command to the clipboard and launch it from a shell on the master node instead.
INPUT_DIR_ICA="${INPUT_DIR}/ica/SBJ_seqcii_020/"
mkdir -p "${INPUT_DIR_ICA}"
The jq incantation below is borrowed from this stack overflow thread.
Submit a download job for each file from your local terminal:
# These variables need to be defined on your local terminal
instance_id="<instance_id>"
gds_input_data_path="gds://umccr-primary-data-dev/PD/SEQCII/hg38/SBJ_seqcii_020/"
ica_input_path="\${SHARED_DIR}/input-data/ica/SBJ_seqcii_020"
# Get name and presigned url for each file in the inputs
ica_files_list_with_access="$(ica files list "${gds_input_data_path}" \
--output-format json \
--max-items=0 \
--nonrecursive \
--with-access | {
# Capture outputs with jq
# Output is per line 'name,presigned_url'
jq --raw-output \
'.items | keys[] as $k | "\(.[$k] | .name),\(.[$k] | .presignedUrl)"'
})"
# Submit job remotely using the presigned url
while read p; do
# File name
name="$(echo "${p}" | cut -d',' -f1)"
# The presigned url name
presigned_url="$(echo "${p}" | cut -d',' -f2)"
# Submit the sbatch command via ssm_run
echo "sbatch --job-name=\"${name}\" \
--output \"logs/wget-${name}.%j.log\" \
--error \"logs/wget-${name}.%j.log\" \
--wrap \"wget \\\"${presigned_url}\\\" --output-document \\\"${ica_input_path}/${name}\\\"\"" | \
ssm_run --instance-id "${instance_id}"
done <<< "${ica_files_list_with_access}"
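Back on the master node (in your screen session), you can watch the wget jobs and confirm the inputs land where expected:
squeue -u "${USER}"
ls -lh "${INPUT_DIR_ICA}"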
Run the git checkout steps inside a sub-shell, so the cd does not affect the main shell.
gridss_purple_linx_remote_url="https://github.com/hartwigmedical/gridss-purple-linx"
INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR="${INPUT_DIR}/gridss-purple-linx"
WORKFLOW_VERSION="v1.3.2"
mkdir -p "${INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR}"
(
cd "${INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR}" && \
git init && \
git sparse-checkout init --cone && \
git sparse-checkout set smoke_test/ && \
git remote add -f origin "${gridss_purple_linx_remote_url}" && \
git checkout "${WORKFLOW_VERSION}"
)
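If the sparse checkout worked, the smoke_test/ directory should now be present in the checkout:
ls "${INPUT_DIR_GRIDSS_PURPLE_LINX_REPO_DIR}/smoke_test"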
The input json for the CWL workflow smoke-test will look something like this.
If using vim, you may wish to :set paste before inserting, so that auto-indentation does not mangle the pasted lines.
Write the following file to gridss-purple-linx.packed.input-smoketest.json in your home directory on the EC2 master instance.
{
"sample_name": "CPCT12345678",
"normal_sample": "CPCT12345678R",
"tumor_sample": "CPCT12345678T",
"tumor_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/gridss-purple-linx/smoke_test/CPCT12345678T.bam"
},
"normal_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/gridss-purple-linx/smoke_test/CPCT12345678R.bam"
},
"snvvcf": {
"class": "File",
"location": "__SHARED_DIR__/input-data/gridss-purple-linx/smoke_test/CPCT12345678T.somatic_caller_post_processed.vcf.gz"
},
"fasta_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta"
},
"fasta_reference_version": "37",
"bwa_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta.bwt"
},
"reference_cache_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta.img"
},
"human_virus_reference_fasta": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/refgenomes/human_virus/human_virus.fa"
},
"gc_profile": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gc/GC_profile.1000bp.cnp"
},
"blacklist_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/ENCFF001TDO.bed"
},
"breakend_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/pon3792v1/gridss_pon_single_breakend.bed"
},
"breakpoint_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/pon3792v1/gridss_pon_breakpoint.bedpe"
},
"breakpoint_hotspot": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/KnownFusionPairs.bedpe"
},
"bafsnps_amber": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/germline_het_pon/GermlineHetPon.vcf.gz"
},
"hotspots_purple": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/KnownHotspots.vcf.gz"
},
"known_fusion_data_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/known_fusion_data.csv"
},
"gene_transcripts_dir": {
"class": "Directory",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/ensembl_data_cache/"
},
"viral_hosts_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/viral_host_ref.csv"
},
"replication_origins_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/heli_rep_origins.bed"
},
"line_element_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/line_elements.csv"
},
"fragile_site_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/fragile_sites_hmf.csv"
},
"check_fusions_linx": true,
"driver_gene_panel": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/knowledgebases/DriverGenePanel.tsv"
},
"configuration_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/GRCh37/dbs/gridss/gridss.properties"
}
}
Write the following file to gridss-purple-linx.packed.input-SBJ_seqcii_020.json.
Note that in our use case we've moved our reference data from
/reference-data/hartwig-nextcloud/hg38/
to /reference-data/hartwig-nextcloud/hg38_alt/,
because our input bams were aligned against the alt reference. All complementary files remain the same.
{
"sample_name": "SBJ_seqcii_020",
"normal_sample": "seqcii_N020",
"tumor_sample": "seqcii_T020",
"tumor_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/ica/SBJ_seqcii_020/SBJ_seqcii_020_tumor.bam"
},
"normal_bam": {
"class": "File",
"location": "__SHARED_DIR__/input-data/ica/SBJ_seqcii_020/SBJ_seqcii_020.bam"
},
"snvvcf": {
"class": "File",
"location": "__SHARED_DIR__/input-data/ica/SBJ_seqcii_020/SBJ_seqcii_020.vcf.gz"
},
"fasta_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa"
},
"fasta_reference_version": "38",
"bwa_reference": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa.bwt"
},
"reference_cache_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/hg38/hg38.fa.img"
},
"human_virus_reference_fasta": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/refgenomes/human_virus/human_virus.fa"
},
"gc_profile": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gc/GC_profile.1000bp.cnp"
},
"blacklist_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss/ENCFF001TDO.bed"
},
"breakend_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss_pon/gridss_pon_single_breakend.bed"
},
"breakpoint_pon": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss_pon/gridss_pon_breakpoint.bedpe"
},
"breakpoint_hotspot": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/KnownFusionPairs.bedpe"
},
"bafsnps_amber": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/germline_het_pon/GermlineHetPon.vcf.gz"
},
"known_fusion_data_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/known_fusion_data.csv"
},
"hotspots_purple": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/KnownHotspots.vcf.gz"
},
"gene_transcripts_dir": {
"class": "Directory",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/ensembl_data_cache"
},
"viral_hosts_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/viral_host_ref.csv"
},
"replication_origins_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/heli_rep_origins.bed"
},
"line_element_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/line_elements.csv"
},
"fragile_site_file_linx": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/fragile_sites_hmf.csv"
},
"check_fusions_linx": true,
"driver_gene_panel": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/knowledgebases/DriverGenePanel.tsv"
},
"configuration_gridss": {
"class": "File",
"location": "__SHARED_DIR__/reference-data/hartwig-nextcloud/hg38_alt/dbs/gridss/gridss.properties"
}
}
Replace the __SHARED_DIR__ with the actual absolute path
sed -i "s%__SHARED_DIR__%${SHARED_DIR}%g" gridss-purple-linx.packed.input-smoketest.json
sed -i "s%__SHARED_DIR__%${SHARED_DIR}%g" gridss-purple-linx.packed.input-SBJ_seqcii_020.json
git clone https://github.com/umccr/gridss-purple-linx
# The branch 'cwl-workflow' contains the workflow
( \
cd gridss-purple-linx && \
git checkout cwl-workflow \
)
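The packed CWL workflow referenced below should now be present in the checkout:
ls gridss-purple-linx/cwl/workflows/gridss-purple-linx/latest/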
TOIL_ROOT="${SHARED_DIR}/toil"
mkdir -p "${TOIL_ROOT}"
# Set globals
TOIL_JOB_STORE="${TOIL_ROOT}/job-store"
TOIL_WORKDIR="${TOIL_ROOT}/workdir"
TOIL_TMPDIR="${TOIL_ROOT}/tmpdir"
TOIL_LOG_DIR="${TOIL_ROOT}/logs"
TOIL_OUTPUTS="${TOIL_ROOT}/outputs"
# Create directories
mkdir -p "${TOIL_JOB_STORE}"
mkdir -p "${TOIL_WORKDIR}"
mkdir -p "${TOIL_TMPDIR}"
mkdir -p "${TOIL_LOG_DIR}"
mkdir -p "${TOIL_OUTPUTS}"
# Activate environment
conda activate toil
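It is worth confirming the runner is available in this environment before submitting:
which toil-cwl-runner
toil-cwl-runner --version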
You may need to index the vcf file first, since toil does not yet fully support CWL v1.1 (see this cwltool bug).
vcf_file="${SHARED_DIR}/input-data/gridss-purple-linx/smoke_test/CPCT12345678T.somatic_caller_post_processed.vcf.gz"
sbatch --job-name "index-smoked-vcf-file" \
--output "${LOG_DIR}/index-smoked-vcf-file.%j.log" \
--error "${LOG_DIR}/index-smoked-vcf-file.%j.log" \
--wrap "docker run \
--volume \"$(dirname "${vcf_file}"):/data\" \
--user \"$(id -u):$(id -g)\" \
\"quay.io/biocontainers/tabix:0.2.6--ha92aebf_0\" \
tabix -p vcf \"/data/$(basename "${vcf_file}")\""
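When the job finishes, a .tbi index should sit alongside the vcf:
ls -lh "${vcf_file}.tbi"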
cleanworkdirtype="onSuccess" # Switch to 'never' for further debugging.
gridss_workflow_path="gridss-purple-linx/cwl/workflows/gridss-purple-linx/latest/gridss-purple-linx.latest.cwl"
gridss_purple_input_json="gridss-purple-linx.packed.input-smoketest.json"
partition="copy-long" # Long partition for workflows
sbatch --job-name "toil-gridss-purple-linx-runner" \
--output "${LOG_DIR}/toil.%j.log" \
--error "${LOG_DIR}/toil.%j.log" \
--partition "${partition}" \
--no-requeue \
--wrap "toil-cwl-runner \
--jobStore \"${TOIL_JOB_STORE}/job-\${SLURM_JOB_ID}\" \
--workDir \"${TOIL_WORKDIR}\" \
--outdir \"${TOIL_OUTPUTS}\" \
--batchSystem slurm \
--disableCaching true \
--cleanWorkDir \"${cleanworkdirtype}\" \
\"${gridss_workflow_path}\" \
\"${gridss_purple_input_json}\""
This job will likely fail because the gripss hard-filtering is too strict, until this GitHub issue is fixed.
cleanworkdirtype="onSuccess" # Switch to 'never' for further debugging.
gridss_workflow_path="gridss-purple-linx/cwl/workflows/gridss-purple-linx/latest/gridss-purple-linx.latest.cwl"
gridss_purple_input_json="gridss-purple-linx.packed.input-SBJ_seqcii_020.json"
partition="copy-long" # Long partition for workflows
sbatch --job-name "toil-gridss-purple-linx-runner" \
--output "${LOG_DIR}/toil.%j.log" \
--error "${LOG_DIR}/toil.%j.log" \
--partition "${partition}" \
--no-requeue \
--wrap "toil-cwl-runner \
--jobStore \"${TOIL_JOB_STORE}/job-\${SLURM_JOB_ID}\" \
--workDir \"${TOIL_WORKDIR}\" \
--outdir \"${TOIL_OUTPUTS}\" \
--batchSystem slurm \
--disableCaching true \
--cleanWorkDir \"${cleanworkdirtype}\" \
\"${gridss_workflow_path}\" \
\"${gridss_purple_input_json}\""
This workflow may take a full day to complete!
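Once it has finished, sacct gives a quick summary of elapsed time and final state (again, <job_id> is the id reported by sbatch):
sacct -j <job_id> --format=JobID,JobName%30,Elapsed,State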
By default, you do NOT have permissions to upload to S3.
You can generate temporary credentials using yawsso or aws2-wrap from your local device and then use ssm_run
to upload the data to your S3 bucket.
See Uploading data back to s3 for more information.
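With temporary credentials exported on the master node, the upload itself is just an s3 sync of the toil output directory. A minimal sketch, with a hypothetical destination bucket you should replace with your own:
aws s3 sync "${TOIL_OUTPUTS}/" "s3://<your-results-bucket>/gridss-purple-linx/"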