HaplotypeCaller - umccr/aws_parallel_cluster GitHub Wiki
An AWS parallel cluster walkthrough
- Overview
- Sources:
- Further reading
- Starting the cluster
- Setup and Data Staging
- Submit job to cromwell server through cromshell
- Where are my outputs
- Uploading outputs to s3
This is a straightforward example of running a WDL workflow through parallel cluster.
The following code will:
- Set up a parallel cluster through CloudFormation.
- Start the cromwell server on the master node.
- Download the reference data and input data through sbatch commands.
  - These jobs are run on compute nodes.
  - You will need to make sure they complete before running the WDL script.
- Download and configure the input JSON for the WDL workflow.
- Submit the WDL job to the cromwell server.
- Patiently wait for the job to finish.
- Upload the data to your preferred S3 bucket.
The following resources are courtesy of the Broad Institute.
The cluster uses the umccr_dev template.
This command is launched from the umccr/infrastructure repo under the parallel_cluster directory.
Assumes SSO login has been completed and the pcluster conda env is activated.
start_cluster.py \
  --cluster-name "${USER}-WDLHaplotypeCaller"
# Log in with
ssm i-XXXX
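If you don't know the instance ID of the master node, one way to find it from your local machine (assuming your AWS CLI credentials are already set up from the SSO login):
## List running instances with their Name tags to spot the master node
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,Tags[?Key=='Name']|[0].Value]" \
  --output table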
By running through screen, we are able to pop in and out of this workflow without losing the environment variables we collect along the way. Please refer to the screen docs for more information.
screen -S "haplotype_caller_workflow"
SHARED_DIR should already be in your ~/.bashrc.
if [[ ! -v SHARED_DIR ]]; then
  echo "Error: couldn't find 'SHARED_DIR' env var" 1>&2
fi
bash "${HOME}/bin/start-cromwell-server.sh"
The server is now running on port 8000.
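A quick way to confirm the server is up; the version endpoint is part of cromwell's REST API:
## Should return a small JSON payload with the cromwell version
curl -s "http://localhost:8000/engine/v1/version"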
LOG_DIR="${HOME}/logs"
mkdir -p "${LOG_DIR}"
Will put this in the conda env post-install list at some point. 🚧
For now, just build it. It doesn't take too long.
conda create --yes \
--name gcloud \
gsutil
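Once the env has been built, a quick check that gsutil is on the PATH:
conda activate gcloud
gsutil version
conda deactivate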
Download the reference data from Broad's Google Cloud.
REF_DIR="${SHARED_DIR}/reference-data"
mkdir -p "${REF_DIR}"
This is a 'cloud-to-cloud' download, so it should be pretty fast.
Remember to let these jobs complete before commencing the workflow (a quick squeue check is shown after the download commands below).
REF_DIR_BROAD="${REF_DIR}/broadinstitute/hg38"
mkdir -p "${REF_DIR_BROAD}"
ref_fasta_gcloud_dir_url="gs://gcp-public-data--broad-references/hg38/v0"
conda activate gcloud
## Download ref fasta files
sbatch --job-name "download_broad_ref_data" \
--output="${LOG_DIR}/download_broad_ref_data.log" --error="${LOG_DIR}/download_broad_ref_fasta.log" \
--wrap="gsutil cp \"${ref_fasta_gcloud_dir_url}/Homo_sapiens_assembly38.fasta*\" \"${REF_DIR_BROAD}/\""
## Download ref dict files
sbatch --job-name "download_broad_ref_data" \
--output="${LOG_DIR}/download_broad_ref_data.log" --error="${LOG_DIR}/download_broad_ref_dict.log" \
--wrap="gsutil cp \"${ref_fasta_gcloud_dir_url}/Homo_sapiens_assembly38.dict\" \"${REF_DIR_BROAD}/\""
conda deactivate
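A quick way to confirm the sbatch download jobs have finished is to check the queue; they disappear from squeue once complete. A minimal check, assuming the default slurm setup on the master node:
## Any jobs still pending or running are listed here; an empty list means the
## downloads are done (check the logs under ${LOG_DIR} if something failed)
squeue --user "${USER}"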
This creates a folder of fifty interval files; we then write their paths to a list file using find.
This intervals file is pivotal for scattering over multiple contigs at once per sample.
## Create intervals dir and list file
interval_list_gcloud_dir="gs://gcp-public-data--broad-references/hg38/v0/scattered_calling_intervals/"
intervals_dir="${REF_DIR_BROAD}/hg38_wgs_scattered_calling_intervals"
intervals_list="${REF_DIR_BROAD}/hg38_wgs_scattered_calling_intervals.txt"
mkdir -p "${intervals_dir}"
conda activate gcloud
## Copy interval list
sbatch \
--job-name "download_scattered_calling_intervals_data" \
--output="${LOG_DIR}/download_scattered_calling_data.log" \
--error="${LOG_DIR}/download_scattered_calling_data.log" \
--wrap="gsutil -m rsync -r \"${interval_list_gcloud_dir}\" \"${intervals_dir}/\""
## Create the intervals list file (wait for the rsync job above to finish first)
find "${intervals_dir}" -type f > "${intervals_list}"
conda deactivate
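Once the rsync job has finished and the list file has been written, a quick sanity check against the fifty files mentioned above:
## The list should contain fifty interval_list paths
wc -l "${intervals_list}"
head -n 3 "${intervals_list}"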
Download the input data from Broad's Google Cloud.
INPUT_DIR="${SHARED_DIR}/input-data"
mkdir -p "${INPUT_DIR}"
These CRAM files are around 20 GB, so they may take a few minutes to download.
INPUT_DIR_BROAD="${INPUT_DIR}/broadinstitute"
mkdir -p "${INPUT_DIR_BROAD}"
giab_data_gcloud_dir_url="gs://broad-public-datasets/NA12878/NA12878.cram*"
conda activate gcloud
sbatch --job-name "download_giab_data" \
--output="${LOG_DIR}/download_giab_data.%j.log" \
--error="${LOG_DIR}/download_giab_data.%j.log" \
--wrap="gsutil cp \"${giab_data_gcloud_dir_url}\" \"${INPUT_DIR_BROAD}\""
conda deactivate
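Once the job has finished, the CRAM and its index should be sitting in the input directory:
## Confirm the CRAM (and .crai index) downloaded successfully
ls -lh "${INPUT_DIR_BROAD}"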
HaplotypeCaller is a standalone workflow, meaning we don't need to zip up any helper scripts.
wdl_direct_download_link="https://raw.githubusercontent.com/gatk-workflows/gatk4-germline-snps-indels/master/haplotypecaller-gvcf-gatk4.wdl"
wget --output-document="haplotypecaller-gvcf-gatk4.wdl" \
"${wdl_direct_download_link}"
Copy the JSON below to haplotype-caller.wdl.input.json in your home directory.
We will then use the sed command to change __SHARED_DIR__ to the proper absolute path.
If you use vim, you may wish to run :set paste to ensure the JSON is formatted correctly.
{
"HaplotypeCallerGvcf_GATK4.input_bam": "__SHARED_DIR__/input-data/broadinstitute/NA12878.cram",
"HaplotypeCallerGvcf_GATK4.input_bam_index": "__SHARED_DIR__/input-data/broadinstitute/NA12878.cram.crai",
"HaplotypeCallerGvcf_GATK4.ref_dict": "__SHARED_DIR__/reference-data/broadinstitute/hg38/Homo_sapiens_assembly38.dict",
"HaplotypeCallerGvcf_GATK4.ref_fasta": "__SHARED_DIR__/reference-data/broadinstitute/hg38/Homo_sapiens_assembly38.fasta",
"HaplotypeCallerGvcf_GATK4.ref_fasta_index": "__SHARED_DIR__/reference-data/broadinstitute/hg38/Homo_sapiens_assembly38.fasta.fai",
"HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "__SHARED_DIR__/reference-data/broadinstitute/hg38/hg38_wgs_scattered_calling_intervals.txt"
}
Replace __SHARED_DIR__ with the actual absolute path:
sed -i "s%__SHARED_DIR__%${SHARED_DIR}%g" haplotype-caller.wdl.input.json
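A quick check that the substitution worked and the JSON still parses (assumes python is on the PATH of the master node):
## Should print nothing if every placeholder was replaced
grep "__SHARED_DIR__" haplotype-caller.wdl.input.json
## Parse check
python -m json.tool haplotype-caller.wdl.input.json > /dev/null && echo "JSON OK"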
cromshell is an easy way to talk to the cromwell server.
We will go through the very basic commands in the next step.
cromshell is installed in the cromwell_tools conda env.
Please activate this environment before continuing.
conda activate cromwell_tools
mkdir -p .cromshell
cromshell_config_file=".cromshell/cromwell_server.config"
cromwell_port="8000"
if [[ ! -f "${cromshell_config_file}" ]]; then
echo "http://localhost:${cromwell_port}" > "${cromshell_config_file}"
fi
Set the options file to /opt/cromwell/configs/options.json.
You may wish to create your own options file.
A list of available options can be found here.
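If you do create your own, here is a minimal sketch; final_workflow_outputs_dir and use_relative_output_paths are standard cromwell workflow options, while the file path and output directory below are just placeholders:
## Write a minimal custom options file
cat > "${HOME}/my-options.json" <<EOF
{
    "final_workflow_outputs_dir": "${SHARED_DIR}/cromwell/final-outputs",
    "use_relative_output_paths": true
}
EOF
If you go down this route, pass the path to your own file to cromshell submit below in place of /opt/cromwell/configs/options.json.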
## workflow/inputs/options
cromshell submit \
haplotypecaller-gvcf-gatk4.wdl \
haplotype-caller.wdl.input.json \
/opt/cromwell/configs/options.json
Ensure a workflow is running
cromshell status
Useful for debugging a workflow
cromshell metadata <WORKFLOW_ID>
All files should be under ${SHARED_DIR}/cromwell/outputs.
This may differ if you have specified a different output location in the options.json.
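To locate the final GVCF (the workflow merges the scattered calls into a single *.g.vcf.gz; the glob below assumes the default output naming):
## Find the merged GVCF and its index under the cromwell outputs directory
find "${SHARED_DIR}/cromwell/outputs" -type f -name "*.g.vcf.gz*"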
🚧 Use yawsso or aws2-wrap instead.
By default, you do NOT have permissions to upload to S3.
The following steps will allow you to have access to upload data to S3.
You may then run a sync command to upload your output data to your bucket of choice.
- Head to the EC2 console; there you will see your master node in the running instances.
- Check the IAM role; it should start with parallel-cluster.
- Click on the role and it should bring you to the role summary page.
- Click Attach policies, then search for AmazonS3FullAccess.
You may now run aws s3 sync /path/to/outputs s3://my-bucket/ to upload your data for safekeeping.
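A sketch of the sync with a dry run first; the bucket name and prefix are placeholders:
## Preview what would be uploaded
aws s3 sync --dryrun "${SHARED_DIR}/cromwell/outputs" "s3://my-bucket/haplotype-caller-outputs/"
## Then run the real thing once the file list looks right
aws s3 sync "${SHARED_DIR}/cromwell/outputs" "s3://my-bucket/haplotype-caller-outputs/"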