Primer for DeepVariant
This is a description created by TinyBio (April 9th, 2025)
DeepVariant is a deep learning-based variant caller developed by Google that calls SNPs and small indels from aligned sequencing reads. Setting up and running DeepVariant on a Linux-based supercomputer involves several steps, including installation, configuration, and execution. Here's a detailed guide to help you get started:
Step-by-Step Guide to Setting Up and Running DeepVariant
1. Installation
DeepVariant can be run using Docker, Singularity, or directly from a pre-built binary. On a supercomputer, Singularity is often preferred due to its compatibility with high-performance computing (HPC) environments.
Using Singularity:

- Install Singularity: Ensure that Singularity (or its successor, Apptainer) is installed on your system. You may need to contact your system administrator if you don't have installation privileges.
- Pull the DeepVariant Singularity Image: Use Singularity to pull the DeepVariant image from Docker Hub:

```bash
singularity pull docker://google/deepvariant:latest
```

This command creates a Singularity image file (`deepvariant_latest.sif`) in your current directory. Consider pinning a specific release tag instead of `latest` so your results are reproducible.
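As a quick sanity check, you can print DeepVariant's built-in help from inside the container; this only assumes the image pulled above sits in your current directory:

```bash
# Verify the container runs and the pipeline entry point is present
singularity exec deepvariant_latest.sif \
  /opt/deepvariant/bin/run_deepvariant --help
```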
2. Prepare Input Files
- Reference Genome: Ensure you have a reference genome in FASTA format with a corresponding `.fai` index file (created with `samtools faidx`). BWA indexes are needed for the upstream alignment step, but DeepVariant itself only requires the FASTA and its `.fai`.
- BAM File: You need a sorted and indexed BAM file of your sequencing reads aligned to the reference genome (see the preparation sketch after this list).
- BED File (Optional): If you want to restrict variant calling to specific regions, prepare a BED file with those regions and pass it to DeepVariant via `--regions`.
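A minimal preparation sketch, assuming `samtools` is available on the cluster and using placeholder file names:

```bash
# Index the reference FASTA (creates reference.fasta.fai)
samtools faidx reference.fasta

# Sort and index the BAM file (creates aligned_reads.sorted.bam and its .bai)
samtools sort -o aligned_reads.sorted.bam aligned_reads.bam
samtools index aligned_reads.sorted.bam
```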
3. Running DeepVariant
Create a script to run DeepVariant using Singularity. Here's an example script:
```bash
#!/bin/bash

# Set paths to your input files and output directory
REF="/path/to/reference.fasta"
BAM="/path/to/aligned_reads.bam"
OUTPUT_DIR="/path/to/output_directory"
SINGULARITY_IMAGE="/path/to/deepvariant_latest.sif"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run DeepVariant (use --model_type=WES for exome sequencing)
singularity exec "$SINGULARITY_IMAGE" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="$REF" \
  --reads="$BAM" \
  --output_vcf="$OUTPUT_DIR/output.vcf.gz" \
  --output_gvcf="$OUTPUT_DIR/output.g.vcf.gz" \
  --num_shards=8   # Adjust based on available CPU cores
```
- Model Type: Choose `WGS` for whole-genome sequencing or `WES` for whole-exome sequencing.
- Num Shards: Adjust the `--num_shards` parameter based on the number of CPU cores available; this is how DeepVariant parallelizes its work (see the sketch after this list).
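If you would rather not hard-code the shard count, one option is to derive it from the cores visible to the job. This is a small sketch, assuming `nproc` is available (it is part of GNU coreutils on most Linux systems):

```bash
# Use all CPU cores visible to this shell/job for sharding
NUM_SHARDS=$(nproc)

# Then pass it to run_deepvariant:
#   --num_shards="$NUM_SHARDS"
```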
To process all BAM files in a directory, you can use a `for` loop in the bash script to iterate over each BAM file. Here's how you can modify the script above to achieve this:
```bash
#!/bin/bash

# Set paths to your input files and output directory
REF="/lustre10/home/liedan/ref"
BAM_DIR="/lustre10/home/liedan/mhemapped2"          # Directory containing BAM files
OUTPUT_DIR="/lustre10/home/liedan/deepvariant_out"
SINGULARITY_IMAGE="/lustre10/home/liedan/deepvariant_latest.sif"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop over each BAM file in the directory
for BAM in "$BAM_DIR"/*.bam; do
  # Extract the base name of the BAM file (without path and extension)
  BASENAME=$(basename "$BAM" .bam)

  # Run DeepVariant for each BAM file
  singularity exec "$SINGULARITY_IMAGE" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="$REF" \
    --reads="$BAM" \
    --output_vcf="$OUTPUT_DIR/${BASENAME}_deepvariant_out.vcf.gz" \
    --output_gvcf="$OUTPUT_DIR/${BASENAME}_deepvariant_out.g.vcf.gz" \
    --num_shards=50   # Adjust based on available resources
done
```
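DeepVariant expects each BAM to have an index. If some files in the directory might not be indexed yet, a small guard like the one below could be added inside the loop, before the `singularity exec` call (a sketch, assuming `samtools` is available):

```bash
# Create the BAM index if it does not already exist
# (both common index naming conventions are checked)
if [[ ! -f "${BAM}.bai" && ! -f "${BAM%.bam}.bai" ]]; then
  samtools index "$BAM"
fi
```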
NOTE by Naito
Bird genomes are small, so I used my interactive node instead of the SLURM job scheduler. One sample took about 45 minutes, much faster than GATK HaplotypeCaller! If you do not have an interactive node, proceed to step 4 to use SLURM or another job scheduler.
4. Submit the Job to the Supercomputer
- Job Scheduler: Use your supercomputer's job scheduler (e.g., SLURM, PBS) to submit the script as a job. Here's an example SLURM script:
```bash
#!/bin/bash
#SBATCH --job-name=deepvariant
#SBATCH --output=deepvariant.out
#SBATCH --error=deepvariant.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

bash run_deepvariant.sh
```
- Adjust Resources: Modify the `--time`, `--cpus-per-task`, and `--mem` parameters based on your data size and available resources. Keep `--num_shards` in the run script in line with `--cpus-per-task` so DeepVariant does not request more cores than the job is allocated (see the sketch after this list).
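One way to keep the two settings in sync is to read SLURM's allocation inside the run script. This is a sketch, assuming you adapt `run_deepvariant.sh` to take its shard count from the `SLURM_CPUS_PER_TASK` environment variable that SLURM sets for the job:

```bash
# In run_deepvariant.sh: use the SLURM allocation, falling back to 8 shards
# when the script is run outside a SLURM job
NUM_SHARDS="${SLURM_CPUS_PER_TASK:-8}"

# Then pass it to run_deepvariant:
#   --num_shards="$NUM_SHARDS"
```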
5. Post-Processing
- Output Files: After DeepVariant completes, you'll have a VCF file (`output.vcf.gz`) and optionally a GVCF file (`output.g.vcf.gz`) in your output directory, or one pair of files per sample when using the loop script.
- Validation: Validate the VCF file using tools like `bcftools` or `vcftools` to ensure the quality of the variant calls.
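For example, `bcftools` can summarize the calls and extract the variants that passed DeepVariant's filters; a minimal sketch, assuming `bcftools` is installed on the cluster:

```bash
# Summary statistics: counts of SNPs/indels, ts/tv ratio, quality distribution
bcftools stats output.vcf.gz > output.stats.txt

# Keep only PASS variants and index the filtered VCF
bcftools view -f PASS -Oz -o output.pass.vcf.gz output.vcf.gz
bcftools index -t output.pass.vcf.gz
```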
Summary
- Installation: Use Singularity to run DeepVariant on a supercomputer.
- Input Preparation: Ensure you have the necessary reference and BAM files.
- Execution: Create a script to run DeepVariant and submit it as a job.
- Post-Processing: Validate and analyze the output VCF file.