Primer for DeepVariant

This is a description created by TinyBio (April 9th, 2025)

DeepVariant is a deep learning-based variant caller developed by Google that calls SNPs and indels from aligned sequencing reads. Setting up and running DeepVariant on a Linux-based supercomputer involves several steps: installation, input preparation, and execution. Here's a detailed guide to help you get started:

Step-by-Step Guide to Setting Up and Running DeepVariant

1. Installation

DeepVariant can be run with Docker, with Singularity, or by building it from source. On a supercomputer, Singularity is often preferred because of its compatibility with high-performance computing (HPC) environments.

Using Singularity:

  1. Install Singularity: Ensure that Singularity is installed on your system. You may need to contact your system administrator if you don't have installation privileges.

  2. Pull the DeepVariant Singularity Image: Use Singularity to pull the DeepVariant image from Docker Hub with the command below. This creates a Singularity image file (deepvariant_latest.sif) in your current directory.

singularity pull docker://google/deepvariant:latest
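
For reproducibility, it is often better to pin a specific release tag rather than latest. The tag below is only an example; check the google/deepvariant page on Docker Hub for the release you actually want:

# Example only: pin a specific DeepVariant release instead of :latest
singularity pull docker://google/deepvariant:1.6.1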

2. Prepare Input Files

  • Reference Genome: Ensure you have a reference genome file in FASTA format. It must be indexed with samtools faidx so that a corresponding .fai file sits next to it; a BWA index is only needed for the upstream alignment step, not by DeepVariant (see the indexing commands after this list).
  • BAM File: You need a sorted and indexed BAM file of your sequencing reads aligned to the reference genome.
  • BED File (Optional): If you want to restrict variant calling to specific regions, prepare a BED file with those regions and pass it to run_deepvariant with the --regions flag.
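
If any index is missing, it can be generated with samtools. A minimal sketch; the file names are placeholders and samtools must be available on your PATH:

# Create the .fai index that DeepVariant requires next to the FASTA
samtools faidx reference.fasta

# Sort and index the BAM (skip sorting if it is already coordinate-sorted)
samtools sort -o aligned_reads.sorted.bam aligned_reads.bam
samtools index aligned_reads.sorted.bam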

3. Running DeepVariant

Create a script to run DeepVariant using Singularity. Here's an example script:

#!/bin/bash

# Set paths to your input files and output directory
REF="/path/to/reference.fasta"
BAM="/path/to/aligned_reads.bam"
OUTPUT_DIR="/path/to/output_directory"
SINGULARITY_IMAGE="/path/to/deepvariant_latest.sif"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run DeepVariant.
# Use --model_type=WES for exome data and set --num_shards to the number of
# CPU cores available. Keep comments off the continuation lines: a comment
# after a trailing backslash breaks the command.
# If your inputs live outside your home directory, you may also need to add
# --bind /your/data/path to the singularity exec call.
singularity exec "$SINGULARITY_IMAGE" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="$REF" \
  --reads="$BAM" \
  --output_vcf="$OUTPUT_DIR/output.vcf.gz" \
  --output_gvcf="$OUTPUT_DIR/output.g.vcf.gz" \
  --num_shards=8

  • Model Type: Choose WGS for whole-genome sequencing or WES for whole-exome sequencing.
  • Num Shards: Adjust the --num_shards parameter based on the number of CPU cores available. This allows DeepVariant to parallelize its work.
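
To pick a sensible --num_shards value, check how many cores the node actually exposes. A quick sketch (the SLURM variable is only set inside a SLURM allocation that requested --cpus-per-task):

# Cores visible on this node
nproc

# Inside a SLURM job, prefer the scheduler's allocation if it is set
echo "${SLURM_CPUS_PER_TASK:-$(nproc)}"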

To process all BAM files in a directory, wrap the DeepVariant call in a for loop that iterates over each BAM file. Here's how you can modify the script above to achieve this:

#!/bin/bash

# Set paths to your input files and output directory.
# REF must point to the reference FASTA file itself (with its .fai alongside),
# not to a directory.
REF="/lustre10/home/liedan/ref"
BAM_DIR="/lustre10/home/liedan/mhemapped2"  # Directory containing BAM files
OUTPUT_DIR="/lustre10/home/liedan/deepvariant_out"
SINGULARITY_IMAGE="/lustre10/home/liedan/deepvariant_latest.sif"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop over each BAM file in the directory
for BAM in "$BAM_DIR"/*.bam; do
  # Extract the base name of the BAM file (without path and extension)
  BASENAME=$(basename "$BAM" .bam)

  # Run DeepVariant for each BAM file
  singularity exec "$SINGULARITY_IMAGE" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="$REF" \
    --reads="$BAM" \
    --output_vcf="$OUTPUT_DIR/${BASENAME}_deepvariant_out.vcf.gz" \
    --output_gvcf="$OUTPUT_DIR/${BASENAME}_deepvariant_out.g.vcf.gz" \
    --num_shards=50  # Adjust based on available resources
done

NOTE by Naito

Bird genomes are small, so I used my interactive node instead of the SLURM job scheduler. One sample took about 45 minutes, much shorter than GATK HaplotypeCaller! If you do not have an interactive node, proceed to step 4 to use SLURM or another job scheduler.
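
If you run the loop on an interactive node, it can help to detach it from your terminal so that a dropped connection does not kill the run. A minimal sketch, assuming the loop script above was saved as run_deepvariant_loop.sh (the file name is only a placeholder):

# Launch the loop in the background and write all output to a log file
# (run_deepvariant_loop.sh is a placeholder name for the loop script above)
nohup bash run_deepvariant_loop.sh > deepvariant_loop.log 2>&1 &

# Follow progress
tail -f deepvariant_loop.log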

4. Submit the Job to the Supercomputer

  • Job Scheduler: Use your supercomputer's job scheduler (e.g., SLURM, PBS) to submit the script as a job. Here's an example SLURM script, assuming the script from step 3 was saved as run_deepvariant.sh:

#!/bin/bash
#SBATCH --job-name=deepvariant
#SBATCH --output=deepvariant.out
#SBATCH --error=deepvariant.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Run the DeepVariant wrapper script from step 3
bash run_deepvariant.sh

  • Adjust Resources: Modify the --time, --cpus-per-task, and --mem parameters based on your data size and available resources. If you would rather submit one job per sample than run a single serial loop, see the array-job sketch below.
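
A SLURM job array gives each BAM file its own task. This is only a sketch, reusing the paths and naming scheme from the loop script above; the --array range must be adjusted to the number of BAM files you actually have, and your site's SLURM settings may differ:

#!/bin/bash
#SBATCH --job-name=deepvariant_array
#SBATCH --output=deepvariant_%A_%a.out
#SBATCH --error=deepvariant_%A_%a.err
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --array=0-9   # one task per BAM file: set to 0-(number of BAMs - 1)

REF="/lustre10/home/liedan/ref"
BAM_DIR="/lustre10/home/liedan/mhemapped2"
OUTPUT_DIR="/lustre10/home/liedan/deepvariant_out"
SINGULARITY_IMAGE="/lustre10/home/liedan/deepvariant_latest.sif"

mkdir -p "$OUTPUT_DIR"

# Pick the BAM file that corresponds to this array index
BAMS=("$BAM_DIR"/*.bam)
BAM="${BAMS[$SLURM_ARRAY_TASK_ID]}"
BASENAME=$(basename "$BAM" .bam)

singularity exec "$SINGULARITY_IMAGE" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="$REF" \
  --reads="$BAM" \
  --output_vcf="$OUTPUT_DIR/${BASENAME}_deepvariant_out.vcf.gz" \
  --output_gvcf="$OUTPUT_DIR/${BASENAME}_deepvariant_out.g.vcf.gz" \
  --num_shards="$SLURM_CPUS_PER_TASK"

Submit the file with sbatch; the %A_%a pattern gives each array task its own log file.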

5. Post-Processing

  • Output Files: After DeepVariant completes, you'll have a VCF file (output.vcf.gz) and optionally a GVCF file (output.g.vcf.gz) in your output directory.
  • Validation: Validate the VCF file using tools like bcftools or vcftools to ensure the quality of the variant calls.
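
For a first check with bcftools (a minimal sketch; bcftools must be installed or loaded as a module, and the file names follow the single-sample script above):

# Index the compressed VCF (creates output.vcf.gz.tbi)
bcftools index -t output.vcf.gz

# Summary statistics: counts of SNPs and indels, Ts/Tv ratio, etc.
bcftools stats output.vcf.gz > output.stats.txt

# Optionally keep only variants that pass DeepVariant's filters
bcftools view -f PASS output.vcf.gz -Oz -o output.pass.vcf.gz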

Summary

  • Installation: Use Singularity to run DeepVariant on a supercomputer.
  • Input Preparation: Ensure you have the necessary reference and BAM files.
  • Execution: Create a script to run DeepVariant and submit it as a job.
  • Post-Processing: Validate and analyze the output VCF file.