Long Read Mapping and Calling SVs with CuteSV - mestato/EPP622_2024 GitHub Wiki
Create directory and link to reference genome and read files
cd /lustre/isaac/proj/UTK0318/dogwood_genome/analysis/<yourusername>
mkdir 2_minimap2
cd 2_minimap2
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/CherBrave_Run1_m64310e_211208_183918.fastq.gz
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/AppSpring_1.0.0_Hap1.Chr11* .
Install minimap2
conda create -n minimap2 bioconda::minimap2
conda activate minimap2
I already did the indexing via an sbatch job (minimap2 -t 3 -d AppSpring_1.0.0_Hap1.Chr11.mmi AppSpring_1.0.0_Hap1.Chr11.fasta), so you won't need to.
Run read mapping
#!/bin/bash
#SBATCH -J minimap2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mail-type=END
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate minimap2
minimap2 -ax map-hifi -t 47 AppSpring_1.0.0_Hap1.Chr11.mmi CherBrave_Run1_m64310e_211208_183918.fastq.gz > CherBrave_Run1.sam
We are using the short queue because the mapping would take a very long time for all of us on the class condo, which is only one node. Even with 48 cores, this will take about an hour.
We are requesting 48 cores but only using 47 threads because the minimap2 manual says about the -t option: "uses up to INT+1 threads when mapping (the extra thread is for I/O, which is frequently idle and takes little CPU time)."
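If you would rather not hardcode 47, the mapping thread count can be derived from Slurm's allocation inside the job script. A minimal sketch (SLURM_CPUS_PER_TASK is set by Slurm during a job; the default here is only for illustration):

```shell
# Derive minimap2's -t from the Slurm allocation: reserve one core for the
# extra I/O thread minimap2 spawns on top of the -t mapping threads.
CPUS="${SLURM_CPUS_PER_TASK:-48}"   # set by Slurm inside a job; default is for illustration
MAP_THREADS=$(( CPUS - 1 ))
echo "$MAP_THREADS"
```

This keeps the script correct if you later change --cpus-per-task.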
Convert sam to sorted bam. Samtools is part of the module system, so we don't have to install it.
#!/bin/bash
#SBATCH -J samtools
#SBATCH --nodes=1
#SBATCH --cpus-per-task=7
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 00:30:00
#SBATCH --mem=28G
#SBATCH --mail-type=END
#SBATCH [email protected]
module load samtools
samtools sort -m 4G -@ 6 -o CherBrave_Run1.bam CherBrave_Run1.sam
samtools flagstat CherBrave_Run1.bam > CherBrave_Run1.stats
samtools index -@ 6 CherBrave_Run1.bam
I requested more memory at the top (28G) from slurm than the command-line math might suggest (6 cores x 4G = 24G) because requesting 24G initially gave me an out of memory error; -m is per thread, and the main thread sorts alongside the additional ones, so seven threads x 4G = 28G is closer to the true peak. I also requested one additional cpu because the samtools documentation says -@ specifies the "Number of input/output compression threads to use in addition to main thread". This job took about 10 minutes for me.
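The memory arithmetic above can be sketched in the script itself; a rough estimate assuming -m is charged per sorting thread, including the main one (the actual peak also includes some overhead):

```shell
# Rough Slurm memory budget for samtools sort:
# -m 4G is per thread, and -@ 6 adds six threads on top of the main one.
ADDITIONAL_THREADS=6
PER_THREAD_G=4
TOTAL_G=$(( (ADDITIONAL_THREADS + 1) * PER_THREAD_G ))
echo "${TOTAL_G}G"   # memory to request from Slurm, before overhead
```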
Make sure the bam has content. The sam is very large (68Gb!), so let's delete it
ls -lh *am
rm CherBrave_Run1.sam
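If you want the delete to be safer, it can be guarded so the sam only goes away once the bam exists and is non-empty. A sketch on mock files (samtools quickcheck would be a stronger check, but a plain size test needs no tools):

```shell
# Only delete the SAM once the BAM exists and is non-empty.
# Demonstrated on mock files; swap in CherBrave_Run1.{sam,bam} on the cluster.
workdir=$(mktemp -d)
printf 'mock sam\n' > "$workdir/reads.sam"
printf 'mock bam\n' > "$workdir/reads.bam"   # non-empty stand-in for the real BAM

if [ -s "$workdir/reads.bam" ]; then
    rm "$workdir/reads.sam"
fi

ls "$workdir"   # only reads.bam should remain
```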
We already got the flagstat results, but we can also get more informative stats from qualimap.
conda create -n qualimap bioconda::qualimap
conda activate qualimap
Create script to run:
#!/bin/bash
#SBATCH -J qualimap
#SBATCH --nodes=1
#SBATCH --cpus-per-task=3
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --mem=12G
eval "$(conda shell.bash hook)"
conda activate qualimap
qualimap bamqc -bam CherBrave_Run1.bam -outdir bamstats -nt 3 --java-mem-size=10G
Create directory
mkdir 3_cutesv
cd 3_cutesv
ln -s ../2_minimap2/CherBrave_Run1.bam
ln -s ../2_minimap2/CherBrave_Run1.bam.bai
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/AppSpring_1.0.0_Hap1.Chr11.fasta
Install CuteSV
conda create -n cutesv bioconda::cutesv
conda activate cutesv
Run CuteSV with the PacBio HiFi recommended options
#!/bin/bash
#SBATCH -J cutesv
#SBATCH --nodes=1
#SBATCH --cpus-per-task=11
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 01:00:00
#SBATCH --mail-type=END
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate cutesv
# Create output directory if it doesn't exist
mkdir -p cuteSV_tmp
cuteSV \
--max_cluster_bias_INS 1000 \
--diff_ratio_merging_INS 0.9 \
--max_cluster_bias_DEL 1000 \
--diff_ratio_merging_DEL 0.5 \
--threads 10 \
--genotype \
CherBrave_Run1.bam \
AppSpring_1.0.0_Hap1.Chr11.fasta \
CherBrave_Run1.gt.vcf \
./cuteSV_tmp
The last four arguments of cuteSV are: <sorted.bam> <reference.fa> <output.vcf> <work_dir>
This was super fast! Less than one minute for me.
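One way to keep the positional order straight is to name the arguments in variables first; a sketch that echoes the command instead of running it, so it works outside the conda environment:

```shell
# cuteSV's four positional arguments, in order: BAM, reference, output VCF, work dir.
BAM=CherBrave_Run1.bam
REF=AppSpring_1.0.0_Hap1.Chr11.fasta
VCF=CherBrave_Run1.gt.vcf
WORKDIR=./cuteSV_tmp
echo cuteSV --genotype "$BAM" "$REF" "$VCF" "$WORKDIR"
```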
Now let's get some stats. We're going to use SURVIVOR, which does lots of nifty things with structural variants. It's super fast, so we won't bother with a slurm script
conda create -n survivor bioconda::survivor
conda activate survivor
SURVIVOR stats
SURVIVOR stats CherBrave_Run1.gt.vcf -1 -1 -1 CherBrave_Run1.gt.vcf.stats
cat CherBrave_Run1.gt.vcf.stats
The three -1 arguments disable (-1 = no filter) SURVIVOR's minimum SV size, maximum SV size, and minimum read-support filters.
Even with only one sample and one chromosome, the bam file is huge, so I subsetted it to get just a little bit to view in IGV:
samtools view CherBrave_Run1.bam AppSpring_1.0.0_Hap1_Chr11:140000000 -o CherBrave_Run1.Chr11.140M.bam
samtools index CherBrave_Run1.Chr11.140M.bam
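The region argument uses samtools' CHR:START[-END] syntax; leaving off END takes everything from START to the end of the sequence. A sketch of assembling the region string used above:

```shell
# Build a samtools-style region string. CHR:START with no end means
# "from START to the end of the sequence".
SEQ=AppSpring_1.0.0_Hap1_Chr11
START=140000000
REGION="${SEQ}:${START}"
echo "$REGION"
```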
Copy all the files:
scp <yourusername>@dtn1.isaac.utk.edu:/lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/AppSpring_1.0.0_Hap1.Chr11.fasta\* .
scp <yourusername>@dtn1.isaac.utk.edu:/lustre/isaac/proj/UTK0318/dogwood_genome/intermediate_files/CherBrave_Run1.Chr11.140M.bam\* .
Check out
- deletion at 140743245
- insertion at 140743907
- duplication at 140567563 (to find its line, for example: grep 'SVTYPE=DUP' CherBrave_Run1.gt.vcf | tail -n 1)
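The positions above come from the VCF's POS column and the type from its INFO field; a sketch of pulling both out with awk, using a mock record (not a real line from the callset):

```shell
# Extract CHROM, POS, and SVTYPE from a VCF record with awk.
# The record is a mock; on the cluster, pipe real lines from CherBrave_Run1.gt.vcf.
record=$(printf 'Chr11\t140567563\tcuteSV.DUP.0\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;SVLEN=5000\tGT\t0/1')
out=$(echo "$record" | awk -F'\t' '{
    match($8, /SVTYPE=[^;]+/)                       # locate SVTYPE=... in the INFO column
    svtype = substr($8, RSTART + 7, RLENGTH - 7)    # drop the "SVTYPE=" prefix
    print $1, $2, svtype
}')
echo "$out"
```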
VisiData is a cool tool for looking at tab-delimited files on the command line.
Install
conda create -n visidata visidata
Check out our vcf file
conda activate visidata
vd -f
- q to quit application
- up, down, left, right arrows to move around
- Enter to look at fields for a single line, q to exit that detail view
- /pattern to search, n to move to next match, escape to stop searching