Long Read Mapping and Calling SVs with CuteSV - mestato/EPP622_2024 GitHub Wiki
Create directory and link to reference genome and read files
cd /lustre/isaac/proj/UTK0318/dogwood_genome/analysis/<yourusername>
mkdir 2_minimap2
cd 2_minimap2
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/CherBrave_Run1_m64310e_211208_183918.fastq.gz
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/AppSpring_1.0.0_Hap1.Chr11* .
Install minimap2
conda create -n minimap2 bioconda::minimap2
conda activate minimap2
I already did the indexing via an sbatch job (minimap2 -t 3 -d AppSpring_1.0.0_Hap1.Chr11.mmi AppSpring_1.0.0_Hap1.Chr11.fasta), so you won't need to.
Run read mapping
#!/bin/bash
#SBATCH -J minimap2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mail-type=END
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate minimap2
minimap2 -ax map-hifi -t 47 AppSpring_1.0.0_Hap1.Chr11.mmi CherBrave_Run1_m64310e_211208_183918.fastq.gz > CherBrave_Run1.sam
We are using the short queue because the mapping would take a very long time for all of us on the class condo, which is only one node. Even with 48 cores, this will take about an hour.
We are requesting 48 cores but only using 47 threads because the minimap2 manual says about the -t option: "uses up to INT+1 threads when mapping (the extra thread is for I/O, which is frequently idle and takes little CPU time)."
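If you would rather not hardcode 47, the mapping thread count can be derived from Slurm's allocation inside the job script. A minimal sketch (SLURM_CPUS_PER_TASK is set by Slurm during a job; the default here is only for illustration):

```shell
# Derive minimap2's -t from the Slurm allocation: reserve one core for the
# extra I/O thread minimap2 spawns on top of the -t mapping threads.
CPUS="${SLURM_CPUS_PER_TASK:-48}"   # set by Slurm inside a job; default is for illustration
MAP_THREADS=$(( CPUS - 1 ))
echo "$MAP_THREADS"
```

This keeps the script correct if you later change --cpus-per-task.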
Convert sam to sorted bam. Samtools is part of the module system, so we don't have to install it.
#!/bin/bash
#SBATCH -J samtools
#SBATCH --nodes=1
#SBATCH --cpus-per-task=7
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 00:30:00
#SBATCH --mem=28G
#SBATCH --mail-type=END
#SBATCH [email protected]
module load samtools
samtools sort -m 4G -@ 6 -o CherBrave_Run1.bam CherBrave_Run1.sam
samtools flagstat CherBrave_Run1.bam > CherBrave_Run1.stats
samtools index -@ 6 CherBrave_Run1.bam
I requested more memory at the top (28G) from slurm than the command-line math might suggest (6 cores x 4G = 24G) because requesting 24G initially gave me an out of memory error; -m is per thread, and the main thread sorts alongside the additional ones, so seven threads x 4G = 28G is closer to the true peak. I also requested one additional cpu because the samtools documentation says -@ specifies the "Number of input/output compression threads to use in addition to main thread". This job took about 10 minutes for me.
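The memory arithmetic above can be sketched in the script itself; a rough estimate assuming -m is charged per sorting thread, including the main one (the actual peak also includes some overhead):

```shell
# Rough Slurm memory budget for samtools sort:
# -m 4G is per thread, and -@ 6 adds six threads on top of the main one.
ADDITIONAL_THREADS=6
PER_THREAD_G=4
TOTAL_G=$(( (ADDITIONAL_THREADS + 1) * PER_THREAD_G ))
echo "${TOTAL_G}G"   # memory to request from Slurm, before overhead
```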
Make sure the bam has content. The sam is very large (68Gb!), so let's delete it
ls -lh *am
rm CherBrave_Run1.sam
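If you want the delete to be safer, it can be guarded so the sam only goes away once the bam exists and is non-empty. A sketch on mock files (samtools quickcheck would be a stronger check, but a plain size test needs no tools):

```shell
# Only delete the SAM once the BAM exists and is non-empty.
# Demonstrated on mock files; swap in CherBrave_Run1.{sam,bam} on the cluster.
workdir=$(mktemp -d)
printf 'mock sam\n' > "$workdir/reads.sam"
printf 'mock bam\n' > "$workdir/reads.bam"   # non-empty stand-in for the real BAM

if [ -s "$workdir/reads.bam" ]; then
    rm "$workdir/reads.sam"
fi

ls "$workdir"   # only reads.bam should remain
```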
We already got the flagstat results, but we can also get more informative stats from qualimap.
conda create -n qualimap bioconda::qualimap
conda activate qualimap
Create script to run:
#!/bin/bash
#SBATCH -J qualimap
#SBATCH --nodes=1
#SBATCH --cpus-per-task=3
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --mem=12G
eval "$(conda shell.bash hook)"
conda activate qualimap
qualimap bamqc -bam CherBrave_Run1.bam -outdir bamstats -nt 3 --java-mem-size=10G
Create directory
mkdir 3_cutesv
cd 3_cutesv
ln -s ../2_minimap2/CherBrave_Run1.bam
ln -s ../2_minimap2/CherBrave_Run1.bam.bai
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/AppSpring_1.0.0_Hap1.Chr11.fasta
Install CuteSV
conda create -n cutesv bioconda::cutesv
conda activate cutesv
Run CuteSV with the PacBio HiFi recommended options
#!/bin/bash
#SBATCH -J cutesv
#SBATCH --nodes=1
#SBATCH --cpus-per-task=11
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 01:00:00
#SBATCH --mail-type=END
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate cutesv
# Create output directory if it doesn't exist
mkdir -p cuteSV_tmp
cuteSV \
--max_cluster_bias_INS 1000 \
--diff_ratio_merging_INS 0.9 \
--max_cluster_bias_DEL 1000 \
--diff_ratio_merging_DEL 0.5 \
--threads 10 \
--genotype \
CherBrave_Run1.bam \
AppSpring_1.0.0_Hap1.Chr11.fasta \
CherBrave_Run1.gt.vcf \
./cuteSV_tmp
The last four arguments of cuteSV are: <sorted.bam> <reference.fa> <output.vcf> <work_dir>
This was super fast! Less than one minute for me.
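One way to keep the positional order straight is to name the arguments in variables first; a sketch that echoes the command instead of running it, so it works outside the conda environment:

```shell
# cuteSV's four positional arguments, in order: BAM, reference, output VCF, work dir.
BAM=CherBrave_Run1.bam
REF=AppSpring_1.0.0_Hap1.Chr11.fasta
VCF=CherBrave_Run1.gt.vcf
WORKDIR=./cuteSV_tmp
echo cuteSV --genotype "$BAM" "$REF" "$VCF" "$WORKDIR"
```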
Now let's get some stats. We're going to use SURVIVOR, which does lots of nifty things with structural variants. It's super fast, so we won't bother with a slurm script
conda create -n survivor bioconda::survivor
conda activate survivor
SURVIVOR stats
SURVIVOR stats CherBrave_Run1.gt.vcf -1 -1 -1 CherBrave_Run1.gt.vcf.stats
cat CherBrave_Run1.gt.vcf.stats
The three -1 arguments disable (-1 = no filter) SURVIVOR's minimum SV size, maximum SV size, and minimum read-support filters.
Even with only one sample and one chromosome, the bam file is huge, so I subsetted it to get just a little bit to view in IGV:
samtools view CherBrave_Run1.bam AppSpring_1.0.0_Hap1_Chr11:140000000 -o CherBrave_Run1.Chr11.140M.bam
samtools index CherBrave_Run1.Chr11.140M.bam
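The region argument uses samtools' CHR:START[-END] syntax; leaving off END takes everything from START to the end of the sequence. A sketch of assembling the region string used above:

```shell
# Build a samtools-style region string. CHR:START with no end means
# "from START to the end of the sequence".
SEQ=AppSpring_1.0.0_Hap1_Chr11
START=140000000
REGION="${SEQ}:${START}"
echo "$REGION"
```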
Copy all the files:
scp <yourusername>@dtn1.isaac.utk.edu:/lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/AppSpring_1.0.0_Hap1.Chr11.fasta\* .
scp <yourusername>@dtn1.isaac.utk.edu:/lustre/isaac/proj/UTK0318/dogwood_genome/intermediate_files/CherBrave_Run1.Chr11.140M.bam\* .
Check out
- deletion at 140743245
- insertion at 140743907
- duplication at 140567563 (to find its line, for example: grep 'SVTYPE=DUP' CherBrave_Run1.gt.vcf | tail -n 1)
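The positions above come from the VCF's POS column and the type from its INFO field; a sketch of pulling both out with awk, using a mock record (not a real line from the callset):

```shell
# Extract CHROM, POS, and SVTYPE from a VCF record with awk.
# The record is a mock; on the cluster, pipe real lines from CherBrave_Run1.gt.vcf.
record=$(printf 'Chr11\t140567563\tcuteSV.DUP.0\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;SVLEN=5000\tGT\t0/1')
out=$(echo "$record" | awk -F'\t' '{
    match($8, /SVTYPE=[^;]+/)                       # locate SVTYPE=... in the INFO column
    svtype = substr($8, RSTART + 7, RLENGTH - 7)    # drop the "SVTYPE=" prefix
    print $1, $2, svtype
}')
echo "$out"
```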
VisiData is a cool tool for looking at tab-delimited files on the command line.
Install
conda create -n visidata visidata
Check out our vcf file
conda activate visidata
vd -f
- q to quit application
- up, down, left, right arrows to move around
- Enter to look at fields for a single line, q to exit that detail view
- /pattern to search, n to move to next match, escape to stop searching