4B. Day 4. Variant Calling - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Variant Calling Lecture

Lecture_Variant_calling.pdf

Login to server

ssh -X [email protected]
Password 123456
cd /home/nanobioinfo22/NGSWorkshop_2024/

cd your_name/
cp ../variant_analysis.tar.gz ./
tar -zxvf variant_analysis.tar.gz ./

There are three raw reads paired-end data

cd raw_data
ls
bachaur_R1.fastq.gz 
bachaur_R2.fastq.gz
hariana_R1.fastq.gz 
hariana_R2.fastq.gz
tharparkar_R1.fastq.gz 
tharparkar_R2.fastq.gz

1.Data filtering and trimming

  • Data filtering and trimming of low-quality reads using fastp
cd ../1fastp/

[!IMPORTANT]
Do for all samples with the same parameter or as per requirement.

../app/fastp -i ../raw_data/bachaur_R1.fastq.gz -I ../raw_data/bachaur_R2.fastq.gz -o bachaur_trim_R1.fastq.gz -O bachaur_trim_R2.fastq.gz -p --overrepresentation_analysis -l 50 -q 20 -u 20 -h bachaur_report.html -j bachaur_report.json -R bachaur_report --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite
../app/fastp -i ../raw_data/hariana_R1.fastq.gz -I ../raw_data/hariana_R2.fastq.gz -o hariana_trim_R1.fastq.gz -O hariana_trim_R2.fastq.gz -p --overrepresentation_analysis -l 50 -q 20 -u 20 -h hariana_report.html -j hariana_report.json -R hariana_report --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite
../app/fastp -i ../raw_data/tharparkar_R1.fastq.gz -I ../raw_data/tharparkar_R2.fastq.gz -o tharparkar_trim_R1.fastq.gz -O tharparkar_trim_R2.fastq.gz -p --overrepresentation_analysis -l 50 -q 20 -u 20 -h tharparkar_report.html -j tharparkar_report.json -R tharparkar_report --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite

2.Alignment and preparing files for variant calling

cd ../2bwa_mem/
  • Alignment using BWA-MEM2

[!IMPORTANT]
Do for all samples with the same parameter or as per requirement.

../app/bwa-mem2/bwa-mem2 mem -t 2 ../ref/brahman.fna ../1fastp/bachaur_trim_R1.fastq.gz ../1fastp/bachaur_trim_R2.fastq.gz -o bachaur.sam
../app/bwa-mem2/bwa-mem2 mem -t 2 ../ref/brahman.fna ../1fastp/tharparkar_trim_R1.fastq.gz ../1fastp/tharparkar_trim_R2.fastq.gz -o tharparkar.sam
../app/bwa-mem2/bwa-mem2 mem -t 2 ../ref/brahman.fna ../1fastp/hariana_trim_R1.fastq.gz ../1fastp/hariana_trim_R2.fastq.gz -o hariana.sam
  • Sort the SAM file & convert it to bam using PICARD
java -jar ../app/picard.jar SortSam I=bachaur.sam O=bachur_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/
java -jar ../app/picard.jar SortSam I=hariana.sam O=hariana_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/
java -jar ../app/picard.jar SortSam I=tharparkar.sam O=tharparkar_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/
  • Mark duplicates
java -jar ../app/picard.jar MarkDuplicates I=bachur_sorted.bam O=bachur_markedup.bam M=bachur_marked_dup_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=./tmp/
java -jar ../app/picard.jar MarkDuplicates I=hariana_sorted.bam O=hariana_markedup.bam M=hariana_marked_dup_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=./tmp/
java -jar ../app/picard.jar MarkDuplicates I=tharparkar_sorted.bam O=tharparkar_markedup.bam M=tharparkar_marked_dup_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=./tmp/
  • Add or replace groups
java -jar ../app/picard.jar AddOrReplaceReadGroups I=bachur_markedup.bam O=bachur_RG.bam RGID=bachur RGLB=bachur RGPL=illumina RGPU=bachur RGSM=bachur TMP_DIR=./tmp/
java -jar ../app/picard.jar AddOrReplaceReadGroups I=hariana_markedup.bam O=hariana_RG.bam RGID=hariana RGLB=hariana RGPL=illumina RGPU=hariana RGSM=hariana TMP_DIR=./tmp/
java -jar ../app/picard.jar AddOrReplaceReadGroups I=tharparkar_markedup.bam O=tharparkar_RG.bam RGID=tharparkar RGLB=tharparkar RGPL=illumina RGPU=tharparkar RGSM=tharparkar TMP_DIR=./tmp/
  • Build bam index file
java -jar ../app/picard.jar BuildBamIndex I=bachur_RG.bam O=bachur_RG.bai R=../ref/brahman.fna TMP_DIR=./tmp/
java -jar ../app/picard.jar BuildBamIndex I=hariana_RG.bam O=hariana_RG.bai R=../ref/brahman.fna TMP_DIR=./tmp/
java -jar ../app/picard.jar BuildBamIndex I=tharparkar_RG.bam O=tharparkar_RG.bai R=../ref/brahman.fna TMP_DIR=./tmp/

3.Variant calling

Using GATK

cd ../3GATK/

[!IMPORTANT]
Do for all samples with the same parameter or as per requirement, Here for a single sample it was given

  • Run Haplotype caller for each of your bams obtained from previous final steps
../app/gatk/gatk HaplotypeCaller -R ../ref/brahman.fna -I ../2bwa_mem/bachur_RG.bam -O bachaur.bam.vcf.gz -ERC GVCF
../app/gatk/gatk HaplotypeCaller -R ../ref/brahman.fna -I ../2bwa_mem/hariana_RG.bam -O hariana.bam.vcf.gz -ERC GVCF
../app/gatk/gatk HaplotypeCaller -R ../ref/brahman.fna -I ../2bwa_mem/tharparkar_RG.bam -O tharparkar.bam.vcf.gz -ERC GVCF
rm *.gz *.gz.*
cp ../*.gz ../*.gz.* ./
  • Combine all gvcf files

[!WARNING]
Get GVCF for all samples to proceed further.

../app/gatk/gatk CombineGVCFs -R  ../ref/brahman.fna --variant bachaur.bam.vcf.gz --variant hariana.bam.vcf.gz --variant tharparkar.bam.vcf.gz -O combined.gvcf.gz
  • Perform joint-genotyping on the combined gvcf
../app/gatk/gatk GenotypeGVCFs -R ../ref/brahman.fna -V combined.gvcf.gz -O final.vcf.gz
  • Filter only SNPs using GATK.
../app/gatk/gatk SelectVariants -R ../ref/brahman.fna -V final.vcf.gz --select-type-to-include SNP -O gatksnp.vcf

Using freebayes

cd ../4freebayes/
cp ../bam_list.txt ./
freebayes -f ../ref/brahman.fna --bam-list bam_list.txt -v freebayes_snp.vcf

rm freebayes_snp.vcf
cp ../freebayes_snp.vcf ./

4.SNP filtering from vcf

cd ../5vcffilter/
  • Filtering as per requirements.
vcftools --vcf  ../3GATK/gatksnp.vcf --remove-indels --maf 0.1 --max-missing 0.9 --minDP 1 --minQ 40 --recode --recode-INFO-all --out snp_q40_true.vcf

Here,

--vcf -> input path for vcf file

--remove-indels -> remove all indels (SNPs only)

--maf -> set minor allele frequency - 0.1 here

--max-missing -> set minimum non-missing data. A little counterintuitive - 0 is totally missing, and 1 is none missing. Here 0.9 means we will tolerate 10% missing data.

--minQ -> this is just the minimum quality score required for a site to pass our filtering threshold. Here we set it to 30.

--minDP -> the minimum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype.

--recode -> recode the output - necessary to output a vcf

5.SNP annotation

cd ../6snpeff/
  • Comparing and taking out the annotated files along with filter vcf file.
java -jar ../app/snpEff/snpEff.jar -classic -v UOA_Brahman_1.99 ../5vcffilter/snp_q40_true.vcf.recode.vcf > snpEff_filtered_classic.vcf

  • Took out the SNPs in Exon or as per requirement.
grep -e EXON -e EXON_DELETED -e NON_SYNONYMOUS_CODING -e SYNONYMOUS_CODING -e FRAME_SHIFT -e CODON_CHANGE -e CODON_INSERTION -e CODON_CHANGE_PLUS_CODON_INSERTION -e CODON_DELETION -e CODON_CHANGE_PLUS_CODON_DELETION -e STOP_GAINED -e SYNONYMOUS_STOP -e STOP_LOST -e RARE_AMINO_ACID snpEff_filtered_classic.vcf > output.txt