4B. Day 4. Variant Calling
Variant Calling Lecture
Log in to the server
ssh -X [email protected]
Password 123456
cd /home/nanobioinfo22/NGSWorkshop_2024/
cd your_name/
cp ../variant_analysis.tar.gz ./
tar -zxvf variant_analysis.tar.gz
There are three paired-end raw-read datasets, one per breed:
cd raw_data
ls
bachaur_R1.fastq.gz 
bachaur_R2.fastq.gz
hariana_R1.fastq.gz 
hariana_R2.fastq.gz
tharparkar_R1.fastq.gz 
tharparkar_R2.fastq.gz
1. Data filtering and trimming
- Data filtering and trimming of low-quality reads using fastp
cd ../1fastp/
[!IMPORTANT]
Run this for every sample with the same parameters, or adjust as required; a loop version is sketched after the three per-sample commands below.
../app/fastp -i ../raw_data/bachaur_R1.fastq.gz -I ../raw_data/bachaur_R2.fastq.gz -o bachaur_trim_R1.fastq.gz -O bachaur_trim_R2.fastq.gz -p --overrepresentation_analysis -l 50 -q 20 -u 20 -h bachaur_report.html -j bachaur_report.json -R bachaur_report --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite
../app/fastp -i ../raw_data/hariana_R1.fastq.gz -I ../raw_data/hariana_R2.fastq.gz -o hariana_trim_R1.fastq.gz -O hariana_trim_R2.fastq.gz -p --overrepresentation_analysis -l 50 -q 20 -u 20 -h hariana_report.html -j hariana_report.json -R hariana_report --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite
../app/fastp -i ../raw_data/tharparkar_R1.fastq.gz -I ../raw_data/tharparkar_R2.fastq.gz -o tharparkar_trim_R1.fastq.gz -O tharparkar_trim_R2.fastq.gz -p --overrepresentation_analysis -l 50 -q 20 -u 20 -h tharparkar_report.html -j tharparkar_report.json -R tharparkar_report --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite
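To avoid typing the command once per sample, the same call can be wrapped in a small bash loop (a sketch assuming the three sample prefixes above):
# run fastp with identical parameters for each sample prefix
for sample in bachaur hariana tharparkar; do
  ../app/fastp -i ../raw_data/${sample}_R1.fastq.gz -I ../raw_data/${sample}_R2.fastq.gz \
    -o ${sample}_trim_R1.fastq.gz -O ${sample}_trim_R2.fastq.gz \
    -p --overrepresentation_analysis -l 50 -q 20 -u 20 \
    -h ${sample}_report.html -j ${sample}_report.json -R ${sample}_report \
    --thread 2 --trim_front1 3 --trim_tail1 2 --trim_front2 3 --trim_tail2 2 --dont_overwrite
done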
2. Alignment and preparing files for variant calling
cd ../2bwa_mem/
- Alignment using BWA-MEM2
[!IMPORTANT]
Run this for every sample with the same parameters, or adjust as required.
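bwa-mem2 needs the reference to be indexed before alignment. The index files for ../ref/brahman.fna are most likely already provided on the server; if not, they can be built once with:
../app/bwa-mem2/bwa-mem2 index ../ref/brahman.fna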
../app/bwa-mem2/bwa-mem2 mem -t 2 ../ref/brahman.fna ../1fastp/bachaur_trim_R1.fastq.gz ../1fastp/bachaur_trim_R2.fastq.gz -o bachaur.sam
../app/bwa-mem2/bwa-mem2 mem -t 2 ../ref/brahman.fna ../1fastp/tharparkar_trim_R1.fastq.gz ../1fastp/tharparkar_trim_R2.fastq.gz -o tharparkar.sam
../app/bwa-mem2/bwa-mem2 mem -t 2 ../ref/brahman.fna ../1fastp/hariana_trim_R1.fastq.gz ../1fastp/hariana_trim_R2.fastq.gz -o hariana.sam
- Sort the SAM files and convert them to BAM using Picard
java -jar ../app/picard.jar SortSam I=bachaur.sam O=bachaur_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/
java -jar ../app/picard.jar SortSam I=hariana.sam O=hariana_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/
java -jar ../app/picard.jar SortSam I=tharparkar.sam O=tharparkar_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/
- Mark duplicates
java -jar ../app/picard.jar MarkDuplicates I=bachaur_sorted.bam O=bachaur_markedup.bam M=bachaur_marked_dup_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=./tmp/
java -jar ../app/picard.jar MarkDuplicates I=hariana_sorted.bam O=hariana_markedup.bam M=hariana_marked_dup_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=./tmp/
java -jar ../app/picard.jar MarkDuplicates I=tharparkar_sorted.bam O=tharparkar_markedup.bam M=tharparkar_marked_dup_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=./tmp/
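Optionally, the duplication rate reported by Picard can be checked in the metrics file; the metrics table is tab-separated and its header row starts with LIBRARY, for example:
grep -A 1 "^LIBRARY" bachaur_marked_dup_metrics.txt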
- Add or replace read groups
java -jar ../app/picard.jar AddOrReplaceReadGroups I=bachaur_markedup.bam O=bachaur_RG.bam RGID=bachaur RGLB=bachaur RGPL=illumina RGPU=bachaur RGSM=bachaur TMP_DIR=./tmp/
java -jar ../app/picard.jar AddOrReplaceReadGroups I=hariana_markedup.bam O=hariana_RG.bam RGID=hariana RGLB=hariana RGPL=illumina RGPU=hariana RGSM=hariana TMP_DIR=./tmp/
java -jar ../app/picard.jar AddOrReplaceReadGroups I=tharparkar_markedup.bam O=tharparkar_RG.bam RGID=tharparkar RGLB=tharparkar RGPL=illumina RGPU=tharparkar RGSM=tharparkar TMP_DIR=./tmp/
- Build the BAM index files
java -jar ../app/picard.jar BuildBamIndex I=bachaur_RG.bam O=bachaur_RG.bai R=../ref/brahman.fna TMP_DIR=./tmp/
java -jar ../app/picard.jar BuildBamIndex I=hariana_RG.bam O=hariana_RG.bai R=../ref/brahman.fna TMP_DIR=./tmp/
java -jar ../app/picard.jar BuildBamIndex I=tharparkar_RG.bam O=tharparkar_RG.bai R=../ref/brahman.fna TMP_DIR=./tmp/
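Optionally, each final read-group BAM can be sanity-checked before variant calling. A quick sketch, assuming samtools is installed on the server (it is not bundled in ../app/):
samtools flagstat bachaur_RG.bam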
3. Variant calling
Using GATK
cd ../3GATK/
[!IMPORTANT]
Run this for every sample with the same parameters, or adjust as required; the same command is shown below for each sample.
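HaplotypeCaller also expects a FASTA index (.fai) and a sequence dictionary (.dict) next to the reference. These are most likely already present in ../ref/; if not, a minimal sketch to create them (assuming samtools is available):
samtools faidx ../ref/brahman.fna
../app/gatk/gatk CreateSequenceDictionary -R ../ref/brahman.fna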
- Run HaplotypeCaller on each of the read-group BAMs produced in the previous step
../app/gatk/gatk HaplotypeCaller -R ../ref/brahman.fna -I ../2bwa_mem/bachaur_RG.bam -O bachaur.bam.vcf.gz -ERC GVCF
../app/gatk/gatk HaplotypeCaller -R ../ref/brahman.fna -I ../2bwa_mem/hariana_RG.bam -O hariana.bam.vcf.gz -ERC GVCF
../app/gatk/gatk HaplotypeCaller -R ../ref/brahman.fna -I ../2bwa_mem/tharparkar_RG.bam -O tharparkar.bam.vcf.gz -ERC GVCF
- Remove the GVCFs generated above and copy in the pre-computed GVCFs (and their index files) provided in the parent directory
rm *.gz *.gz.*
cp ../*.gz ../*.gz.* ./
- Combine all GVCF files
[!WARNING]
Get GVCF for all samples to proceed further.
../app/gatk/gatk CombineGVCFs -R  ../ref/brahman.fna --variant bachaur.bam.vcf.gz --variant hariana.bam.vcf.gz --variant tharparkar.bam.vcf.gz -O combined.gvcf.gz
- Perform joint genotyping on the combined GVCF
../app/gatk/gatk GenotypeGVCFs -R ../ref/brahman.fna -V combined.gvcf.gz -O final.vcf.gz
- Select only SNPs using GATK SelectVariants.
../app/gatk/gatk SelectVariants -R ../ref/brahman.fna -V final.vcf.gz --select-type-to-include SNP -O gatksnp.vcf
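As a quick sanity check, the number of records before and after SNP selection can be counted by ignoring header lines:
zcat final.vcf.gz | grep -vc "^#"
grep -vc "^#" gatksnp.vcf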
Using freebayes
cd ../4freebayes/
cp ../bam_list.txt ./
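bam_list.txt lists one BAM path per line. The file is already provided, so the content below is only an illustrative assumption of what it looks like:
../2bwa_mem/bachaur_RG.bam
../2bwa_mem/hariana_RG.bam
../2bwa_mem/tharparkar_RG.bam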
freebayes -f ../ref/brahman.fna --bam-list bam_list.txt -v freebayes_snp.vcf
- Remove the VCF generated above and copy in the pre-computed freebayes VCF provided in the parent directory
rm freebayes_snp.vcf
cp ../freebayes_snp.vcf ./
4. SNP filtering from the VCF
cd ../5vcffilter/
- Filter as per requirements; for example, with vcftools:
vcftools --vcf  ../3GATK/gatksnp.vcf --remove-indels --maf 0.1 --max-missing 0.9 --minDP 1 --minQ 40 --recode --recode-INFO-all --out snp_q40_true.vcf
Here,
--vcf -> input path for the VCF file
--remove-indels -> remove all indels (keep SNPs only)
--maf -> minimum minor allele frequency, set to 0.1 here
--max-missing -> minimum proportion of non-missing data per site. A little counterintuitive: 0 allows a site to be entirely missing and 1 allows no missing data; 0.9 here tolerates up to 10% missing data.
--minQ -> minimum site quality score required for a site to pass the filter; set to 40 here
--minDP -> minimum depth allowed for a genotype; any individual genotype below this threshold is set to missing
--recode -> recode the output, necessary to write a VCF
--recode-INFO-all -> keep all INFO fields from the input in the recoded output
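To see how many SNPs survived the filters, count the non-header lines of the recoded VCF (the file name follows from the --out prefix above):
grep -vc "^#" snp_q40_true.vcf.recode.vcf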
5. SNP annotation
cd ../6snpeff/
- Annotate the filtered VCF with snpEff (classic mode) against the UOA_Brahman_1.99 database.
java -jar ../app/snpEff/snpEff.jar -classic -v UOA_Brahman_1.99 ../5vcffilter/snp_q40_true.vcf.recode.vcf > snpEff_filtered_classic.vcf
- Extract the SNPs in exons (or other effect categories, as required).
grep -e EXON -e EXON_DELETED -e NON_SYNONYMOUS_CODING -e SYNONYMOUS_CODING -e FRAME_SHIFT -e CODON_CHANGE -e CODON_INSERTION -e CODON_CHANGE_PLUS_CODON_INSERTION -e CODON_DELETION -e CODON_CHANGE_PLUS_CODON_DELETION -e STOP_GAINED -e SYNONYMOUS_STOP -e STOP_LOST -e RARE_AMINO_ACID snpEff_filtered_classic.vcf > output.txt
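Optionally, the extracted lines can be counted and the most common effect terms tallied with standard shell tools (a sketch covering a few of the terms above):
wc -l output.txt
grep -o -e NON_SYNONYMOUS_CODING -e SYNONYMOUS_CODING -e STOP_GAINED output.txt | sort | uniq -c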