Variant Calling by GATK - Bioinfo-niab/VariantAnalysis GitHub Wiki

RNAseq short variant identification

Full List of Tools:

  • hisat2
  • GATK4
  • Picard Tools
  • Samtools
  • SnpEff
  • bcftools

Main Steps:

Step1   Validate Sam or Bam files

Tool : Picard
Input : Sam or Bam file
Output:  Sam or Bam file

Command: 
java -jar /usr/local/bin/picard.jar ValidateSamFile I=Input.sam MODE=SUMMARY

Step2  Alignment – Map to Reference

Tools :  hisat2 and Samtools
Input :  fastq files and Reference Genome
Output :  aligned_reads.sam

Command: 
hisat2 -f -x referenceGenome_indexed -1 reads_R1.fq -2 reads_R2.fq -S aligned.sam| samtools view -bS aligned.sam > aligned.bam 

Step3  AddOrReplaceReadGroups

Tool :  Picard
Input :  .bam file
Output :  .bam file

Command:
java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups I=input.bam O=output.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20

RGID=Read-Group ID
RGLB=Read-Group library
RGPL=Read-Group platform (e.g. ILLUMINA, SOLID)
RGPU=Read-Group platform unit (eg. run barcode)
RGSM=Read-Group sample name

###Note:  If .bam file without read groups in the header, so I am trying to correct this by using picard tools AddOrReplaceReadGroups.

Step4  Prepare a genome fasta file to use as reference

Tool :  Picard
Input :  ReferenceGenome.fasta
Output :  ReferenceGenome.dict

Command:
java -jar /usr/local/bin/picard.jar CreateSequenceDictionary R=ReferenceGenome.fasta O=ReferenceGenome.dict

Step5  Creating fasta index file

Tool :  Samtools
Input :  ReferenceGenome.fasta
Output :  ReferenceGenome.fa

Command:
samtools faidx ReferenceGenome.fasta

Step6  MarkDuplicates

Tools :  GATK4 MarkDuplicates
Input :  .bam file
Output :  marked_duplicates.bam, marked_dup_metrics.txt

Command:
java -jar /usr/local/bin/picard.jar MarkDuplicates I=input.bam O=marked_duplicates.bam M=marked_dup_metrics.txt

Step6  BuildBamIndex

Tool :  Picard
Input :  marked_duplicates.bam
Output :  marked_duplicates.bai

Command:
java "Xmx16G"-jar /usr/local/bin/picard.jar BuildBamIndex I=marked_duplicates.bam

Step7  HaplotypeCaller

Tool :  GATK4
Input :  marked_duplicates.bam, ReferenceGenome.fa
Output :  marked_duplicates.g.vcf.gz, marked_duplicates.g.vcf.gz.tbi

Command:
gatk --java-options -Xmx64G HaplotypeCaller -R ReferenceGenome.fa -I marked_duplicates.bam -O marked_duplicates.g.vcf.gz -G StandardAnnotation -ERC GVCF -G AS_StandardAnnotation

G=One or more groups of annotations to apply to variant calls
ERC=Mode for emitting reference confidence scores (For Mutect2, this is a BETA feature)

Step7  Merged multiple samples

Tool :  GATK4
Input :  marked_duplicates.g.vcf.gz, ReferenceGenome.fa
Output:  merged.g.vcf.gz, merged.g.vcf.gz.tbi


Command:
gatk CombineGVCFs -R ReferenceGenome.fa  --variant marked_duplicates.g.vcf.gz --variant marked_duplicates.g.vcf.gz -O merged.g.vcf.gz

Step8  GenotypeGVCFs for variants calling

Input :  ReferenceGenome.fa, merged.g.vcf.gz
Output :  merged.vcf.gz


Command:
gatk --java-options -Xmx64g GenotypeGVCFs -R ReferenceGenome.fa -V merged.g.vcf.gz -O merged.vcf.gz
V=A VCF file containing variants

Step9  Selectvariants for SNPs selection

Tool :  GATK4
Input :  merged.vcf.gz
Output :  merged_SNPs.vcf.gz, merged_SNPs.vcf.gz.idx


Command:
gatk --java-options -Xmx16g SelectVariants -V merged.vcf.gz  -select-type SNP -O merged_SNPs.vcf.gz

Step10  Variantfiltration

Tool :  GATK4
Input :  merged_SNPs.vcf.gz
Output :  merged_SNP_filtered.vcf, merged_SNP_filtered.vcf.idx


Command:
gatk --java-options -Xmx16g VariantFiltration -V merged_SNPs.vcf.gz -filter "MQ >40.0" --filter-name "MQ40" -filter "QUAL >30.0" --filter-name "QUAL30" -filter "DP >=100" --filter-name "DP100" -filter "FS <40.0" --filter-name "FS40" -O  merged_SNP_filtered.vcf
V=A VCF file containing variants
MQ=RMSMappingQuality (MQ) 40.0
QUAL=A parameter value for QUAL in GATK. Low quality. (default=50)
DP=A parameter value for DP in GATK. Low depth. (default=5)
FS=A parameter value for FS in GATK. FisherStrand. (default=30.0)

Step11  Filtering using bcftools

Tool :  bcftool
Input :  merged_SNP_filtered.vcf
Output :  merged_bcf_SNP_filtered.vcf


Command:
bcftools view -f "PASS" merged_SNP_filtered.vcf > merged_bcf_SNP_filtered.vcf 

Step12  Count number of SNPs per chromosome

Input :  merged_bcf_SNP_filtered.vcf
Output :  Chromosome_wise_SNP.text

Command:
grep -v "^#" input.vcf | cut -f 1 | sort | uniq -c > Chromosome_wise_SNP.text

Step13  Count number of SNPs per Gene

Script :&nbsp Python3 Input :&nbsp merged_bcf_SNP_filtered.vcf and gemone.gff file Output :&nbsp ENSBTAG00000009496 15 (No. of SNPs)

Script:
#!/usr/bin/python3
gff = open("genome.gff").read().splitlines()
vcf = open("merged_bcf_SNP_filtered.vcf").read().splitlines()

for gene in gff:
    nos = 0
    gchr = gene.split("\t")[0]
    gstrt = gene.split("\t")[3]
    gend = gene.split("\t")[4]
    gname = gene.split("\t")[8].split(";")[0].split(":")[1]
    for snp in vcf:
        if not snp.startswith("#"):
            schr = snp.split("\t")[0]
            spos = snp.split("\t")[1]
            if gchr == schr and int(spos) > int(gstrt) and int(spos) < int(gend):
                nos += 1
    print(gname+"\t"+str(nos))

Step14  Retrieve common SNPs b/w two vcf files

Tool :  bcftools
Input :  1.vcf.gz, 2.vcf.gz
Output :  Common_SNP.vcf

Command:
bcftools isec -p dir -n=2 1.vcf.gz  2.vcf.gz > Common_SNP.vcf      

Step15  Retrieve Unique SNPs b/w multiple vcf files

Tool :&nbsp bcftools
Input :&nbsp 1.vcf 2.vcf 3.vcf
Output :&nbsp Unique_SNPs.vcf

Command:
bcftools isec -C 1.vcf 2.vcf 3.vcf > 1.unique.vcf 
Note- Rather than finding the intersection and filtering it out, I think you could do the filtering in one step with bcftools isec -C 1.vcf 2.vcf 3.vcf > 1.unique.vcf, and repeat that for the other.

⚠️ **GitHub.com Fallback** ⚠️