Variant Calling by GATK - Bioinfo-niab/VariantAnalysis GitHub Wiki
- hisat2
- GATK4
- Picard Tools
- Samtools
- SnpEff
- bcftools
Tool : Picard
Input : Sam or Bam file
Output: Sam or Bam file
Command:
java -jar /usr/local/bin/picard.jar ValidateSamFile I=Input.sam MODE=SUMMARY
Tools : hisat2 and Samtools
Input : fastq files and Reference Genome
Output : aligned_reads.sam
Command:
hisat2 -f -x referenceGenome_indexed -1 reads_R1.fq -2 reads_R2.fq -S aligned.sam| samtools view -bS aligned.sam > aligned.bam
Tool : Picard
Input : .bam file
Output : .bam file
Command:
java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups I=input.bam O=output.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20
RGID=Read-Group ID
RGLB=Read-Group library
RGPL=Read-Group platform (e.g. ILLUMINA, SOLID)
RGPU=Read-Group platform unit (eg. run barcode)
RGSM=Read-Group sample name
###Note: If .bam file without read groups in the header, so I am trying to correct this by using picard tools AddOrReplaceReadGroups.
Tool : Picard
Input : ReferenceGenome.fasta
Output : ReferenceGenome.dict
Command:
java -jar /usr/local/bin/picard.jar CreateSequenceDictionary R=ReferenceGenome.fasta O=ReferenceGenome.dict
Tool : Samtools
Input : ReferenceGenome.fasta
Output : ReferenceGenome.fa
Command:
samtools faidx ReferenceGenome.fasta
Tools : GATK4 MarkDuplicates
Input : .bam file
Output : marked_duplicates.bam, marked_dup_metrics.txt
Command:
java -jar /usr/local/bin/picard.jar MarkDuplicates I=input.bam O=marked_duplicates.bam M=marked_dup_metrics.txt
Tool : Picard
Input : marked_duplicates.bam
Output : marked_duplicates.bai
Command:
java "Xmx16G"-jar /usr/local/bin/picard.jar BuildBamIndex I=marked_duplicates.bam
Tool : GATK4
Input : marked_duplicates.bam, ReferenceGenome.fa
Output : marked_duplicates.g.vcf.gz, marked_duplicates.g.vcf.gz.tbi
Command:
gatk --java-options -Xmx64G HaplotypeCaller -R ReferenceGenome.fa -I marked_duplicates.bam -O marked_duplicates.g.vcf.gz -G StandardAnnotation -ERC GVCF -G AS_StandardAnnotation
G=One or more groups of annotations to apply to variant calls
ERC=Mode for emitting reference confidence scores (For Mutect2, this is a BETA feature)
Tool : GATK4
Input : marked_duplicates.g.vcf.gz, ReferenceGenome.fa
Output: merged.g.vcf.gz, merged.g.vcf.gz.tbi
Command:
gatk CombineGVCFs -R ReferenceGenome.fa --variant marked_duplicates.g.vcf.gz --variant marked_duplicates.g.vcf.gz -O merged.g.vcf.gz
Input : ReferenceGenome.fa, merged.g.vcf.gz
Output : merged.vcf.gz
Command:
gatk --java-options -Xmx64g GenotypeGVCFs -R ReferenceGenome.fa -V merged.g.vcf.gz -O merged.vcf.gz
V=A VCF file containing variants
Tool : GATK4
Input : merged.vcf.gz
Output : merged_SNPs.vcf.gz, merged_SNPs.vcf.gz.idx
Command:
gatk --java-options -Xmx16g SelectVariants -V merged.vcf.gz -select-type SNP -O merged_SNPs.vcf.gz
Tool : GATK4
Input : merged_SNPs.vcf.gz
Output : merged_SNP_filtered.vcf, merged_SNP_filtered.vcf.idx
Command:
gatk --java-options -Xmx16g VariantFiltration -V merged_SNPs.vcf.gz -filter "MQ >40.0" --filter-name "MQ40" -filter "QUAL >30.0" --filter-name "QUAL30" -filter "DP >=100" --filter-name "DP100" -filter "FS <40.0" --filter-name "FS40" -O merged_SNP_filtered.vcf
V=A VCF file containing variants
MQ=RMSMappingQuality (MQ) 40.0
QUAL=A parameter value for QUAL in GATK. Low quality. (default=50)
DP=A parameter value for DP in GATK. Low depth. (default=5)
FS=A parameter value for FS in GATK. FisherStrand. (default=30.0)
Tool : bcftool
Input : merged_SNP_filtered.vcf
Output : merged_bcf_SNP_filtered.vcf
Command:
bcftools view -f "PASS" merged_SNP_filtered.vcf > merged_bcf_SNP_filtered.vcf
Input : merged_bcf_SNP_filtered.vcf
Output : Chromosome_wise_SNP.text
Command:
grep -v "^#" input.vcf | cut -f 1 | sort | uniq -c > Chromosome_wise_SNP.text
Script :  Python3 Input :  merged_bcf_SNP_filtered.vcf and gemone.gff file Output :  ENSBTAG00000009496 15 (No. of SNPs)
Script:
#!/usr/bin/python3
gff = open("genome.gff").read().splitlines()
vcf = open("merged_bcf_SNP_filtered.vcf").read().splitlines()
for gene in gff:
nos = 0
gchr = gene.split("\t")[0]
gstrt = gene.split("\t")[3]
gend = gene.split("\t")[4]
gname = gene.split("\t")[8].split(";")[0].split(":")[1]
for snp in vcf:
if not snp.startswith("#"):
schr = snp.split("\t")[0]
spos = snp.split("\t")[1]
if gchr == schr and int(spos) > int(gstrt) and int(spos) < int(gend):
nos += 1
print(gname+"\t"+str(nos))
Tool : bcftools
Input : 1.vcf.gz, 2.vcf.gz
Output : Common_SNP.vcf
Command:
bcftools isec -p dir -n=2 1.vcf.gz 2.vcf.gz > Common_SNP.vcf
Tool :  bcftools
Input :  1.vcf 2.vcf 3.vcf
Output :  Unique_SNPs.vcf
Command:
bcftools isec -C 1.vcf 2.vcf 3.vcf > 1.unique.vcf
Note- Rather than finding the intersection and filtering it out, I think you could do the filtering in one step with bcftools isec -C 1.vcf 2.vcf 3.vcf > 1.unique.vcf, and repeat that for the other.