RNAseq short variant identification

Full List of Tools:

hisat2
GATK4
Picard Tools
Samtools
SnpEff
bcftools

Main Steps:

Step1 Validate Sam or Bam files

Tool : Picard
Input : Sam or Bam file
Output: Sam or Bam file

Command: 
java -jar /usr/local/bin/picard.jar ValidateSamFile I=Input.sam MODE=SUMMARY

Step2 Alignment – Map to Reference

Tools : hisat2 and Samtools
Input : fastq files and Reference Genome
Output : aligned_reads.sam

Command: 
hisat2 -f -x referenceGenome_indexed -1 reads_R1.fq -2 reads_R2.fq -S aligned.sam| samtools view -bS aligned.sam > aligned.bam

Step3 AddOrReplaceReadGroups

Tool : Picard
Input : .bam file
Output : .bam file

Command:
java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups I=input.bam O=output.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20

RGID=Read-Group ID
RGLB=Read-Group library
RGPL=Read-Group platform (e.g. ILLUMINA, SOLID)
RGPU=Read-Group platform unit (eg. run barcode)
RGSM=Read-Group sample name

###Note: If .bam file without read groups in the header, so I am trying to correct this by using picard tools AddOrReplaceReadGroups.

Step4 Prepare a genome fasta file to use as reference

Tool : Picard
Input : ReferenceGenome.fasta
Output : ReferenceGenome.dict

Command:
java -jar /usr/local/bin/picard.jar CreateSequenceDictionary R=ReferenceGenome.fasta O=ReferenceGenome.dict

Step5 Creating fasta index file

Tool : Samtools
Input : ReferenceGenome.fasta
Output : ReferenceGenome.fa

Command:
samtools faidx ReferenceGenome.fasta

Step6 MarkDuplicates

Tools : GATK4 MarkDuplicates
Input : .bam file
Output : marked_duplicates.bam, marked_dup_metrics.txt

Command:
java -jar /usr/local/bin/picard.jar MarkDuplicates I=input.bam O=marked_duplicates.bam M=marked_dup_metrics.txt

Step6 BuildBamIndex

Tool : Picard
Input : marked_duplicates.bam
Output : marked_duplicates.bai

Command:
java "Xmx16G"-jar /usr/local/bin/picard.jar BuildBamIndex I=marked_duplicates.bam

Step7 HaplotypeCaller

Tool : GATK4
Input : marked_duplicates.bam, ReferenceGenome.fa
Output : marked_duplicates.g.vcf.gz, marked_duplicates.g.vcf.gz.tbi

Command:
gatk --java-options -Xmx64G HaplotypeCaller -R ReferenceGenome.fa -I marked_duplicates.bam -O marked_duplicates.g.vcf.gz -G StandardAnnotation -ERC GVCF -G AS_StandardAnnotation

G=One or more groups of annotations to apply to variant calls
ERC=Mode for emitting reference confidence scores (For Mutect2, this is a BETA feature)

Step7 Merged multiple samples

Tool : GATK4
Input : marked_duplicates.g.vcf.gz, ReferenceGenome.fa
Output: merged.g.vcf.gz, merged.g.vcf.gz.tbi


Command:
gatk CombineGVCFs -R ReferenceGenome.fa  --variant marked_duplicates.g.vcf.gz --variant marked_duplicates.g.vcf.gz -O merged.g.vcf.gz

Step8 GenotypeGVCFs for variants calling

Input : ReferenceGenome.fa, merged.g.vcf.gz
Output : merged.vcf.gz


Command:
gatk --java-options -Xmx64g GenotypeGVCFs -R ReferenceGenome.fa -V merged.g.vcf.gz -O merged.vcf.gz
V=A VCF file containing variants

Step9 Selectvariants for SNPs selection

Tool : GATK4
Input : merged.vcf.gz
Output : merged_SNPs.vcf.gz, merged_SNPs.vcf.gz.idx


Command:
gatk --java-options -Xmx16g SelectVariants -V merged.vcf.gz  -select-type SNP -O merged_SNPs.vcf.gz

Step10 Variantfiltration

Tool : GATK4
Input : merged_SNPs.vcf.gz
Output : merged_SNP_filtered.vcf, merged_SNP_filtered.vcf.idx


Command:
gatk --java-options -Xmx16g VariantFiltration -V merged_SNPs.vcf.gz -filter "MQ >40.0" --filter-name "MQ40" -filter "QUAL >30.0" --filter-name "QUAL30" -filter "DP >=100" --filter-name "DP100" -filter "FS <40.0" --filter-name "FS40" -O  merged_SNP_filtered.vcf
V=A VCF file containing variants
MQ=RMSMappingQuality (MQ) 40.0
QUAL=A parameter value for QUAL in GATK. Low quality. (default=50)
DP=A parameter value for DP in GATK. Low depth. (default=5)
FS=A parameter value for FS in GATK. FisherStrand. (default=30.0)

Step11 Filtering using bcftools

Tool : bcftool
Input : merged_SNP_filtered.vcf
Output : merged_bcf_SNP_filtered.vcf


Command:
bcftools view -f "PASS" merged_SNP_filtered.vcf > merged_bcf_SNP_filtered.vcf

Step12 Count number of SNPs per chromosome

Input : merged_bcf_SNP_filtered.vcf
Output : Chromosome_wise_SNP.text

Command:
grep -v "^#" input.vcf | cut -f 1 | sort | uniq -c > Chromosome_wise_SNP.text

Step13 Count number of SNPs per Gene

Script :&nbsp Python3 Input :&nbsp merged_bcf_SNP_filtered.vcf and gemone.gff file Output :&nbsp ENSBTAG00000009496 15 (No. of SNPs)

Script:
#!/usr/bin/python3
gff = open("genome.gff").read().splitlines()
vcf = open("merged_bcf_SNP_filtered.vcf").read().splitlines()

for gene in gff:
    nos = 0
    gchr = gene.split("\t")[0]
    gstrt = gene.split("\t")[3]
    gend = gene.split("\t")[4]
    gname = gene.split("\t")[8].split(";")[0].split(":")[1]
    for snp in vcf:
        if not snp.startswith("#"):
            schr = snp.split("\t")[0]
            spos = snp.split("\t")[1]
            if gchr == schr and int(spos) > int(gstrt) and int(spos) < int(gend):
                nos += 1
    print(gname+"\t"+str(nos))

Step14 Retrieve common SNPs b/w two vcf files

Tool : bcftools
Input : 1.vcf.gz, 2.vcf.gz
Output : Common_SNP.vcf

Command:
bcftools isec -p dir -n=2 1.vcf.gz  2.vcf.gz > Common_SNP.vcf

Step15 Retrieve Unique SNPs b/w multiple vcf files

Tool :&nbsp bcftools
Input :&nbsp 1.vcf 2.vcf 3.vcf
Output :&nbsp Unique_SNPs.vcf

Command:
bcftools isec -C 1.vcf 2.vcf 3.vcf > 1.unique.vcf 
Note- Rather than finding the intersection and filtering it out, I think you could do the filtering in one step with bcftools isec -C 1.vcf 2.vcf 3.vcf > 1.unique.vcf, and repeat that for the other.

Variant Calling by GATK - Bioinfo-niab/VariantAnalysis GitHub Wiki

RNAseq short variant identification

Full List of Tools:

Main Steps:

Step1 Validate Sam or Bam files

Step2 Alignment – Map to Reference

Step3 AddOrReplaceReadGroups

Step4 Prepare a genome fasta file to use as reference

Step5 Creating fasta index file

Step6 MarkDuplicates

Step6 BuildBamIndex

Step7 HaplotypeCaller

Step7 Merged multiple samples

Step8 GenotypeGVCFs for variants calling

Step9 Selectvariants for SNPs selection

Step10 Variantfiltration

Step11 Filtering using bcftools

Step12 Count number of SNPs per chromosome

Step13 Count number of SNPs per Gene

Step14 Retrieve common SNPs b/w two vcf files

Step15 Retrieve Unique SNPs b/w multiple vcf files

⚠️ GitHub.com Fallback ⚠️

Variant Calling by GATK - Bioinfo-niab/VariantAnalysis GitHub Wiki

RNAseq short variant identification

Full List of Tools:

Main Steps:

Step1 Validate Sam or Bam files

Step2 Alignment – Map to Reference

Step3 AddOrReplaceReadGroups

Step4 Prepare a genome fasta file to use as reference

Step5 Creating fasta index file

Step6 MarkDuplicates

Step6 BuildBamIndex

Step7 HaplotypeCaller

Step7 Merged multiple samples

Step8 GenotypeGVCFs for variants calling

Step9 Selectvariants for SNPs selection

Step10 Variantfiltration

Step11 Filtering using bcftools

Step12 Count number of SNPs per chromosome

Step13 Count number of SNPs per Gene

Step14 Retrieve common SNPs b/w two vcf files

Step15 Retrieve Unique SNPs b/w multiple vcf files

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️