Germline SNVs and small Indels - genome/analysis-workflows GitHub Wiki
Introduction
Tutorials
Command Line
Setup
Docker Environment
docker run -v `pwd`:/analysis-workflows -it mgibio/gatk-cwl:3.5.0 /bin/bash -l
Change working directory
cd analysis-workflows/example_data/exome_workflow
Haplotype Caller
/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T HaplotypeCaller -R chr17_test.fa -I H_NJ-HCC1395-HCC1395_BL.cram -ERC GVCF -L chr17_test_genes.interval_list -o H_NJ-HCC1395-HCC1395_BL.g.vcf.gz
/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T HaplotypeCaller -R chr17_test.fa -I H_NJ-HCC1395-HCC1395.cram -ERC GVCF -L chr17_test_genes.interval_list -o H_NJ-HCC1395-HCC1395.g.vcf.gz
View GVCF Output
zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL.g.vcf.gz | head
Potential variant sites
zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL.g.vcf.gz | grep ',<' | head
GenotypeGVCFs
/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T GenotypeGVCFs -R chr17_test.fa --variant H_NJ-HCC1395-HCC1395_BL.g.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz
/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T GenotypeGVCFs -R chr17_test.fa --variant H_NJ-HCC1395-HCC1395.g.vcf.gz -o H_NJ-HCC1395-HCC1395_genotype.vcf.gz
View VCF output with genotypes
zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz | head
How many potential variations in each sample?
zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz | wc -l
zgrep -v '^##' H_NJ-HCC1395-HCC1395_genotype.vcf.gz | wc -l
THAT'S A LOT!!! Next, annotation and filtering.
Annotation
Setup VEP
exit
docker run -v `pwd`:/analysis-workflows -it mgibio/vep_helper-cwl:1.0.0 /bin/bash -l
cd /analysis-workflows/example_data/exome_workflow
Run VEP
/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --term SO --flag_pick --transcript_version --tsl --offline --cache --dir $PWD --check_existing --custom chr17_test_gnomADe.vcf.gz,gnomADe,vcf,exact,0,AF,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_SAS --hgvs --fasta chr17_test.fa --synonyms chromAlias.ensembl.txt -o H_NJ-HCC1395-HCC1395_BL_annotated.vcf -i H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz
/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --term SO --flag_pick --transcript_version --tsl --offline --cache --dir $PWD --check_existing --custom chr17_test_gnomADe.vcf.gz,gnomADe,vcf,exact,0,AF,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_SAS --hgvs --fasta chr17_test.fa --synonyms chromAlias.ensembl.txt -o H_NJ-HCC1395-HCC1395_annotated.vcf -i H_NJ-HCC1395-HCC1395_genotype.vcf.gz
View the annotated VCF
grep -v '^##' H_NJ-HCC1395-HCC1395_BL_annotated.vcf | head -n 3
Binary compression and index
/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_annotated.vcf
/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_BL_annotated.vcf
/usr/local/bin/tabix -p vcf H_NJ-HCC1395-HCC1395_BL_annotated.vcf.gz
/usr/local/bin/tabix -p vcf H_NJ-HCC1395-HCC1395_annotated.vcf.gz
Filter Variants
Coding Only Variant Filter
/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --ontology --filter "Consequence is coding_sequence_variant" -o H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.vcf -i H_NJ-HCC1395-HCC1395_BL_annotated.vcf.gz
/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --ontology --filter "Consequence is coding_sequence_variant" -o H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf -i H_NJ-HCC1395-HCC1395_annotated.vcf.gz
How many where filtered?
grep -v '^#' H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.vcf | wc -l
grep -v '^#' H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf | wc -l
Remove common alleles based on gnomADe population allele frequencies
/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --filter "gnomADe_AF < 0.001 or not gnomADe_AF" -i H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf
/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --filter "gnomADe_AF < 0.001 or not gnomADe_AF" -i H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf
How many were filtered?
zgrep -v '^#' H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf | wc -l
zgrep -v '^#' H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf | wc -l
Compress and index
/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf
/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf
/usr/local/bin/tabix H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz
/usr/local/bin/tabix H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz
Setup GATK SelectVariants
exit
docker run -v `pwd`:/analysis-workflows -it mgibio/gatk-cwl:3.6.0 /bin/bash -l
cd /analysis-workflows/example_data/exome_workflow
Run SelectVariants with Target Regions, ie. Exons
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T SelectVariants -R chr17_test.fa -L chr17_test_target.interval_list --excludeFiltered --variant H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T SelectVariants -R chr17_test.fa -L chr17_test_target.interval_list --excludeFiltered --variant H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz
Reporting
VariantsToTable
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantsToTable -R chr17_test.fa -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -F AF -GF DP --variant H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_variants.tsv
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantsToTable -R chr17_test.fa -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -F AF -GF DP --variant H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_variants.tsv
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantsToTable -R chr17_test.fa -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -F AF -GF DP --variant H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_all_coding_variants.tsv
Add VEP fields to table
exit
docker run -v `pwd`:/analysis-workflows -it mgibio/annotation_table-cwl:1.0.0 /bin/bash -l
cd /analysis-workflows/example_data/exome_workflow
mkdir H_NJ-HCC1395-HCC1395
mkdir H_NJ-HCC1395-HCC1395_BL
mkdir H_NJ-HCC1395-HCC1395_all_coding
/usr/bin/python /usr/bin/add_annotations_to_table_helper.py H_NJ-HCC1395-HCC1395_variants.tsv H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz Consequence,SYMBOL,Feature_type,Feature,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,HGNC_ID,Existing_variation,gnomADe_AF,CLIN_SIG,SOMATIC,PHENO $PWD/H_NJ-HCC1395-HCC1395
/usr/bin/python /usr/bin/add_annotations_to_table_helper.py H_NJ-HCC1395-HCC1395_all_coding_variants.tsv H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf.gz Consequence,SYMBOL,Feature_type,Feature,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,HGNC_ID,Existing_variation,gnomADe_AF,CLIN_SIG,SOMATIC,PHENO $PWD/H_NJ-HCC1395-HCC1395_all_coding
/usr/bin/python /usr/bin/add_annotations_to_table_helper.py H_NJ-HCC1395-HCC1395_BL_variants.tsv H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.target_interval_filtered.vcf.gz Consequence,SYMBOL,Feature_type,Feature,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,HGNC_ID,Existing_variation,gnomADe_AF,CLIN_SIG,SOMATIC,PHENO $PWD/H_NJ-HCC1395-HCC1395_BL
View the final Germline variants for the tumor and normal sample
cat H_NJ-HCC1395-HCC1395_BL/variants.annotated.tsv
cat H_NJ-HCC1395-HCC1395/variants.annotated.tsv
Variant ENSP00000269305.4:p.Arg175His in TP53 is likely a pathogenic somatic variant since it's only seen in the tumor sample.
Variant ENSP00000418960.2:p.Arg1772Ter in BRCA1 is likely a pathogenic germline variant that shows up in both the tumor and the normal.
Notice anything about the allele frequencies (AF) between the tumor and normal? What would explain this?
CWL Workflow
Steps
INSERT PROCESS DIAGRAM