Outil 2 : UnifiedGenotyper - Dioufamad/SNPs_Calling GitHub Wiki

Overview

see here : https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php This tool uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting a genotype for each sample. The system can either emit just the variant sites or complete genotypes (which includes homozygous reference calls) satisfying some phred-scaled confidence value
NB1 : The caller can be very aggressive in calling variants in order to be very sensitive, so the raw output will contain many false positives. We use extensive post-calling filters to eliminate most of these FPs. See the documentation on filtering (especially by Variant Quality Score Recalibration) for more details.
NB2 : This tool is able to handle almost any ploidy (except very high ploidies in large pooled experiments); the ploidy can be specified using the -ploidy argument for non-diploid organisms.

Downsampling settings This tool applies the following downsampling settings by default. Mode: BY_SAMPLE To coverage: 250

Window size This tool uses a sliding window on the reference. Window start: -200 bp before the locus Window stop: 200 bp after the locus

Multi-sample SNP calling

 java -jar GenomeAnalysisTK.jar \
   -T UnifiedGenotyper \
   -R reference.fasta \
   -I sample1.bam [-I sample2.bam ...] \
   --dbsnp dbSNP.vcf \
   -o snps.raw.vcf \
   -stand_call_conf [50.0] \
   [-L targets.interval_list]

Generate calls at all sites

 java -jar GenomeAnalysisTK.jar \
   -T UnifiedGenotyper \
   -R reference.fasta \
   -I input.bam \
   -o raw_variants.vcf \
   --output_mode EMIT_ALL_SITES

Options to watch out for :

NB : after '-->' is the value used in the first batch of HC trials. use those value to make a comparable run
========================ADDITIONAL INPUT FILES================================
--alleles / -alleles (none)
Set of alleles to use in genotyping
When --genotyping_mode is set to GENOTYPE_GIVEN_ALLELES mode, the caller will genotype the samples using only the alleles provide in this callset. Note that this is not well tested in HaplotypeCaller, and is definitely not suitable for use with HaplotypeCaller in -ERC GVCF mode. In addition, it does not apply to MuTect2 at all. This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
--comp / -comp ([])
Comparison VCF file
If a call overlaps with a record from the provided comp track, the INFO field will be annotated as such in the output with the track name (e.g. -comp:FOO will have 'FOO' in the INFO field). Records that are filtered in the comp track will be ignored. Note that 'dbSNP' has been special-cased (see the --dbsnp argument). This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
--dbsnp / -D (none)
dbSNP file
rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. dbSNP is not used in any way for the calculations themselves. This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3
--onlyEmitSamples / -onlyEmitSamples ([])
If provided, only these samples will be emitted into the VCF, regardless of which samples are present in the BAM file
=============================ANNOTATIONS=====================================================================
--annotateNDA / -nda (false)
Annotate number of alleles observed
Depending on the value of the --max_alternate_alleles argument, we may genotype only a fraction of the alleles being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the number of alternate alleles that were originally discovered (but not necessarily genotyped) at the site.
--annotation / -A ([])
One or more specific annotations to apply to variant calls
Which annotations to add to the output VCF file. See the VariantAnnotator -list argument to view available annotations.
--group / -G [Standard, StandardUG] (try 'none' to remove the default or make a second none run to compare with)
One or more classes/groups of annotations to apply to variant calls. The single value 'none' removes the default group
If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups. Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.
--excludeAnnotation / -XA ([])
One or more specific annotations to exclude
Which annotations to exclude from output in the VCF file. Note that this argument has higher priority than the -A or -G arguments, so annotations will be excluded even if they are explicitly included with the other options.
=======================GENOTYPING====================================================
--genotype_likelihoods_model / -glm (SNP)
Genotype likelihoods calculation model to employ -- SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together
The --genotype_likelihoods_model argument is an enumerated type (Model), which can have one of the following values:
SNP
INDEL
GENERALPLOIDYSNP
GENERALPLOIDYINDEL
BOTH
--genotyping_mode / -gt_mode (DISCOVERY)
Specifies how to determine the alternate alleles to use for genotyping
The --genotyping_mode argument is an enumerated type (GenotypingOutputMode), which can have one of the following values:
DISCOVERY
The genotyper will choose the most likely alternate allele
GENOTYPE_GIVEN_ALLELES
Only the alleles passed by the user should be considered.
==========================PLOIDY=====================================================
--sample_ploidy / -ploidy (2)
Ploidy per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
Sample ploidy - equivalent to number of chromosome copies per pool. For pooled experiments this should be set to the number of samples in pool multiplied by individual sample ploidy.
======================HETEROZYGOSITY==============================================
--heterozygosity / -hets (0.001)
Heterozygosity value used to compute prior likelihoods for any locus
The expected heterozygosity value used to compute prior probability that a locus is non-reference. See https://software.broadinstitute.org/gatk/documentation/article?id=8603 for more details.
--heterozygosity_stdev / -heterozygosityStandardDeviation (0.01)
Standard deviation of eterozygosity for SNP and indel calling.
The standard deviation of the distribution of alt allele fractions. The above heterozygosity parameters give the *mean* of this distribution; this parameter gives its spread.
--indel_heterozygosity / -indelHeterozygosity (1.25E-4)
Heterozygosity for indel calling
This argument informs the prior probability of having an indel at a site.
============================INDELGAPS===========================================
--indelGapContinuationPenalty / -indelGCP (10)
Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^-30/10
--indelGapOpenPenalty / -indelGOP (45)
Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^-30/10
=============================FOR VERY WELL KNOWN POPULATIONS=============================
--pcr_error_rate / -pcr_error (1.0E-4)
The PCR error rate to be used for computing fragment-based likelihoods
The PCR error rate is independent of the sequencing error rate, which is necessary because we cannot necessarily distinguish between PCR errors vs. sequencing errors. The practical implication for this value is that it effectively acts as a cap on the base qualities.
--input_prior / -inputPrior ([])
Input prior for calls. 
option used when the tendencies in the population are very well known. it is not detailled here as most of the callings of variants we''ll do are on populations we doesnt know like that. see here for more (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php)
==============================MAXIMUM VALUES==========================================================
--max_alternate_alleles / -maxAltAlleles (6)
Maximum number of alternate alleles to genotype
If there are more than this number of alternate alleles presented to the genotyper (either through discovery or GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. See also {@link #MAX_GENOTYPE_COUNT}.
--max_genotype_count / -maxGT (1024)
Maximum number of genotypes to consider at any site
If there are more than this number of genotypes at a locus presented to the genotyper, then only this many genotypes will be used. This is intended to deal with sites where the combination of high ploidy and high alt allele count can lead to an explosion in the number of possible genotypes, with extreme adverse effects on runtime performance. How does it work? The possible genotypes are simply different ways of partitioning alleles given a specific ploidy assumption. Therefore, we remove genotypes from consideration by removing alternate alleles that are the least well supported. The estimate of allele support is based on the ranking of the candidate haplotypes coming out of the graph building step. Note however that the reference allele is always kept. The maximum number of alternative alleles used in the genotyping step will be the lesser of the two: 1. the largest number of alt alleles, given ploidy, that yields a genotype count no higher than {@link #MAX_GENOTYPE_COUNT} 2. the value of {@link #MAX_ALTERNATE_ALLELES} As noted above, genotyping sites with large genotype counts is both CPU and memory intensive. Unless you have a good reason to change the default value, we highly recommend that you not play around with this parameter. See also {@link #MAX_ALTERNATE_ALLELES}.
--max_deletion_fraction / -deletions (0.05)
Maximum fraction of reads with deletions spanning this locus for it to be callable
If the fraction of reads with deletions spanning a locus is greater than this value, the site will not be considered callable and will be skipped. To disable the use of this parameter, set its value to >1.
===========================MINIMUM VALUES=======================================================
--min_base_quality_score / -mbq (17) --> (10)
Minimum base quality required to consider a base for calling
The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value. Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).
--min_indel_count_for_genotyping / -minIndelCnt (5)
Minimum number of consensus indels required to trigger genotyping run
A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value will increase sensitivity but at the cost of larger calling time and a larger number of false positives.
--min_indel_fraction_per_sample / -minIndelFrac (0.25)
Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles
Complementary argument to minIndelCnt. Only samples with at least this fraction of indel-containing reads will contribute to counting and overcoming the threshold minIndelCnt. This parameter ensures that in deep data you don't end up summing lots of super rare errors up to overcome the 5 read default threshold. Should work equally well for low-coverage and high-coverage samples, as low coverage samples with any indel containing reads should easily over come this threshold.
--standard_min_confidence_threshold_for_calling / -stand_call_conf (10.0) (same as HC runs at 30.0 because as Phred-scaled probability. I.e., 30 => 10^-30/10=phread scale)
The minimum phred-scaled confidence threshold at which variants should be called. Only variant sites with QUAL equal or greater than this threshold will be called. Note that since version 3.7, we no longer differentiate high confidence from low confidence calls at the calling step. The default call confidence threshold is set low intentionally to achieve high sensitivity, which will allow false positive calls as a side effect. Be sure to perform some kind of filtering after calling to reduce the amount of false positives in your final callset. Note that when HaplotypeCaller is used in GVCF mode (using either -ERC GVCF or -ERC BP_RESOLUTION) the call threshold is automatically set to zero. Call confidence thresholding will then be performed in the subsequent GenotypeGVCFs command.
=========================MANDATORY ARGUMENTS=========================================================
--out / -o (stdout) (but give a file name)
File to which variants should be written
A raw, unfiltered, highly sensitive callset in VCF format.
=============================ENGINES USED===========================================
--output_mode / -out_mode (EMIT_VARIANTS_ONLY)
Which type of calls we should output
Experimental argument FOR USE WITH UnifiedGenotyper ONLY. When using HaplotypeCaller, use -ERC instead. When using GenotypeGVCFs, see -allSites.
The --output_mode argument is an enumerated type (OutputMode), which can have one of the following values:
EMIT_VARIANTS_ONLY
produces calls only at variant sites
EMIT_ALL_CONFIDENT_SITES
produces calls at variant sites and confident reference sites
EMIT_ALL_SITES
produces calls at any callable site regardless of confidence; this argument is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode
--pair_hmm_implementation / -pairHMM (LOGLESS_CACHING)
The PairHMM implementation to use for -glm INDEL genotype likelihood calculations
The PairHMM implementation to use for -glm INDEL genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.
The --pair_hmm_implementation argument is an enumerated type (HMM_IMPLEMENTATION), which can have one of the following values:
EXACT
ORIGINAL
LOGLESS_CACHING
VECTOR_LOGLESS_CACHING
VECTOR_LOGLESS_CACHING_OMP
VECTOR_LOGLESS_CACHING_FPGA_EXPERIMENTAL
FASTEST_AVAILABLE
DEBUG_VECTOR_LOGLESS_CACHING
ARRAY_LOGLESS
--useNewAFCalculator / -newQual (false)
Use new AF model instead of the so-called exact model
This activates a model for calculating QUAL that was introduced in version 3.7 (November 2016). We expect this model will become the default in future versions.
==============================TO ADD FOR SIMILARITY WITH HC BATCH 1 RUNS===================
-rf BadCigar
-allowPotentiallyMisencodedQuals
=============================================================================================

models for command line :

java -Xmx12g -jar /usr/local/gatk-3.3/GenomeAnalysisTK.jar -T UnifiedGenotyper --help

java -Xmx12g -jar /usr/local/gatk-3.3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../test8_0_REFERENCE_with_faidx/MSU7.fa -I ../test8_2_IndelRealigner/AX227_vsMSU7_bwamem_pp_nodup_sorted_idxed_realigned_aMBQSsinceRTC.bam -o ./AX227_vsMSU7_bwamem_pp_nodup_sorted_idxed_realignedbam_aMBQSsinceRTC.bam_raw_snp_indels.vcf -rf BadCigar -allowPotentiallyMisencodedQuals

now try it on script and let it run with a normal indelrealignedbam pp nop dup (Your job 1378367 ("amad_ug_pp_nodup_bioinfq") has been submitted--> done)