Tutorial - asoltis/MutEnricher GitHub Wiki

MutEnricher Tutorial

Last updated: June 15, 2021

Contents:

Requirements and Installation:

See installation guide.

Introduction

This package performs somatic mutation recurrence analysis in two distinct fashions:

  1. coding analysis, which tests for enrichment above background of non-silent somatic mutations in protein coding genes
  2. non-coding analysis, which tests for enrichment above background of somatic mutations in user-defined non-coding regions of the genome (e.g. promoters, UTRs, enhancers, etc.)

Both tools are accessible through the main package driver script mutEnricher.py. To see the main help page, use python mutEnricher.py -h. To see the detailed help pages for the two analysis types, use python mutEnricher.py <command> -h, where <command> is either coding or noncoding.

Example data for test execution is provided in the example_data sub-directory provided with this package. There are three additional sub-directories within containing example data:

  • annotation_files – includes an example GTF file and BED file of coding gene promoters.
  • covariates – includes example covariate and covariate weights files
  • precomputed_apcluster – includes pre-computed gene/region covariate clustering results for use/testing.
  • vcfs – 100 example somatic mutation files (with tabix-indexed .tbi files)

Additional files include:

  • nonsilent_terms.txt
  • quickstart_commands.txt
  • vcf_files.txt

The example somatic VCF files were generated by randomly assigning “somatic mutations” throughout the genome at a global frequency of 2 mutations per megabase, and thus DO NOT represent true whole genome somatic mutation data. Included with these randomly distributed “mutations” are known cancer mutations in the coding regions of the genes KRAS and TP53 (coding examples) and in the promoter of the TERT gene (non-coding example). These mutations were included at target cohort frequencies of 30%, 90%, and 40%, respectively, and serve as known true positives for testing.

Inputs

Somatic mutations and annotations

The main inputs to MutEnricher are per-sample somatic mutation calls. These should be provided as sorted, bgzipped, and tabix-indexed VCF files for both analysis types (though other options are described later). Such files can be generated from a variety of somatic mutation callers comparing tumor versus normal DNA, though discussion of such tools is outside the scope of this tutorial. MutEnricher considers only PASS variants from the individual VCFs; beyond this, no other per-variant filtering is performed. It is highly recommended that users provide VCF files filtered for only the most reliable variant calls to prevent false discoveries. Several programs, including bcftools, are capable of performing filtering by a variety of variables.

Somatic VCFs for each sample are provided with an input tab-delimited text file (one sample per row). Each row has two columns:

  1. The VCF file path (e.g. /path/to/vcfs/sample1.vcf.gz)
  2. A unique name for the sample (e.g. sample1) An error is thrown if the sample names provided are not unique or if any of the VCF files listed are not found on the system.

For non-coding analyses, no specific VCF annotations are necessary as somatic variants are considered based on genomic location only. Variant annotation is necessary for coding analyses, however. If a non-silent term is found (e.g. nonsynonymous, frameshift, etc.), the variant is recorded as non-silent; otherwise, it is recorded as background. VCFs may be annotated during calling (depending on workflows), added later through various programs (e.g. bcftools or ANNOVAR), or added manually. These annotations must be present in the INFO field of each VCF. For example, a nonsynonymous variant in an exon of the TP53 gene may be annotated in the INFO field with ANNOVAR as:

SOMATIC;Gene.refGene=TP53;ExonicFunc.refGene=nonsynonymous_SNV;Func.refGene=exonic

Here, the gene name is encoded in the Gene.refGene field and the relevant effect on the gene is encoded in the ExonicFunc.refGene field.

In another example:

SOMATIC;Gene.refGene=TP53;ExonicFunc.refGene=.;Func.refGene=splicing

Here, the variant occurs adjacent to an exon boundary and potentially modulates gene splicing. This information is contained in the Func.refGene field. For annotating VCFs with ANNOVAR, see the annotating with ANNOVAR page.

Users can define what terms are considered non-silent in a few ways using the --anno-type option. If VCFs are annotated with ANNOVAR, pre-defined terms can be accessed by setting this option to either annovar-refGene (default --anno-type setting), annovar-knownGene, or annovar-ensGene for refGene, knownGene, or Ensemble gene models, respectively. VCFs annotated with SnpEff can be parsed by setting this option to SnpEff, which will instruct MutEnricher to parse the "ANN" INFO fields of the input VCFs. VCFs annotated with VEP can be parsed by setting this option to VEP, which will parse the "CSQ" INFO fields of the input VCFs. If using somatic VCFs generated and annotated by Illumina pipelines (e.g. annotated by the Illumina annotation engine), users can use pre-defined terms by setting this option to illumina which will parse "CSQT" INFO fields. Alternatively, users can provide a tab-delimited text file containing these terms with this same option (e.g. --anno-type=/path/to/nonsilent_terms.txt). An example of such a file is:

Gene    Gene.refGene    x
Effect  ExonicFunc.refGene      frameshift_deletion
Effect  ExonicFunc.refGene      frameshift_insertion
Effect  ExonicFunc.refGene      frameshift_substitution
Effect  ExonicFunc.refGene      nonframeshift_deletion
Effect  ExonicFunc.refGene      nonframeshift_insertion
Effect  ExonicFunc.refGene      nonframeshift_substitution
Effect  ExonicFunc.refGene      nonsynonymous_SNV
Effect  ExonicFunc.refGene      stopgain
Effect  ExonicFunc.refGene      stoploss
Effect  Func.refGene    splicing

The first column entry of each row must be either Gene or Effect - this signifies which INFO fields are to be searched when extracting gene name and gene effect information, respectively. There should only be one Gene row, but multiple Effect rows are possible. The second column defines the field(s) to be searched from the INFO column of each VCF. For Effect, multiple fields can be considered (e.g. ExonicFunc.refGene and Func.refGene are scanned in the above example). The third column defines the actual non-silent terms. For Gene, this field is not considered (e.g. can be blank or contain a placeholder). Note also that gene names in the Gene field of each VCF should match the nomencalture used in the appropriate genefield from the input GTF file (see below).

The coding analysis code allows users to alternatively input coding variants in mutation annotation format (MAF, see specifications). Non-silent terms following MAF specification are used if this input type is used.

NOTE: the non-coding analysis code does not currently accept MAF files.

Genes and non-coding regions

For coding analyses, gene models must be provided with a gene transfer format (GTF) file. Such files are available from several online sources (e.g. the UCSC table browser from https://genome.ucsc.edu). Gene names are read using the gene_id field by default (the field choice can be modified with the --gene-field option) and exon, coding sequences (CDS), and introns are determined and consolidated into one composite gene in cases of multi-isoform transcript models. As a filtering step, any genes that are annotated to multiple loci in the genome (e.g. the same gene name present on two different chromosomes) are removed from further consideration.

Users may also restrict the program to testing only specific genes in the GTF. This is accomplished by providing a text file with gene names (one per line) with the -g or --gene-list option. Only genes included in this list (and present in the provided GTF) will be considered in the final analysis.

For non-coding analyses, regions for testing (e.g. promoters) are provided with a simple input BED file (see specification). The first three columns are required (chromosome, 0-based start position, 1-based end position). Any text information provided in the fourth column is used as the name for the region (e.g. TERT_promoter) and can be useful for identifying regions in the output. If the fourth column is empty, default names are created.

It is recommended that the input BED file contain sorted and non-overlapping intervals. To accomplish and/or check this, the user can run a simple BEDTools command:

bedtools sort -i input.bed | bedtools merge -c 4 -o collapse > output.bed

Here, input.bed and output.bed are the original input and the sorted, merged output BED files, respectively. The first command sorts the records by position and the second merges any overlapping intervals into one. The options -c 4 and -o collapse instruct bedtools to join the entries in the fourth column (i.e. region names) by a delimiter (comma in this case).

Covariate and covariate weight files

As described earlier, users can optionally cluster features by genomic covariates. To perform this clustering, an input covariates file must be supplied (with the -c or --covariates-file option). The provided file should be tab-delimited with a single header line. The first column contains the feature name, which is either the gene name (coding analysis) or a region string (non-coding) analysis. The latter is of the format :<1-based start>-<1-based end>, e.g. chr10:100-200, and should match the regions in the input BED file (note the 1-based start differs from the 0-based start in the BED!). The remaining columns contain values for the feature covariates, named with their respective column labels in the header line.

As mentioned earlier, we include two utilities to assist with creating these input files. For coding analyses, users can run get_gene_covariates.py, which takes the GTF needed for the coding analysis and an indexed genome sequence file (e.g. for hg19):

python get_gene_covariates.py /path/to/genes.gtf /path/to/genome.fa

The above run will use the gene models included in the GTF and calculate the full gene length, coding length, and GC and CpG contents of the underlying DNA sequence. For coding analysis, the lengths reported are converted to per kilobase and log2 transformed. For example:

Gene    full_length     coding_length   GC      CpG
A1BG    2.74286855102   0.570462931026  0.658   0.672
A1CF    6.43073688093   0.929790997719  0.469   0.315
A2M     5.60056711218   2.14469902515   0.493   0.291
A2ML1   5.76314595815   2.12598165385   0.497   0.286
A3GALT2 3.84126870254   0.0285691521968 0.668   0.873
A4GALT  4.86740230573   0.0827025893302 0.637   0.682
A4GNT   3.11603199345   0.0285691521968 0.507   0.275
AAAS    3.82507326084   0.711935356979  0.575   0.281
AACS    6.28479171873   1.01149563884   0.557   0.528

For non-coding analysis, users can run get_region_covariates.py in very similar fashion:

python get_region_covariates.py /path/to/regions.bed /path/to/genome.fa

The only difference in this base run (other than the function name) is the use of the regions BED file instead of the GTF file used above. This also reports the region length (in basepairs with no transformation) and GC and CpG contents:

Region  length  GC      CpG
chr1:68549-69749        1201    0.356   0.315
chr1:367128-368328      1201    0.416   0.155
chr1:621365-622565      1201    0.415   0.136
chr1:860150-861350      1201    0.689   0.917
chr1:894437-896273      1837    0.655   0.943
chr1:900935-902135      1201    0.676   0.471
chr1:917271-918471      1201    0.642   0.502
chr1:935112-936399      1288    0.731   0.962
chr1:947901-949101      1201    0.318   0.828

Additional information can be supplied to both functions in one of two formats:

  1. One or multiple “table” files, which are tab-delimited with one header each. The first column is reserved for the gene or region name (and should match the features provided in the input GTF/BED file), while the remaining columns contain the desired covariate information. For example, a user could supply a table of gene expression levels from one or more samples to control for gene expression level in the analysis. Data supplied in this way is not transformed or re-scaled and any genes (regions) in the input GTF (BED) file not containing information from these tables are reported as NA.
  2. A tab-delimited text file listing one or more "interval" files, which are bgzipped and tabix-indexed bedGraph files (using the -i/--interval-files option). The input to this option is a two-column tab-delimited text file with no header with columns 1) interval file path and 2) interval dataset name. The code will search the genomic coordinates of the genes/regions in these files and report an average value (in case of multiple overlaps) from the data within. If no data is found in the immediate vicinity, the code will expand its search window from the feature midpoint until it finds a valid interval. If no overlapping intervals are found after the expanded search, “None” is reported. Examples of such data are replication timing data from Repli-seq experiments (e.g. from ENCODE).

Example covariate file generation runs:

python get_gene_covariates.py /path/to/genes.gtf /path/to/genome.fa -t /path/to/RNA_1.txt -t /path/to/RNA_2.txt

The above will produce the same gene covariate information as above, but with additional columns for the information provided in the two additional tables RNA_1.txt and RNA_2.txt.

python get_gene_covariates.py /path/to/genes.gtf /path/to/genome.fa -t /path/to/RNA_1.txt --interval-files /path/to/repliseq.txt

The above will produce the same gene covariate information as above, but with additional columns for the information provided in the table as well as the replication timing information from the file paths included in the interval file paths text file.

In addition, this code can be parallelized across multiple processors with the -p or --processors option. For coding analysis, users can also restrict the list of genes to include in the output with the -g or --gene-list option.

Along with the covariate files themselves, users can also define weights to each covariate that are used in the feature-wise similarity calculations. This allows users to increase or decrease the emphasis of specific covariates. Such weights should be provided as a simple two-column tab-delimited text file with no header line and columns:

  1. Covariate name – this should match the headers in the matching covariates file
  2. The weight value

Weights are numeric values and can be on an arbitrary scale. The enrichment analysis code will automatically re-scale the supplied weights such that they sum to 1. Larger values represent more weight, while smaller values should reflect lower weighted covariates.

Example covariate and covariate weights files are included in the example_data/covariates sub-direcotry.

Run Options

After all desired input files are prepared, the user is now ready to execute either the coding analysis portion of the package or the non-coding portion. Throughout the remaining tutorial, we refer to run protocols using the example files in the example_data sub-directory. We assume the working directory for the analysis is from this directory.

Statistical testing type

MutEnricher now allows users to choose between two methods for statistical testing using the --stat-type option (see methods page). The two options are:

  1. nsamples (default as of version 1.3.0). This options uses the binomial test to compute the significance of the number of samples containing somatic mutations.
  2. nmutations. This option uses the negative binomial test to compute the significance of mutation counts within genes/regions.

The default binomial testing option is used for the majority of the examples in this tutorial.

Coding analysis

The most basic coding analysis run can be executed with:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt

This calls the main driver script and tells it to run the coding module. To explicitly define the output directory of the analysis (default is current working directory) as well as to define a prefix to the output analysis files:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
-o /path/to/output/directory 
--prefix my_analysis

The default run options here tell the code to consider only global gene background frequencies when performing enrichment analyses. Also, the code expects non-silent annotations in the default format, which is currently set to annovar-refGene. To run with a different set of annotation terms relevant to your data:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 

If running the test data, the user should run the above option as the example somatic VCFs were annotated with RefGene information using ANNOVAR.

If the user wishes to run the analysis using local background frequencies for the genes, run the analysis with the --use-local option:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
--use-local

If the user wishes to run the analysis with covariate information, one run option is:

python mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
-c ucsc.refFlat.20170829.no_chrMY.covariates.txt 
-w ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt

Here, a covariates file and an associated covariate weights file is supplied. By default for coding analysis, affinity propagation is run considering all genes. While affinity propagation is indeed capable of handling large numbers of data points, this run can be quite slow as all pairwise similarities between genes must be calculated and written to a file. To reduce the problem complexity and to take advantage of parallelization, the user can split the similarity computations by contigs (e.g. by chromosomes) with the --by-contig option. This tells the program to cluster only genes on the same chromosome and use these clusters in the downstream analysis. This can dramatically reduce run time while still grouping genes by similar characteristics. Also, MutEnricher contains two implementations of affinity propagation that the user can select: fast and slow. The fast algorithm, as the name implies, is faster but consumes more memory than slow. An example run with clustering is:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
-c ucsc.refFlat.20170829.no_chrMY.covariates.txt 
-w ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt 
--by-contig 
-p 10

Note that the -p or --processors sets the number of processors the code can use during execution. The default value is 1, but this can always be set higher if resources are available. When running with covariate information and splitting by contigs, similarity calculations and subsequent affinity propagation runs are split across the available processors. Additional steps in the analysis also benefit from parallelization.

If the user has already run the analysis and computed covariates, but would like to run the workflow with new samples or re-run the same analysis with different parameters (outside of those that influence the covariate clustering), he or she may provide pre-computed covariate clusters to the code. This is accomplished with the --precomputed-covars option, which accepts a path to already computed clustering output. For example:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
-p 10 
--precomputed-covars /path/to/apcluster_genes

When running coding analysis with covariate clustering, an output directory apcluster_genes will be created in the main output directory and houses the covariate information. This path can be supplied in subsequent runs.

Users can also run with a combined covariate clustering plus local background method. As of version 1.3.0, a "fourth" background rate option was implemented that combines the covariate clustering and local background methods. Here, per-sample and per-feature background rates are computed from covariate members as above, though a local background around each feature is now computed (as opposed to the direct rate in the gene window). To execute, set the --use-local option along with covariate information (i.e. via -c/-w or --precomputed-covars):

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
-p 10 
--precomputed-covars /path/to/apcluster_genes
--use-local

An additional feature of the coding analysis code is the ability to accept MAF format files. Typically, such files are generated from experiments of somatic variation using whole exome sequencing. To supply a MAF file (e.g. mutations.maf), the user should use the --maf option and supply a place-holder character (e.g. -) for the VCF file list input:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz - 
--anno-type nonsilent_terms.txt 
-p 10 
--maf mutations.maf

If the data contained in the MAF was generated using whole exome sequencing or a related methodology, the assumptions used in the background calculations will be violated due to the reduced coverage space. In such cases, the user can set the --exome-only option to tell the program to only consider exonic sequence coordinates for genes.

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz - 
--anno-type nonsilent_terms.txt 
-p 10 
--maf mutations.maf 
--exome-only

With MAF input, the default (global) and covariate clustering background methodologies are available. However, given that the local background method depends on the ability to scan regions in and outside the gene limits, this option is not as of now available with MAF input.

Additional details and suggestions for running on MAF input with covariate clustering as the background mutation rate method can be found here

NOTE: In some tests, the p-value distributions resulting from analyses with MAF input can be highly conservative when using the covariate clustering background rate method. Users may wish to try the global background method if this is encountered. Alternatively, users can adjust the mutations considered in the background to only silent mutations with the --bg-vars-type flag (i.e. --bg-vars-type silent). This latter option will produce less conservative p-values, though they may now be too optimistic. Users can run several versions of the analysis and select the most appropriate for the problem at hand.

Non-coding analysis:

The most basic run of the non-coding analysis tool is very similar to that of the coding analysis. The major difference is the requirement of an input BED file as opposed to a GTF:

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o /path/to/output/directory 
--prefix my_noncoding_analysis

Overall, there are fewer options associated with the non-coding analysis. However, those that are present follow the same structure as in the coding analysis. All background options available for the coding analysis are available here.

Non-coding analysis run with "local" background method:

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
--use-local 
--prefix test_local

Non-coding analysis run with covariate clustering enabled (through pre-computed clusters):

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
--precomputed-covars precomputed_apcluster/noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions 
--prefix test_precompAP

Non-coding analysis run with covariate clustering enabled (enable affinity propagation):

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
-c covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariates.txt 
-w covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariate_weights.txt 
--prefix test_APclust 
-p 10

A few subtle differences when running with covariate clustering:

  • The default behavior is to split the analysis by chromosome. This reduces the problem size and takes advantage of parallelization. This is generally useful, if not necessary, as the potential space of non-coding regions is much greater than that of coding genes, depending on how the user defines such features.
  • If using pre-computed covariates, the run procedure is the same as for coding analysis. Be sure to point to the correct clustering output directory (/path/to/apcluster_regions) as before.

Non-coding analysis run with the combined covariate clustering plus local background method enabled:

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
-c covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariates.txt 
-w covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariate_weights.txt 
--use-local
--prefix test_APclust_with_local
-p 10

As of version 1.3.0, users can now run analyses with a combined covariate clustering plus local background method. This method can be particularly useful when running non-coding analyses on short regions, where background mutation rates estimated via the covariate clustering method may be overestimated and produce overly conservative results. In such cases, use of the local method (through --use-local alone) or this combined method can alleviate such issues.

Other run options

The below options apply to both coding and non-coding analysis methods.

Statistical testing type:

  • --stat-type
    • This option can be set to either nsamples (default), which implements the binomial test to compute the significance of the number of samples containing somatic mutations in the features, or nmutations, which uses the negative binomial testing strategy on the number of somatic mutations found in the feature of interest.

Non-silent terms (coding analysis only):

  • --anno-type
    • Option to set terms for consideration of non-silent mutations in somatic VCF files. Pre-sets are included for ANNOVAR annotations (annovar-refGene, annovar-knownGene, and annovar-ensGene for refGene, knownGene, and Ensemble gene annotations, respectively), SnpEff annotations (SnpEff for parsing "ANN" INFO field), VEP annotations (VEP for parsing "CSQ" INFO field), as well as Illumina annotation engine annotations (illumina). Custom annotations can also be set with this option by supplying a tab-delimited text file - this procedure is described above in the somatic mutations and annotations section.

Mutations used in background calculations:

  • --bg-vars-type
    • Option to control definition of “background.” Options are all, which instructs MutEnricher to consider both silent and non-silent mutations during its background estimation procedures (the default). Alternatively, users can adjust this definition by using the silent option with this flag, which instructs MutEnricher to only consider silent mutations in the background calculations.

Parallelization:

  • -p or --processors
    • Option to set the number of run processors.

Mappable regions filtering:

  • -m or --mappable-regions
    • Not all regions of the genome are uniquely mappable by most short read based whole genome/exome sequencing technologies. Users can optionally supply a BED file (bgzipped and tabix-indexed) of mappable genomic regions (e.g. based on known callable regions or user-calculated data from protocol benchmarks) that is used to restrict the analysis space. Variants existing only in genic/non-coding regions overlapping these callable regions are considered and the lengths used in the enrichment calculations are adjusted accordingly.

GTF and gene input options (coding analysis):

  • --gene-field
    • Explicitly define the field name in the input GTF containing the desired gene IDs. The default is gene_id, but users can adjust to suit their needs. NOTE: the format of the IDs in the selected field should be consistent with what was used to annotate the VCF files (e.g. gene symbols).
  • -g or --gene-list
    • Provide a simple text file of gene IDs/symbols to which the coding analysis should be restricted. For instance, this could be a list of genes believed to be expressed in the particular tumor type. Useful for avoiding analysis of non- or lowly-expressed genes, reducing problem size, and reducing multiple hypothesis testing burden.

Covariate clustering:

  • --min-clust-size (coding analysis) or --min-rclust-size (noncoding)
    • During covariate clustering, features grouped together are used for background estimation. In some cases, singletons or very small clusters can be generated. In such cases, a local background can be calculated to better estimate this value for features that are more or less “distinct.” This option takes an integer value and tells the programs to calculate a local background for all features belonging to a cluster with fewer than this many members.
  • --ap-iters
    • This parameter controls the maximum number of affinity propagation iterations. Iterations continue until convergence is attained or after this many iterations are run. In cases where the clustering does not converge, the program will automatically re-run the algorithm with a slightly perturbed value for the self-similarity paramter. This value can be set higher to run more iterations, but is not necessarily recommended (default is 1000).
  • --ap-convits
    • This parameter sets the number of convergence iterations (i.e. the number of consecutive iterations for which the exemplars remain constant) required before terminating affinity propagation runs (default is 50).
  • --ap-algorithm
    • Choose between one of two implementations of the affinity propagation algorithm: slow or fast. As the names imply, 'fast' is a faster implementation, but requires more memory for use. The 'slow' version produces the same results with a moderately slower implementation, but uses less memory.

“Hotspots”:

  • -d or --hotspot-distance
    • Set maximum distance (in base pairs) between two variants for consideration as part of a hotspot. Default value is 50.
  • --min-hs-vars
    • Minimum number of variants that must be present in a candidate hotspot for statistical testing. Default is 3.
  • --min-hs-samps
    • Minimum number of samples that must contain a somatic mutation in candidate hotspot region for statistical testing. Default is 2.

Use only SNPs in analysis:

  • --snps-only
    • Set this flag to only consider SNPs in analysis (i.e. skip indels).

Variant blacklist:

  • --blacklist
    • Provide a tab-delimited text file with no header containing blacklist variants, i.e. variants that are potential false positives that should not be considered in analysis. The required columns are: 1) chromosome, 2) position (1-based index), 3) reference sequence, 4) alternate. The format of reference/alternate should match that of your input VCFs. NOTE: this option is only recommended for a small set of variants (e.g. tens to a few hundreds) as large numbers of variants can cause memory issues if using several parallel processors. In cases where large numbers of variants potentially should be removed, we suggest removing these directly from the input VCFs prior to running the analyses.

Outputs

Detailed explanations of the MutEnricher output files can be found on the output file descriptions page.

Coding analysis outputs:

Five output files are generated from a successful coding analysis run:

  1. [prefix]_gene_enrichments.txt
  2. [prefix]_hotspot.txt
  3. [prefix]_gene_hotspot_Fisher_enrichments.txt
  4. [prefix]_gene_data.pkl
  5. [prefix].log

(1) contains the overall gene significance results, (2) contains the results of the gene hotspot analysis, and (3) contains output combining the p-values from (1) and (2) above (where appropriate). (4) is a python list object containing Gene class objects from the coding analysis run, stored as a python pickle object. (5) is a log file containing run info, including set parameters, input files, etc.

Non-coding analysis outputs:

Four output files are generated from a successful non-coding analysis run:

  1. [prefix]_region_WAP_enrichments.txt
  2. [prefix]_hotspot.txt
  3. [prefix]_region_WAP_hotspot_Fisher_enrichments.txt
  4. [prefix]_region_data.pkl
  5. [prefix].log

(1) contains the combined p-value results from the overall region significance analysis with the weighted average proximity (WAP) procedure. (2) contains the results of the region hotspot analysis. (3) Contains combined results from (1) and (2). (4) is a python list object containing Region class objects from the non-coding analysis run, stored as a python pickle object. (4) is a log file containing run info, including set parameters, input files, etc.

⚠️ **GitHub.com Fallback** ⚠️