MutEnricher Methods - asoltis/MutEnricher GitHub Wiki

MutEnricher Methods Summary

Introduction

This section describes the basic analytical methods implemented in MutEnricher for performing coding and non-coding somatic mutation enrichment assessments. For technical run details and examples, see the quickstart guide or the more extensive tutorial.

Further analytical details are provided in the MutEnricher manuscript.

Features and somatic mutations

Genes are the features of interest for coding analyses. Genes are defined by an input gene transfer format (GTF) file; annotated coding domain sequences (CDS) are used to define coding regions. A gene's total length is defined as the distance from the beginning of its lowest position annotated exon (i.e. 5' most exon) to the end of its highest position annotated exon (i.e. 3' most exon), while a gene's coding length is defined as the total number of positions annotated as CDS. All annotated gene isoforms possessing the same name are merged into a single composite gene, with CDS annotation taking precedence over untranslated region or intron. Any gene name/symbol annotated to multiple distinct locations in the genome (e.g. the same gene symbol annotated to two different chromosomes) are removed from the analysis.

Regions are the features of interest for non-coding analyses. Regions are defined as contiguous loci within the target genome (e.g. chr1:100-200) and are defined by an input BED-formatted file.

Somatic mutations for individual samples are provided with input somatic variant call format (VCF) files (see Tutorial or Quickstart guide for exceptions). In coding analyses, somatic mutations are classified as either: 1) non-silent if they reside in an annotated CDS and are protein altering, or if they are annotated to a splice site; and 2) silent if they are not protein altering when residing in a CDS or if they are present in an intron (and not annotated as altering a splice site). No distinction between silent and non-silent is made during non-coding analyses.

Background mutation rates

For statistical testing, MutEnricher calculates per-sample and per-feature background mutation rates from the somatic mutation profiles of the input samples. Users can select from three distinct background rate calculation methods: global, local, or covariate clustered. NOTE: All mutations (non-silent + silent) are considered when computing background mutation rates; this behavior can be modified during coding analyses (see --bg-vars-type in run options section).

Global

With this method (the default), MutEnricher counts the total number of somatic mutations within all input features (i.e. genes in the input GTF for coding analyses, regions in the input BED for non-coding) in each sample. The per-sample background mutation rates are then simply calculated as the total number of mutations divided by the total queried space in base-pairs. In this scheme, all features within a sample have the same background mutation rate.

Local

With this method (selected with the --use-local optional flag), MutEnricher computes per-sample and per-feature background mutation rates by scanning the local neighborhoods around each feature for somatic mutations.

During coding analysis, if the gene's total length is greater than 10 kb and there is at least one somatic mutation present in its boundary, the local background is reported as the total number of somatic mutations within the gene boundary divided by its total length; otherwise, the somatic VCFs are scanned for mutations in variable length windows (1-2 Mb) and the window producing the highest overall background mutation rate for the gene in a given sample is used.

During non-coding analysis, the same sliding window approach is used for all input regions. Again, the maximum observed background mutation rate from this analysis is used.

Covariate clustered

The third background rate calculation method clusters features by genomic covariates (e.g. GC content, length, replication timing, etc.) and computes per-sample and per-feature background rates from cluster members. MutEnricher uses affinity propagation to cluster features by genomic covariates. Unlike k-means or k-mediods clustering, affinity propagation does not generate a pre-defined number of clustes; rather, the cluster number emerges from the data, influenced by a "self similarity" parameter. MutEnricher sets this parameter to the median of all pairwise feature similarities; these similarities are set to the weighted negative squared distance between feature covariates:

s = -wi * sum( (pi - qi)^2 ) .

Covariate weights can be provided by the user to influence the importance of specific features and are normalized such that they sum to one. Affinity propagation does not require that the similarity from node i -> j equal the similarity from j -> i; however, MutEnricher sets these to the same value. In cases where affinity propagation does not converge, the self similarity parameter is slightly perturbed and re-run until convergence is achieved.

Following the clustering procedure, per-sample feature background mutation rates are calculated from the somatic mutations within all members of its assigned cluster. In cases where a feature is a singleton or a member of small cluster (e.g. 3 or fewer features), MutEnricher uses the local background method to compute its background mutation rate.

Covariate clustered plus local

MutEnricher also implements an additional “fourth” background mutation rate calculation method, which combines the local and covariate clustering methods described above. With this method, gene/region covariate clusters are identified with affinity propagation and background mutation rates are calculated with respect to these clusters as described; the difference here is that sample-wise feature rates are computed using the local strategy, whereby the mutation density in a 1-2 Mb window around each feature is used to estimate the background. This method can be appropriate in cases where small length features are of interest, which may produce overly strict background rate estimates leading to overly conservative p-values when the default covariate clustering strategy is implemented.

Once all gene/region per-sample and per-feature background mutation rates are calculated, the negative binomial probability parameter for each feature is set to the geometric mean of all samples possessing a foreground mutation. In the coding module, a foreground mutation is any somatic mutation classified as non-silent; in the non-coding module, all somatic mutations in a region are considered foreground.

Statistical testing

MutEnricher implements two statistical strategies for determining somatic mutation enrichments. The first method (the default) uses the binomial distribution to determine the significance of observing n samples from N total containing somatic mutations in a given feature of length L with background probability pn; the background probability pn is calculated from the estimated nucleotide mutation rate p obtained from one of the available background calculation methods according to:

pn = 1 - (1 - p)^L

In coding analyses, n is the number of samples with at least one non-silent somatic mutation in a gene, while n in non-coding analyses is the number of samples with any observed somatic mutation in a region. Binomial test p-values are computed with the binom_test function in SciPy according to: binom_test(n, N, pn, alternative = 'greater').

Alternatively, MutEnricher can also use the negative binomial distribution to determine somatic mutation enrichment in genes and regions. Negative binomial p-values are computed with the regularized incomplete beta function for each gene/region:

NB p-value = Ip(k, x - k + 1).

In coding analyses, k is the total number of non-silent somatic mutations observed from all samples in the gene, x is the gene's coding length multiplied by the total number of samples, and p is the background mutation rate for the gene. In non-coding analyses, k is the total number of somatic mutations found within the region's boundary, x is the region's total length multiplied by the number of samples queried, and p is the region's background mutation rate. Internally, the above function is computed with the betainc function in SciPy with parameters k, x - k + 1, and p.

"Hotspot" detection

In addition to overall gene/region burden significance calls, MutEnricher also identifies "hotspots" and reports independent and combined significance estimates for such focal loci. For hotspots, MutEnricher progressively aggregates somatic mutations within close proximity of one another (e.g. 50 bp, a user-defined parameter) and tests these focal regions separately using either the binomial distribution or negative binomial distribution. In coding analyses, candidate hotspots are determined from neighboring mutations within exons; in non-coding analyses, candidate hotspots are found within the full region. Note that a given gene/region can have more than one candidate hotspots that is tested. The noncoding module additionally implements a weighted average proximity (WAP) strategy as an additional test for mutation clustering. All statistical tests, including gene/region burden, hotspots, and combined burden and hotspot procedures, are subsequently corrected for multiple hypotheses using the Benjamini-Hochberg FDR procedure.