Quickstart guide - asoltis/MutEnricher GitHub Wiki

MutEnricher Quickstart Guide

Last updated: June 15, 2021

Contents:

Requirements and Installation:

Usage

NOTE: For the below example runs, the working directory is assumed to be the example_data sub-direcotry within the main install directory.

Required inputs:

Coding analysis:

A GTF file containing gene model information, matching the genome used to align DNA sequencing data. This file can be gzip-compressed.
A two-column tab-delimited text file (no header) listing paths to somatic VCF files in the first column and associated sample names in the second.

NOTE: For the coding module, a user can provide a MAF format file as input instead of the VCF files list. An example of such usage is described in the following sections.

Non-coding analysis:

A BED-formatted file listing the regions on which the analysis should be run. The first three BED columns are required: chromosome (should match contig names from VCFs), start (0-based), and end (1-based). Optionally, a 4th column providing a name/identifier for the region is also usable for the analysis. This file can be gzip-compressed.
A two-column tab-delimited text file (no header) listing paths to somatic VCF files in the first column and associated sample names in the second.

Output files:

See output file descriptions page for output file details.

Coding analysis:

[prefix]_gene_enrichments.txt
[prefix]_hotspot.txt
[prefix]_gene_hotspot_Fisher_enrichments.txt
[prefix]_gene_data.pkl
[prefix].log

Non-coding analysis:

[prefix]_region_WAP_enrichments.txt
[prefix]_hotspot.txt
[prefix]_region_WAP_hotspot_Fisher_enrichments.txt
[prefix]_region_data.pkl
[prefix].log

Statistical testing type

MutEnricher now allows users to choose between two methods for statistical testing using the --stat-type option (see methods page). The two options are:

nsamples (default, as of version 1.3.0). This options uses the binomial test to compute the significance of the number of samples containing somatic mutations.
nmutations. This option uses the negative binomial test to compute the significance of mutation counts within genes/regions.

The default binomial testing option is used in the below examples.

Coding analysis examples:

Annotating mutations:

For coding gene analyses, gene-based annotations indicating non-silent somatic mutations are required. Several bioinformatic tools can provide such annoations - we provide a tutorial for how this can be done with the popular ANNOVAR tool here. Pre-set options for several annotation tools are available through the --anno-type option: annovar-refgene, annovar-knownGene, or annovar-ensGene for ANNOVAR annotations; SnpEff for SnpEff annotations through the "ANN" INFO field; VEP for VEP annotations through the "CSQ" INFO field; illumina for Illumina annotation engine annotations through the "CSQT" INFO field.

Basic run using default “global” background mutation frequency calculation method:

By default, MutEnricher uses the implemented global method to compute per-gene background mutation rates. A simple example execution is:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
-o test_out_coding 
--prefix test_global

To see the top significantly mutated gene results on the command line (overall gene p-value only):

cat test_global_gene_enrichments.txt | cut -f1,3,10 | head

Gene    num_nonsilent   FDR_BH
TP53    93      1.77e-231
KRAS    29      2.86e-56
MYOF    7       0.228
RPS24   3       0.518
SPG21   3       0.518
CTSH    3       0.518
SSUH2   3       0.518
GJC2    3       0.518
PCDHGA1 4       0.518

Checking the top genes from the combined overall gene and hotspot enrichments output:

cat test_AP_precomp_gene_hotspot_Fisher_enrichments.txt | cut -f1,3,12 | head

Gene    num_nonsilent   Fisher_FDR
TP53    93      9.48e-305
KRAS    29      7e-186
MYOF    7       0.228
RPS24   3       0.518
SPG21   3       0.518
CTSH    3       0.518
SSUH2   3       0.518
GJC2    3       0.518
PCDHGA1 4       0.518

You can see the genes TP53 and KRAS at the top of the list with very significant associated FDR-corrected p-values. These are expected from the example files.

Basic run using "local" background mutation frequency calculation method:

To use the local background method, set the --use-local option in the MutEnricher call:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--anno-type nonsilent_terms.txt 
-o test_out_coding 
--prefix test_local 
--use-local

The output of this run on the example data should be very similar to the results with the global background method, but the numbers will differ slightly due to the different methodology.

Run using covariate clustering as the background method (through pre-computed clusters):

For the below run, use pre-computed gene covariate clustering results for the background frequency calculation method. We provide two pre-computed sets of clusters for coding genes contained in the supplied GTF annotation file: all genes (i.e. clustering performed on all ~19,000 genes contained) and by contig (i.e. clustering performed for all genes on each individual chromosome). NOTE: the affinity propagation code is not invoked when using pre-computed clusters.

To execute with pre-computed covariates, use the --precomputed-covars option, which accepts the path to the clustering results directory:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--precomputed-covars precomputed_apcluster/coding.ucsc.refFlat.20170829.no_chrMY/all_genes/apcluster_genes 
--anno-type nonsilent_terms.txt 
-o test_out_coding 
--prefix test_precompAP_all_genes

Again, the results from this run on the sample data will be similar, but differ in the per-gene background mutation rates and resulting p-values.

Run with covariate clustering enabled (enable affinity propagation)

To execute the analysis with covariate clustering enabled, provide the covariates table with the -c option. Optionally, the user can also supply covariate weights with the -w option. NOTE: the below example performs covariate clustering by contig (i.e. by chrmosome) using --by-contig; this option speeds the clustering calculations by performing multiple affinity propagation runs for genes on the same chromosome only. If this option is not set, the clustering analysis will be performed on all genes (~19,000 protein coding in the supplied GTF). With this latter option, the run time for the clustering analysis is considerably longer due to the large number of data points. Practically, the user may wish to restrict the number of genes considered with the -g option (e.g. for specific genes of interest or expressed genes).

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
-c covariates/ucsc.refFlat.20170829.no_chrMY.covariates.txt 
-w covariates/ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt 
--by-contig 
--anno-type nonsilent_terms.txt 
-o test_out_coding
--prefix test_APclust_byContig 
-p 10

Note in the above execution the number of processors is set to a larger value (i.e. -p 10) to run parallel operations.

Run with combined covariate clustering and local background method

As of MutEnricher version 1.3.0, a combined covariate clustering with local background rate method is available for WGS data. When enabled, per-sample and per-feature background rates are computed from covariate cluster members as above, though a wider local background around each feature is scanned to determine the background rate (using the local background procedure methods here). To enable this method, run with the covariate clustering method inputs along with the --use-local flag:

python ../mutEnricher.py coding 
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt 
--precomputed-covars precomputed_apcluster/coding.ucsc.refFlat.20170829.no_chrMY/all_genes/apcluster_genes
--use-local 
--anno-type nonsilent_terms.txt 
-o test_out_coding_with_local 
--prefix test_precompAP_all_genes

Non-coding analysis examples:

The general run scheme for non-coding analysis is similar to that of coding. A major difference is the requirement of an input BED file with non-coding regions of interest (as opposed to a GTF of gene models for coding analysis). In addition, as the non-coding analysis does not require definitions for silent versus non-silent mutations, no flag or input file is required for such definitions here.

Basic run using default “global” background mutation rate calculation method:

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
--prefix test_global

To view the top recurrently mutated non-coding regions:

cat test_global_region_WAP_enrichments.txt | cut -f2,3,12 | head

region_name     num_mutations   FDR_BH
TERT    38      9.08e-59
TP53_2,TP53_3   23      2.51e-29
ZSCAN5A_2,ZSCAN5A_1,ZSCAN5A_3   3       0.0245
LY6G6D  5       0.238
AGPAT3_2        4       0.248
ZFP92   5       0.248
PHACTR4_1       5       0.453
SDC3    5       0.478
CYP4X1_2        3       0.478

Note the TERT promoter enrichment is expected (and is highly significant). The TP53 result in this case is due to an overlapping definition of a TP53 exon with a definition for its promoter (due to varying gene models in the GTF).

Sample output from “hotspot” analysis:

cat test_global_hotspot.txt | cut -f1,3,10 | head

Hotpsot region_name     FDR_BH
chr17:7578555-7578555   TP53_2,TP53_3   2.72e-108
chr5:1295773-1296014    TERT    8.62e-30
chr5:1295124-1295288    TERT    5.56e-27
chr5:1295461-1295611    TERT    5.04e-15
chr19:56827049-56827058 ZSCAN5A_2,ZSCAN5A_1,ZSCAN5A_3   1.95e-09
chr5:1295688-1295698    TERT    1.95e-09
chr6:46049504-46049566  CLIC5_2 3.4e-07

Again, some significant sub-regions (“hotspots”) within the TERT promoter show up as significant.

Basic run using “local” background mutation rate calculation method:

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
--use-local 
--prefix test_local

Run with covariate clustering enabled (through pre-computed clusters):

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
--precomputed-covars precomputed_apcluster/noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions 
--prefix test_precompAP

Run with covariate clustering enabled (enable affinity propagation):

A subtle differences when running non-coding analysis with covariate clustering: The default behavior is to split the analysis by chromosome. This is generally useful, if not necessary, as the potential space of non-coding regions can be much greater than that of coding genes, depending on how the user defines such features.

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
-o test_out_noncoding 
-c covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariates.txt 
-w covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariate_weights.txt 
--prefix test_APclust 
-p 10

Run with combined covariate clustering and local background method

python ../mutEnricher.py noncoding 
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt 
--precomputed-covars precomputed_apcluster/noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions 
--use-local
-o test_out_noncoding_APclust_with_local 
--prefix test_precompAP

This method can be particularly useful for combating overly conservative p-values resulting from background estimates computed from short intervals.

Coding analysis from MAF input

MutEnricher can also perform coding analysis from Mutation Annotation Format (MAF) input files (specification). For example, the Broad Institute's GDAC Firehose database contains a wide array of somatic mutation data from various cancer studies in MAF format (see https://gdac.broadinstitute.org/).

Nearly all coding analysis options available for VCF inputs are also available for MAF; however, the local background method is not available as this requires the ability to scan neighboring regions in and outside of the regions of interest – these often are not included in MAF files.

An example run on a MAF using pre-computed covariate clustering results (e.g. using somatic mutations from GDAC Firehose for a glioblastoma study) is:

python ../mutEnricher.py coding annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz - 
--maf GBM-TP.final_analysis_set.maf 
--exome-only 
--min-clust-size 0 
--precomputed-covars precomputed_apcluster/coding.ucsc.refFlat.20170829.no_chrMY/all_genes/apcluster_genes 
-o test_out_maf 
--prefix GBM

A few notes about this run:

The MAF input is supplied with the --maf option.
Instead of the vcfs_list.txt input, a place-holder character (e.g. “-”) is used. With the --maf option set this input is ignored, but a value is required for run.
The --exome-only flag is set. This tells the program to only cosider exonic regions (as opposed to exonic plus intronic in normal analyses). This is useful when the input data derives from non whole genome sequencing data (e.g. whole exome sequencing).
Setting --min-clust-size to 0 is appropriate when using covariate clustering data as anything >0 can require local analysis, which is not available with MAF input.

During run, you may receive warnings similar to: “KLRC3 var chr12 10588555_C_C not in gene limits.” This generally results from mismatches between the gene coordinates described in the input GTF and those defined in the MAF. MutEnricher simply skips such variants. In addition, gene IDs present in the MAF that are not recognized in the input GTF are skipped and written to an output file called unrecognized_genes.txt.