Quickstart guide - asoltis/MutEnricher GitHub Wiki
MutEnricher Quickstart Guide
Last updated: June 15, 2021
Contents:
- Requirements and Installation
- Required inputs
- Output files
- Statistical testing type
- Coding analysis examples
- Non-coding analysis examples
- Coding analysis from MAF input
Requirements and Installation:
See installation guide.
Usage
NOTE: For the below example runs, the working directory is assumed to be the example_data
sub-direcotry within the main install directory.
Required inputs:
Coding analysis:
- A GTF file containing gene model information, matching the genome used to align DNA sequencing data. This file can be gzip-compressed.
- A two-column tab-delimited text file (no header) listing paths to somatic VCF files in the first column and associated sample names in the second.
NOTE: For the coding
module, a user can provide a MAF format file as input instead of the VCF files list. An example of such usage is described in the following sections.
Non-coding analysis:
- A BED-formatted file listing the regions on which the analysis should be run. The first three BED columns are required: chromosome (should match contig names from VCFs), start (0-based), and end (1-based). Optionally, a 4th column providing a name/identifier for the region is also usable for the analysis. This file can be gzip-compressed.
- A two-column tab-delimited text file (no header) listing paths to somatic VCF files in the first column and associated sample names in the second.
Output files:
See output file descriptions page for output file details.
Coding analysis:
- [prefix]_gene_enrichments.txt
- [prefix]_hotspot.txt
- [prefix]_gene_hotspot_Fisher_enrichments.txt
- [prefix]_gene_data.pkl
- [prefix].log
Non-coding analysis:
- [prefix]_region_WAP_enrichments.txt
- [prefix]_hotspot.txt
- [prefix]_region_WAP_hotspot_Fisher_enrichments.txt
- [prefix]_region_data.pkl
- [prefix].log
Statistical testing type
MutEnricher now allows users to choose between two methods for statistical testing using the --stat-type
option (see methods page). The two options are:
nsamples
(default, as of version 1.3.0). This options uses the binomial test to compute the significance of the number of samples containing somatic mutations.nmutations
. This option uses the negative binomial test to compute the significance of mutation counts within genes/regions.
The default binomial testing option is used in the below examples.
Coding analysis examples:
Annotating mutations:
For coding gene analyses, gene-based annotations indicating non-silent somatic mutations are required. Several bioinformatic tools can provide such annoations - we provide a tutorial for how this can be done with the popular ANNOVAR tool here. Pre-set options for several annotation tools are available through the --anno-type
option: annovar-refgene
, annovar-knownGene
, or annovar-ensGene
for ANNOVAR annotations; SnpEff
for SnpEff annotations through the "ANN" INFO field; VEP
for VEP annotations through the "CSQ" INFO field; illumina
for Illumina annotation engine annotations through the "CSQT" INFO field.
Basic run using default “global” background mutation frequency calculation method:
By default, MutEnricher uses the implemented global method to compute per-gene background mutation rates. A simple example execution is:
python ../mutEnricher.py coding
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt
--anno-type nonsilent_terms.txt
-o test_out_coding
--prefix test_global
To see the top significantly mutated gene results on the command line (overall gene p-value only):
cat test_global_gene_enrichments.txt | cut -f1,3,10 | head
Gene num_nonsilent FDR_BH
TP53 93 1.77e-231
KRAS 29 2.86e-56
MYOF 7 0.228
RPS24 3 0.518
SPG21 3 0.518
CTSH 3 0.518
SSUH2 3 0.518
GJC2 3 0.518
PCDHGA1 4 0.518
Checking the top genes from the combined overall gene and hotspot enrichments output:
cat test_AP_precomp_gene_hotspot_Fisher_enrichments.txt | cut -f1,3,12 | head
Gene num_nonsilent Fisher_FDR
TP53 93 9.48e-305
KRAS 29 7e-186
MYOF 7 0.228
RPS24 3 0.518
SPG21 3 0.518
CTSH 3 0.518
SSUH2 3 0.518
GJC2 3 0.518
PCDHGA1 4 0.518
You can see the genes TP53 and KRAS at the top of the list with very significant associated FDR-corrected p-values. These are expected from the example files.
Basic run using "local" background mutation frequency calculation method:
To use the local background method, set the --use-local
option in the MutEnricher call:
python ../mutEnricher.py coding
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt
--anno-type nonsilent_terms.txt
-o test_out_coding
--prefix test_local
--use-local
The output of this run on the example data should be very similar to the results with the global background method, but the numbers will differ slightly due to the different methodology.
Run using covariate clustering as the background method (through pre-computed clusters):
For the below run, use pre-computed gene covariate clustering results for the background frequency calculation method. We provide two pre-computed sets of clusters for coding genes contained in the supplied GTF annotation file: all genes (i.e. clustering performed on all ~19,000 genes contained) and by contig (i.e. clustering performed for all genes on each individual chromosome). NOTE: the affinity propagation code is not invoked when using pre-computed clusters.
To execute with pre-computed covariates, use the --precomputed-covars
option, which accepts the path to the clustering results directory:
python ../mutEnricher.py coding
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt
--precomputed-covars precomputed_apcluster/coding.ucsc.refFlat.20170829.no_chrMY/all_genes/apcluster_genes
--anno-type nonsilent_terms.txt
-o test_out_coding
--prefix test_precompAP_all_genes
Again, the results from this run on the sample data will be similar, but differ in the per-gene background mutation rates and resulting p-values.
Run with covariate clustering enabled (enable affinity propagation)
To execute the analysis with covariate clustering enabled, provide the covariates table with the -c
option. Optionally, the user can also supply covariate weights with the -w
option. NOTE: the below example performs covariate clustering by contig (i.e. by chrmosome) using --by-contig
; this option speeds the clustering calculations by performing multiple affinity propagation runs for genes on the same chromosome only. If this option is not set, the clustering analysis will be performed on all genes (~19,000 protein coding in the supplied GTF). With this latter option, the run time for the clustering analysis is considerably longer due to the large number of data points. Practically, the user may wish to restrict the number of genes considered with the -g
option (e.g. for specific genes of interest or expressed genes).
python ../mutEnricher.py coding
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt
-c covariates/ucsc.refFlat.20170829.no_chrMY.covariates.txt
-w covariates/ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt
--by-contig
--anno-type nonsilent_terms.txt
-o test_out_coding
--prefix test_APclust_byContig
-p 10
Note in the above execution the number of processors is set to a larger value (i.e. -p 10
) to run parallel operations.
Run with combined covariate clustering and local background method
As of MutEnricher version 1.3.0, a combined covariate clustering with local background rate method is available for WGS data. When enabled, per-sample and per-feature background rates are computed from covariate cluster members as above, though a wider local background around each feature is scanned to determine the background rate (using the local background procedure methods here). To enable this method, run with the covariate clustering method inputs along with the --use-local
flag:
python ../mutEnricher.py coding
annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt
--precomputed-covars precomputed_apcluster/coding.ucsc.refFlat.20170829.no_chrMY/all_genes/apcluster_genes
--use-local
--anno-type nonsilent_terms.txt
-o test_out_coding_with_local
--prefix test_precompAP_all_genes
Non-coding analysis examples:
The general run scheme for non-coding analysis is similar to that of coding. A major difference is the requirement of an input BED file with non-coding regions of interest (as opposed to a GTF of gene models for coding analysis). In addition, as the non-coding analysis does not require definitions for silent versus non-silent mutations, no flag or input file is required for such definitions here.
Basic run using default “global” background mutation rate calculation method:
python ../mutEnricher.py noncoding
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt
-o test_out_noncoding
--prefix test_global
To view the top recurrently mutated non-coding regions:
cat test_global_region_WAP_enrichments.txt | cut -f2,3,12 | head
region_name num_mutations FDR_BH
TERT 38 9.08e-59
TP53_2,TP53_3 23 2.51e-29
ZSCAN5A_2,ZSCAN5A_1,ZSCAN5A_3 3 0.0245
LY6G6D 5 0.238
AGPAT3_2 4 0.248
ZFP92 5 0.248
PHACTR4_1 5 0.453
SDC3 5 0.478
CYP4X1_2 3 0.478
Note the TERT promoter enrichment is expected (and is highly significant). The TP53 result in this case is due to an overlapping definition of a TP53 exon with a definition for its promoter (due to varying gene models in the GTF).
Sample output from “hotspot” analysis:
cat test_global_hotspot.txt | cut -f1,3,10 | head
Hotpsot region_name FDR_BH
chr17:7578555-7578555 TP53_2,TP53_3 2.72e-108
chr5:1295773-1296014 TERT 8.62e-30
chr5:1295124-1295288 TERT 5.56e-27
chr5:1295461-1295611 TERT 5.04e-15
chr19:56827049-56827058 ZSCAN5A_2,ZSCAN5A_1,ZSCAN5A_3 1.95e-09
chr5:1295688-1295698 TERT 1.95e-09
chr6:46049504-46049566 CLIC5_2 3.4e-07
Again, some significant sub-regions (“hotspots”) within the TERT promoter show up as significant.
Basic run using “local” background mutation rate calculation method:
python ../mutEnricher.py noncoding
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt
-o test_out_noncoding
--use-local
--prefix test_local
Run with covariate clustering enabled (through pre-computed clusters):
python ../mutEnricher.py noncoding
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt
-o test_out_noncoding
--precomputed-covars precomputed_apcluster/noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions
--prefix test_precompAP
Run with covariate clustering enabled (enable affinity propagation):
A subtle differences when running non-coding analysis with covariate clustering: The default behavior is to split the analysis by chromosome. This is generally useful, if not necessary, as the potential space of non-coding regions can be much greater than that of coding genes, depending on how the user defines such features.
python ../mutEnricher.py noncoding
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt
-o test_out_noncoding
-c covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariates.txt
-w covariates/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.covariate_weights.txt
--prefix test_APclust
-p 10
Run with combined covariate clustering and local background method
As of MutEnricher version 1.3.0, a combined covariate clustering with local background rate method is available for WGS data. When enabled, per-sample and per-feature background rates are computed from covariate cluster members as above, though a wider local background around each feature is scanned to determine the background rate (using the local background procedure methods here). To enable this method, run with the covariate clustering method inputs along with the --use-local
flag:
python ../mutEnricher.py noncoding
annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed vcf_files.txt
--precomputed-covars precomputed_apcluster/noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions
--use-local
-o test_out_noncoding_APclust_with_local
--prefix test_precompAP
This method can be particularly useful for combating overly conservative p-values resulting from background estimates computed from short intervals.
Coding analysis from MAF input
MutEnricher can also perform coding analysis from Mutation Annotation Format (MAF) input files (specification). For example, the Broad Institute's GDAC Firehose database contains a wide array of somatic mutation data from various cancer studies in MAF format (see https://gdac.broadinstitute.org/).
Nearly all coding analysis options available for VCF inputs are also available for MAF; however, the local background method is not available as this requires the ability to scan neighboring regions in and outside of the regions of interest – these often are not included in MAF files.
An example run on a MAF using pre-computed covariate clustering results (e.g. using somatic mutations from GDAC Firehose for a glioblastoma study) is:
python ../mutEnricher.py coding annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz -
--maf GBM-TP.final_analysis_set.maf
--exome-only
--min-clust-size 0
--precomputed-covars precomputed_apcluster/coding.ucsc.refFlat.20170829.no_chrMY/all_genes/apcluster_genes
-o test_out_maf
--prefix GBM
A few notes about this run:
- The MAF input is supplied with the
--maf
option. - Instead of the
vcfs_list.txt
input, a place-holder character (e.g. “-”) is used. With the--maf
option set this input is ignored, but a value is required for run. - The
--exome-only
flag is set. This tells the program to only cosider exonic regions (as opposed to exonic plus intronic in normal analyses). This is useful when the input data derives from non whole genome sequencing data (e.g. whole exome sequencing). - Setting
--min-clust-size
to 0 is appropriate when using covariate clustering data as anything >0 can require local analysis, which is not available with MAF input.
During run, you may receive warnings similar to: “KLRC3 var chr12 10588555_C_C not in gene limits.
” This generally results from mismatches between the gene coordinates described in the input GTF and those defined in the MAF. MutEnricher simply skips such variants. In addition, gene IDs present in the MAF that are not recognized in the input GTF are skipped and written to an output file called unrecognized_genes.txt
.