Output file descriptions - asoltis/MutEnricher GitHub Wiki

MutEnricher Output File Descriptions

Last updated: October 29, 2019

Coding analysis output files

1. [prefix]_gene_enrichments.txt

This text file contains the overall gene enrichment results determined by MutEnricher.

Columns:

Gene: Gene name from GTF.
coordinates: Genomic coordinates of gene, from first to last annotated exon.
num_nonsilent: Total non-silent mutations in gene across samples.
num_bg: Total silent mutations identified within gene coordinates in samples.
full_length: Total gene length in basepairs (corresponding to the coordinates in column 2).
coding_length: Total length of gene coding domains (e.g. sum of CDS regions in GTF).
bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
bg_prob: Gene background mutation rate used in negative binomial tests.
gene_pval: Raw p-value of negative binomial test for gene.
FDR_BH: Benjamini-Hochberg FDR-corrected p-value for gene.
num_samples: Number of samples possessing a non-silent somatic mutation in gene.
nonsilent_position_counts: Semi-colon-separated list of genomic positions containing non-silent mutations along with counts; in format [position]_[count].
nonsilent_mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
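
The two count columns are packed strings, so they usually need to be split before further analysis. The following is a minimal Python sketch, assuming entries follow the underscore-delimited formats described above. The helper names and the sample values are hypothetical and are not part of MutEnricher. Positions are kept as strings and entries are split from the right, so a position that itself contains an underscore would still parse.

```python
def parse_position_counts(field):
    """Split '[position]_[count];...' into a {position: count} dict."""
    counts = {}
    for entry in field.split(';'):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty fields or trailing semicolons
        position, count = [part.strip() for part in entry.rsplit('_', 1)]
        counts[position] = int(count)
    return counts


def parse_mutation_counts(field):
    """Split '[position]_[ref]_[alt]_[count];...' into a list of tuples."""
    records = []
    for entry in field.split(';'):
        entry = entry.strip()
        if not entry:
            continue
        position, ref, alt, count = [part.strip() for part in entry.rsplit('_', 3)]
        records.append((position, ref, alt, int(count)))
    return records


# Made-up values, for illustration only
print(parse_position_counts('25398284_12;25398285_3'))
print(parse_mutation_counts('25398284_C_T_7;25398284_C_A_5'))
```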

2. [prefix]_hotspot.txt

This text file contains the results of the hotspot enrichment procedure.

Columns:

Gene: Gene name from GTF.
hotspot: Genomic coordinates of tested hotspot.
num_mutations: Number of non-silent somatic mutations considered in hotspot test.
hotspot_length: Length of hotspot window.
effective_length: Length of hotspot window adjusted for cohort size (i.e. hotspot length times number of samples).
bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
bg_prob: Background mutation rate used in negative binomial test for hotspot.
pval: Raw p-value of negative binomial test for hotspot.
FDR_BH: Benjamini-Hochberg FDR-corrected p-value for hotspot.
num_samples: Number of samples possessing a non-silent somatic mutation in hotspot window.
position_counts: Semi-colon-separated list of genomic positions in hotspot containing non-silent mutations, including counts; in format [position]_[count].
mutation_counts: Semi-colon-separated list of genomic positions in hotspot, including base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
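
To pull out significant rows from a results table such as this one, something along the following lines may be enough. This is a sketch only: it assumes the file is tab-delimited with a single header line, which this page does not state explicitly, and the 0.05 threshold, the function name, and the example rows are all hypothetical.

```python
import csv
import io


def significant_rows(handle, fdr_column='FDR_BH', threshold=0.05):
    """Return rows (as dicts) whose FDR value is at or below the threshold."""
    reader = csv.DictReader(handle, delimiter='\t')  # assumes tab-delimited text
    hits = []
    for row in reader:
        try:
            fdr = float(row[fdr_column])
        except ValueError:
            continue  # skip non-numeric entries such as 'NA'
        if fdr <= threshold:
            hits.append(row)
    return hits


# Made-up table contents standing in for an open output file
example = io.StringIO(
    "Gene\tpval\tFDR_BH\n"
    "GENE_A\t1e-8\t2e-6\n"
    "GENE_B\t0.04\t0.2\n"
    "GENE_C\tNA\tNA\n"
)
hits = significant_rows(example)
print([row['Gene'] for row in hits])
```

For a real run, the `io.StringIO` object would be replaced by an opened output file, for example `open('/path/to/output/example_hotspot.txt')`.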

3. [prefix]_gene_hotspot_Fisher_enrichments.txt

This text file contains combined significance results for the overall gene region (1 above) and candidate hotspots (if found, 2 above) using Fisher's method.

Columns:

Gene: Gene name from GTF.
coordinates: Genomic coordinates of gene, from first to last annotated exon.
num_nonsilent: Total non-silent mutations in gene across samples.
num_bg: Total silent mutations identified within gene coordinates in samples.
full_length: Total gene length in basepairs (corresponding to the coordinates in column 2).
coding_length: Total length of gene coding domains (e.g. sum of CDS regions in GTF).
bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
bg_prob: Gene background mutation rate used in negative binomial tests.
gene_pval: Raw p-value of negative binomial test for gene.
hotspot_pvals: Semi-colon-separated list of p-values associated with identified gene hotspots (NA if no hotspots found).
Fisher_pval: Fisher combined p-value of the gene_pval (column 9) and hotspot_pvals (column 10) values.
Fisher_FDR: Benjamini-Hochberg FDR-corrected Fisher p-value.
num_samples: Number of samples possessing a non-silent somatic mutation in gene.
nonsilent_position_counts: Semi-colon-separated list of genomic positions containing non-silent mutations along with counts; in format [position]_[count].
nonsilent_mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
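
For readers unfamiliar with Fisher's method, the sketch below shows the textbook calculation: the statistic X = -2 * sum(ln p_i) is compared against a chi-square distribution with 2k degrees of freedom, where k is the number of p-values. It is not taken from the MutEnricher source. In particular, how MutEnricher treats multiple hotspot p-values or NA entries is not described on this page, so passing every hotspot p-value into the combination is an assumption here. Function names and input values are hypothetical.

```python
import math


def parse_hotspot_pvals(field):
    """Turn 'p1;p2;...' into a list of floats; 'NA' or empty means no hotspots."""
    field = field.strip()
    if field in ('', 'NA'):
        return []
    return [float(p) for p in field.split(';')]


def fisher_combined_pvalue(pvals):
    """Combine independent p-values with Fisher's method (textbook form)."""
    k = len(pvals)
    if k == 0:
        raise ValueError('need at least one p-value')
    if any(p <= 0.0 or p > 1.0 for p in pvals):
        raise ValueError('p-values must lie in (0, 1]')
    # half_stat is X / 2, where X = -2 * sum(ln p_i)
    half_stat = -sum(math.log(p) for p in pvals)
    # Chi-square survival function for even degrees of freedom 2k:
    # P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


gene_pval = 0.01                              # made-up value
hotspots = parse_hotspot_pvals('0.02;0.5')    # made-up value
print(fisher_combined_pvalue([gene_pval] + hotspots))
```

With a single p-value the function returns that p-value unchanged, which matches the case where no hotspots are found.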

4. [prefix]_gene_data.pkl

This is a Python pickle file containing the mutation data and calculations used in the enrichment analysis. The file contains a Python list of Gene class instances, as defined in the coding analysis code. Users interested in inspecting this information can load the file in Python with:

# In Python 2
import sys
import cPickle
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_gene_data.pkl', 'rb') as f: # Load gene data pickle file
    genes = cPickle.load(f)

# In Python 3
import sys
import pickle # The standard pickle module uses the C implementation (_pickle) automatically
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_gene_data.pkl', 'rb') as f: # Load gene data pickle file
    genes = pickle.load(f)

To find the information for a particular gene, search the list by gene name:

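gene_of_interest = 'KRAS'
kras = None
for g in genes:
    if g.name == gene_of_interest:
        kras = g # Keep the matching Gene object
        break
if kras is None: # Avoid silently continuing when the gene is absent
    raise ValueError('Gene not found: ' + gene_of_interest)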

The above code extracts the Gene object for the gene KRAS. The user can now observe internal information associated with this gene.

5. [prefix].log

Text file containing run information, including MutEnricher version, input files, optional parameter values, and notes about the number of genes/hotspots tested.

Non-coding analysis output files

1. [prefix]_region_WAP_enrichments.txt

This text file contains the combined enrichment results for the overall region (from the negative binomial enrichment procedure) and the weighted average proximity (WAP) clustering procedure. P-values are combined with Fisher's method.

Columns:

Region: Genomic coordinates of region (from input BED file).
region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
num_mutations: Total number of somatic mutations in region across samples.
length: Length of region in basepairs.
effective_length: Length of region multiplied by number of samples.
bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
bg_prob: Region background mutation rate used in negative binomial tests.
region_pval: Raw p-value from negative binomial test of region.
WAP: Statistic from weighted average proximity procedure performed on region.
WAP_pval: Permutation p-value of WAP procedure.
Fisher_pval: Fisher combined p-value of the region_pval (column 8) and WAP_pval (column 10) values.
FDR_BH: Benjamini-Hochberg FDR-corrected Fisher p-value.
num_samples: Number of samples possessing a somatic mutation in region.
position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
samples: Semi-colon-separated list of sample IDs containing somatic mutations in region.
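
The FDR_BH column holds Benjamini-Hochberg adjusted p-values. As a reference for how such values are typically derived, here is a sketch of the standard step-up adjustment. It is the generic procedure rather than MutEnricher's own code, and which set of tests MutEnricher corrects over is not stated on this page. The function name and the input values are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / float(rank))
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values, for illustration only
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by m / rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a bigger one.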

2. [prefix]_hotspot.txt

This text file contains the results of the hotspot enrichment procedure using negative binomial tests.

Columns:

Hotspot: Genomic coordinates of hotspot.
region: Genomic coordinates of full region associated with hotspot.
region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
num_mutations: Total number of somatic mutations in region across samples.
hotspot_length: Length of hotspot window.
effective_length: Length of hotspot window adjusted for cohort size (i.e. hotspot length times number of samples).
bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
bg_prob: Hotspot background mutation rate used in negative binomial tests.
pval: Raw p-value of negative binomial test for hotspot.
FDR_BH: Benjamini-Hochberg FDR-corrected p-value for hotspot.
num_samples: Number of samples possessing a somatic mutation in hotspot.
position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
samples: Semi-colon-separated list of sample IDs containing somatic mutations in hotspot.

3. [prefix]_region_WAP_hotspot_Fisher_enrichments.txt

This text file contains combined significance results for the overall region (1 above) and candidate hotspots (if found, 2 above) using Fisher's method.

Columns:

Region: Genomic coordinates of region (from input BED file).
region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
num_mutations: Total number of somatic mutations in region across samples.
length: Length of region in basepairs.
effective_length: Length of region multiplied by number of samples.
bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
bg_prob: Region background mutation rate used in negative binomial tests.
region_pval: Raw p-value from negative binomial test of region.
WAP: Statistic from weighted average proximity procedure performed on region.
WAP_pval: Permutation p-value of WAP procedure.
hotspot_pvals: Semi-colon-separated list of p-values associated with identified hotspots (NA if no hotspots found).
Fisher_pval: Fisher combined p-value of the region_pval (column 8), WAP_pval (column 10), and hotspot_pvals (column 11) values.
Fisher_FDR: Benjamini-Hochberg FDR-corrected Fisher p-value.
num_samples: Number of samples possessing a somatic mutation in region.
position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
samples: Semi-colon-separated list of sample IDs containing somatic mutations in region.

4. [prefix]_region_data.pkl

This is a Python pickle file containing the mutation data and calculations used in the enrichment analysis. The file contains a Python list of Region class instances, as defined in the non-coding analysis code. Users interested in inspecting this information can load the file in Python with:

# In Python 2
import sys
import cPickle
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_region_data.pkl', 'rb') as f: # Load region data pickle file
    regions = cPickle.load(f)

# In Python 3
import sys
import pickle # The standard pickle module uses the C implementation (_pickle) automatically
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_region_data.pkl', 'rb') as f: # Load region data pickle file
    regions = pickle.load(f)

To find the information for a particular region, search the list by region name:

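region_of_interest = 'chr5:1295773-1296014'
reg = None
for r in regions:
    if r.name == region_of_interest:
        reg = r # Keep the matching Region object
        break
if reg is None: # Avoid silently continuing when the region is absent
    raise ValueError('Region not found: ' + region_of_interest)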

The above code extracts the Region object for the defined region. The user can now observe internal information associated with this non-coding region.

5. [prefix].log

Text file containing run information, including MutEnricher version, input files, optional parameter values, and notes about the number of regions/hotspots tested.