Output file descriptions - asoltis/MutEnricher GitHub Wiki
MutEnricher Output File Descriptions
Last updated: October 29, 2019
Contents:
Coding analysis output files
1. [prefix]_gene_enrichments.txt
This text file contains the overall gene enrichment results determined by MutEnricher.
Columns:
(1) Gene: Gene name from GTF.
(2) coordinates: Genomic coordinates of gene, from first to last annotated exon.
(3) num_nonsilent: Total non-silent mutations in gene across samples.
(4) num_bg: Total silent mutations identified within gene coordinates in samples.
(5) full_length: Total gene length in basepairs (corresponding to (2)).
(6) coding_length: Total length of gene coding domains (e.g. sum of CDS regions in GTF).
(7) bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
(8) bg_prob: Gene background mutation rate used in negative binomial tests.
(9) gene_pval: Raw p-value of negative binomial test for gene.
(10) FDR_BH: Benjamini-Hochberg FDR-corrected p-value for gene.
(11) num_samples: Number of samples possessing a non-silent somatic mutation in gene.
(12) nonsilent_position_counts: Semi-colon-separated list of genomic positions containing non-silent mutations along with counts; in format [position]_[count].
(13) nonsilent_mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
(14) samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
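The position-count column is easier to work with once split into structured values. The following is a minimal sketch that assumes only the `[position]_[count]` format described above; the function name `parse_position_counts` and the example string are illustrative and are not part of MutEnricher. Positions are kept as strings because the exact form of a position value is not specified here.

```python
def parse_position_counts(field):
    """Parse a semicolon-separated '[position]_[count]' list into a dict."""
    counts = {}
    for item in field.strip().split(';'):
        item = item.strip()
        if not item:
            continue  # tolerate empty fields or trailing separators
        # Split on the last underscore so the count is always the final piece
        position, count = item.rsplit('_', 1)
        counts[position.strip()] = int(count)
    return counts

# Hypothetical field value, for illustration only
example = '25398284_12;25398285_3'
for position, count in parse_position_counts(example).items():
    print(position, count)
```

Splitting on the last underscore keeps the parser tolerant of position values that themselves contain an underscore.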
2. [prefix]_hotspot.txt
This text file contains the results of the hotspot enrichment procedure.
Columns:
(1) Gene: Gene name from GTF.
(2) hotspot: Genomic coordinates of tested hotspot.
(3) num_mutations: Number of non-silent somatic mutations considered in hotspot test.
(4) hotspot_length: Length of hotspot window.
(5) effective_length: Length of hotspot window adjusted for cohort size (i.e. hotspot length times number of samples).
(6) bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
(7) bg_prob: Background mutation rate used in negative binomial test for hotspot.
(8) pval: Raw p-value of negative binomial test for hotspot.
(9) FDR_BH: Benjamini-Hochberg FDR-corrected p-value for hotspot.
(10) num_samples: Number of samples possessing a non-silent somatic mutation in hotspot window.
(11) position_counts: Semi-colon-separated list of genomic positions in hotspot containing non-silent mutations, including counts; in format [position]_[count].
(12) mutation_counts: Semi-colon-separated list of genomic positions in hotspot, including base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
(13) samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
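For readers unfamiliar with the FDR_BH column, the standard Benjamini-Hochberg adjustment can be sketched as below. This is a generic textbook implementation written for illustration; it is not taken from the MutEnricher source, and whether MutEnricher computes its adjusted values in exactly this way is an assumption. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, for illustration only
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value.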
3. [prefix]_gene_hotspot_Fisher_enrichments.txt
This text file contains combined significance results for the overall gene region (1 above) and candidate hotspots (if found, 2 above) using Fisher's method.
Columns:
(1) Gene: Gene name from GTF.
(2) coordinates: Genomic coordinates of gene, from first to last annotated exon.
(3) num_nonsilent: Total non-silent mutations in gene across samples.
(4) num_bg: Total silent mutations identified within gene coordinates in samples.
(5) full_length: Total gene length in basepairs (corresponding to (2)).
(6) coding_length: Total length of gene coding domains (e.g. sum of CDS regions in GTF).
(7) bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
(8) bg_prob: Gene background mutation rate used in negative binomial tests.
(9) gene_pval: Raw p-value of negative binomial test for gene.
(10) hotspot_pvals: Semi-colon-separated list of p-values associated with identified gene hotspots (NA if no hotspots found).
(11) Fisher_pval: Fisher combined p-value of (9) and (10) values.
(12) Fisher_FDR: Benjamini-Hochberg FDR-corrected Fisher p-value.
(13) num_samples: Number of samples possessing a non-silent somatic mutation in gene.
(14) nonsilent_position_counts: Semi-colon-separated list of genomic positions containing non-silent mutations along with counts; in format [position]_[count].
(15) nonsilent_mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
(16) samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
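Fisher's method combines k independent p-values through the statistic X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null. The sketch below implements that textbook procedure using only the standard library. It assumes that each hotspot p-value enters the combination as its own term and that an `NA` entry means only the gene p-value is used; neither detail is confirmed here, and the function names and example values are hypothetical.

```python
import math

def fisher_combined_pvalue(pvals):
    """Combine independent p-values with Fisher's method."""
    k = len(pvals)
    if k == 0:
        raise ValueError('need at least one p-value')
    if any(p <= 0.0 for p in pvals):
        return 0.0  # ln(0) is undefined; a zero p-value forces the result to 0
    # half_x is X / 2, where X = -2 * sum(ln p_i)
    half_x = -sum(math.log(p) for p in pvals)
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

def parse_hotspot_pvals(field):
    """Parse the hotspot_pvals column; 'NA' means no hotspots were found."""
    if field.strip() == 'NA':
        return []
    return [float(p) for p in field.split(';') if p.strip()]

gene_pval = 0.01                              # hypothetical value
hotspots = parse_hotspot_pvals('0.002;0.05')  # hypothetical value
print(fisher_combined_pvalue([gene_pval] + hotspots))
```

With a single p-value the combination returns that p-value unchanged, so under the stated assumption a gene without hotspots keeps its original gene_pval.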
4. [prefix]_gene_data.pkl
This is a Python pickle file containing the mutation data and calculations used in the enrichment analysis. The file contains a Python list of Gene objects, as defined in the coding analysis code. Users interested in inspecting this information can load the file in Python with:

    # In Python 2
    import sys, cPickle
    sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
    sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well
    with open('/path/to/output/example_gene_data.pkl', 'rb') as f:
        genes = cPickle.load(f) # Load gene data pickle file

    # In Python 3
    import sys, pickle
    sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
    sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well
    with open('/path/to/output/example_gene_data.pkl', 'rb') as f:
        genes = pickle.load(f) # Load gene data pickle file
To find the information for a particular gene, search the list by name:

    gene_of_interest = 'KRAS'
    index = None
    for g in genes:
        if g.name == gene_of_interest:
            index = g.index
            break
    kras = genes[index] # Fails if the gene was not found (index is still None)

The above code extracts the Gene object for the gene KRAS. The user can now inspect the internal information associated with this gene.
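When many genes need to be looked up, a name-keyed dictionary avoids repeating the loop. The sketch below uses a stand-in class, `FakeGene`, so that it runs without a MutEnricher installation; the only thing it assumes about the real Gene class is the `name` attribute used above. The helper `index_by_name` is hypothetical.

```python
class FakeGene(object):
    """Stand-in for MutEnricher's Gene class; only `name` is assumed."""
    def __init__(self, name):
        self.name = name

def index_by_name(objects):
    """Map each object's name attribute to the object itself."""
    return {obj.name: obj for obj in objects}

# In practice this list would be the one loaded from the pickle file
genes = [FakeGene('TP53'), FakeGene('KRAS')]
by_name = index_by_name(genes)

kras = by_name.get('KRAS')  # returns None if the gene is absent
print(kras.name if kras is not None else 'not found')
```

Using `dict.get` returns `None` for a missing gene instead of raising an error, which makes the not-found case explicit.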
5. [prefix].log
Text file containing run information, including MutEnricher version, input files, optional parameter values, and notes about the number of genes/hotspots tested.
Non-coding analysis output files
1. [prefix]_region_WAP_enrichments.txt
This text file contains the combined enrichment results for the overall region (from the negative binomial enrichment procedure) and the weighted average proximity (WAP) clustering procedure. P-values are combined with Fisher's method.
Columns:
(1) Region: Genomic coordinates of region (from input BED file).
(2) region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
(3) num_mutations: Total number of somatic mutations in region across samples.
(4) length: Length of region in basepairs.
(5) effective_length: Length of region multiplied by number of samples.
(6) bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
(7) bg_prob: Region background mutation rate used in negative binomial tests.
(8) region_pval: Raw p-value from negative binomial test of region.
(9) WAP: Statistic from weighted average proximity procedure performed on region.
(10) WAP_pval: Permutation p-value of WAP procedure.
(11) Fisher_pval: Fisher combined p-value of (8) and (10) values.
(12) FDR_BH: Benjamini-Hochberg FDR-corrected Fisher p-value.
(13) num_samples: Number of samples possessing a somatic mutation in region.
(14) position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
(15) mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
(16) samples: Semi-colon-separated list of sample IDs containing somatic mutations in region.
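A common first step with this file is to keep only the regions that pass an FDR threshold. The sketch below (Python 3) assumes the file is tab-delimited with a header row carrying the column names listed above; the delimiter is an assumption, so adjust it if the actual output differs. The function name `significant_rows`, the 0.1 threshold, and the toy rows are all made up for illustration.

```python
import csv
import io

def significant_rows(handle, fdr_column='FDR_BH', threshold=0.1):
    """Return rows whose FDR value is at or below the threshold."""
    reader = csv.DictReader(handle, delimiter='\t')  # assumed tab-delimited
    kept = []
    for row in reader:
        try:
            fdr = float(row[fdr_column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric FDR value
        if fdr <= threshold:
            kept.append(row)
    return kept

# Toy stand-in for a results file with only three of the columns
toy = io.StringIO(
    'Region\tregion_name\tFDR_BH\n'
    'chr1:100-200\tregionA\t0.03\n'
    'chr2:500-900\tregionB\t0.6\n'
)
for row in significant_rows(toy):
    print(row['region_name'], row['FDR_BH'])
```

For a real file, replace the `io.StringIO` object with an opened file handle, for example `open(path, newline='')`.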
2. [prefix]_hotspot.txt
This text file contains the results of the hotspot enrichment procedure using negative binomial tests.
Columns:
(1) Hotspot: Genomic coordinates of hotspot.
(2) region: Genomic coordinates of full region associated with hotspot.
(3) region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
(4) num_mutations: Total number of somatic mutations in region across samples.
(5) hotspot_length: Length of hotspot window.
(6) effective_length: Length of hotspot window adjusted for cohort size (i.e. hotspot length times number of samples).
(7) bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
(8) bg_prob: Hotspot background mutation rate used in negative binomial tests.
(9) pval: Raw p-value of negative binomial test for hotspot.
(10) FDR_BH: Benjamini-Hochberg FDR-corrected p-value for hotspot.
(11) num_samples: Number of samples possessing a somatic mutation in hotspot.
(12) position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
(13) mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
(14) samples: Semi-colon-separated list of sample IDs containing somatic mutations in hotspot.
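The base-alteration column carries four underscore-separated pieces per entry: position, reference base(s), alternate base(s), and count. The sketch below parses that format into named records. It assumes that none of the four pieces contains an underscore; the names `MutationCount` and `parse_mutation_counts` and the example string are hypothetical.

```python
from collections import namedtuple

MutationCount = namedtuple('MutationCount', ['position', 'ref', 'alt', 'count'])

def parse_mutation_counts(field):
    """Parse '[position]_[ref]_[alt]_[count]' entries separated by semicolons."""
    records = []
    for item in field.split(';'):
        if not item.strip():
            continue  # tolerate empty fields or trailing separators
        parts = [part.strip() for part in item.split('_')]
        if len(parts) != 4:
            raise ValueError('unexpected entry: %r' % item)
        position, ref, alt, count = parts
        records.append(MutationCount(position, ref, alt, int(count)))
    return records

# Hypothetical field value, for illustration only
for rec in parse_mutation_counts('1295228_G_A_5;1295250_G_A_2'):
    print(rec.position, rec.ref, rec.alt, rec.count)
```

Raising on entries that do not split into exactly four pieces surfaces format surprises early instead of silently misassigning fields.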
3. [prefix]_region_WAP_hotspot_Fisher_enrichments.txt
This text file contains combined significance results for the overall region (1 above) and candidate hotspots (if found, 2 above) using Fisher's method.
Columns:
(1) Region: Genomic coordinates of region (from input BED file).
(2) region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
(3) num_mutations: Total number of somatic mutations in region across samples.
(4) length: Length of region in basepairs.
(5) effective_length: Length of region multiplied by number of samples.
(6) bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
(7) bg_prob: Region background mutation rate used in negative binomial tests.
(8) region_pval: Raw p-value from negative binomial test of region.
(9) WAP: Statistic from weighted average proximity procedure performed on region.
(10) WAP_pval: Permutation p-value of WAP procedure.
(11) hotspot_pvals: Semi-colon-separated list of p-values associated with identified hotspots (NA if no hotspots found).
(12) Fisher_pval: Fisher combined p-value of values (8), (10), and (11).
(13) Fisher_FDR: Benjamini-Hochberg FDR-corrected Fisher p-value.
(14) num_samples: Number of samples possessing a somatic mutation in region.
(15) position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
(16) mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
(17) samples: Semi-colon-separated list of sample IDs containing somatic mutations in region.
4. [prefix]_region_data.pkl
This is a Python pickle file containing the mutation data and calculations used in the enrichment analysis. The file contains a Python list of Region objects, as defined in the non-coding analysis code. Users interested in inspecting this information can load the file in Python with:

    # In Python 2
    import sys, cPickle
    sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
    sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well
    with open('/path/to/output/example_region_data.pkl', 'rb') as f:
        regions = cPickle.load(f) # Load region data pickle file

    # In Python 3
    import sys, pickle
    sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
    sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well
    with open('/path/to/output/example_region_data.pkl', 'rb') as f:
        regions = pickle.load(f) # Load region data pickle file
To find the information for a particular region, search the list by name:

    region_of_interest = 'chr5:1295773-1296014'
    index = None
    for r in regions:
        if r.name == region_of_interest:
            index = r.index
            break
    reg = regions[index] # Fails if the region was not found (index is still None)

The above code extracts the Region object for the defined region. The user can now inspect the internal information associated with this non-coding region.
5. [prefix].log
Text file containing run information, including MutEnricher version, input files, optional parameter values, and notes about the number of regions/hotspots tested.