Performance - PathwayAnalysisPlatform/PathwayMatcher GitHub Wiki
PathwayMatcher was benchmarked against different reference datasets covering multiple types of omics data.
Response time
In all cases, response time increases with query size until it reaches a plateau, which indicates near-complete coverage of the pathways for the given input type. As expected, protein identifiers give the lowest response time, completing in under a minute. Mapping peptides and genetic variants to proteins adds computational complexity, resulting in response times of approximately one and two minutes, respectively. Finally, proteoform matching, the most computationally demanding task, shows a response time that increases linearly up to 3.5 min.
Response time of PathwayMatcher using (A) proteins in blue, (B) proteoforms in green, (C) peptides in yellow, and (D) genetic variants in red. Response time in minutes is plotted against query size. The mean is displayed as a solid line and the 95% range as a ribbon.
Datasets
Proteins
- The human protein set from UniProt/Swiss-Prot, which is manually annotated and reviewed (release 2017_10).
- The list of all annotated proteins in Reactome version 63. Query for Neo4j:
MATCH (pe:PhysicalEntity)-[:referenceEntity]->(re:ReferenceEntity)
WHERE pe.speciesName = "Homo sapiens" AND re.databaseName = "UniProt"
RETURN DISTINCT (CASE WHEN size(re.variantIdentifier) > 0 THEN re.variantIdentifier ELSE re.identifier END) as proteinAccession
ORDER BY proteinAccession
Proteoforms
The list of all annotated proteoforms in Reactome. Query for Neo4j:
MATCH (pe:PhysicalEntity{speciesName:'Homo sapiens'})-[:referenceEntity]->(re:ReferenceEntity{databaseName:'UniProt'})
WITH DISTINCT pe, re
OPTIONAL MATCH (pe)-[:hasModifiedResidue]->(tm:TranslationalModification)-[:psiMod]->(mod:PsiMod)
WITH DISTINCT pe.stId AS physicalEntity,
re.identifier AS protein,
re.variantIdentifier AS isoform,
tm.coordinate as coordinate,
mod.identifier as type ORDER BY type, coordinate
WITH DISTINCT
physicalEntity,
CASE WHEN isoform IS NOT NULL THEN isoform ELSE protein END as UniProtAcc,
COLLECT(type + ":" + CASE WHEN coordinate IS NOT NULL THEN coordinate ELSE "null" END ) AS ptms
RETURN DISTINCT UniProtAcc, ptms
ORDER BY UniProtAcc
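To make the aggregation in the proteoform query easier to follow, here is a minimal Python sketch that mirrors its steps on rows already fetched from the database. The function name `collect_proteoforms`, the tuple layout, and the identifiers in the usage example are illustrative assumptions, not part of PathwayMatcher or Reactome.

```python
from collections import defaultdict


def collect_proteoforms(rows):
    """Mirror the Cypher aggregation on in-memory rows.

    rows: iterable of (physical_entity, protein, isoform, coordinate, mod_type)
    tuples, where isoform, coordinate and mod_type may be None.
    """
    # WITH DISTINCT ... ORDER BY type, coordinate
    # (None is sorted last, like null in an ascending Cypher ORDER BY).
    unique_rows = sorted(
        set(rows),
        key=lambda r: (r[4] is None, r[4] or "", r[3] is None, r[3] or 0),
    )

    # Group by (physical entity, accession) and collect "type:coordinate".
    grouped = defaultdict(list)
    for entity, protein, isoform, coordinate, mod_type in unique_rows:
        accession = isoform if isoform is not None else protein
        ptms = grouped[(entity, accession)]
        # In Cypher, null + string is null and COLLECT skips nulls,
        # so entities without modifications end up with an empty list.
        if mod_type is not None:
            site = coordinate if coordinate is not None else "null"
            ptms.append(f"{mod_type}:{site}")

    # RETURN DISTINCT UniProtAcc, ptms ORDER BY UniProtAcc
    distinct = {(acc, tuple(ptms)) for (_, acc), ptms in grouped.items()}
    return sorted(distinct)


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    example = [
        ("R-HSA-1", "P04637", None, 15, "00046"),
        ("R-HSA-1", "P04637", None, None, "00047"),
        ("R-HSA-2", "P04637", "P04637-2", None, None),
    ]
    for accession, ptms in collect_proteoforms(example):
        print(accession, list(ptms))
```

Two physical entities that carry the same accession and the same set of modifications collapse into a single proteoform, which is what the final `RETURN DISTINCT` achieves in the query.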
Peptides
- Proteotypic Peptide Set from ProteomeTools, available from the ProteomeXchange Consortium via the PRIDE repository (PXD004732, release date 01/23/2017). It includes 139,797 non-redundant peptides.
- 'Missing Gene' Set, a collection of 141,601 non-redundant peptides. It includes all unique tryptic peptides between 7 and 30 amino acids in length for canonical gene products lacking confident protein-level identification in ProteomicsDB.org. The set comprises all the files designated as “TUM_second_pool” with the ".zip" type.
- 'SRMAtlas' Set, the SRMAtlas collection of 81,497 non-redundant peptides. The set comprises all the files designated as “SRMAtlas” with the ".zip" type.
Each compressed file contains a text file peptides.txt with a list of peptides. The utility class used to gather the peptides from all the files is no.uib.pathwaymatcher.tools.ProteomeTools_PTPListExtractor. You need to download all the files locally and specify the location in the class.
In total there are 333,784 non-redundant peptides in the reference list used for sampling.
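The merging step described above can be sketched in Python as follows. This is not the ProteomeTools_PTPListExtractor class itself, only an illustration of the idea. It assumes that each peptides.txt is tab-separated with a header line and the peptide sequence in the first column; the function name `collect_peptides` and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def collect_peptides(archive_dir, member_name="peptides.txt", skip_header=True):
    """Gather the non-redundant peptides from every .zip file in archive_dir."""
    peptides = set()  # a set keeps each sequence only once
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                # Accept peptides.txt at any depth inside the archive.
                if Path(name).name != member_name:
                    continue
                with zf.open(name) as handle:
                    lines = handle.read().decode("utf-8").splitlines()
                for line in lines[1 if skip_header else 0:]:
                    # Assumption: the sequence is the first tab-separated field.
                    sequence = line.split("\t")[0].strip()
                    if sequence:
                        peptides.add(sequence)
    return peptides
```

Because the three sets overlap, the size of the merged set is smaller than the sum of the individual set sizes, which is consistent with the total reported above.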
Genetic Variants
Variants from the human assembly GRCh37.p13.
Files
- Cypher queries: Statistics
- Reactions and pathways mapped per protein: HitsPerProtein.csv
- Reactions and pathways mapped by each proteoform: HitsPerProteoform.csv
- Reactions and pathways per protein/proteoform: plotHits_v2.R
- Performance times MISSING RESOURCE: times.csv
- Performance plots: makePerformancePlots.R
Generate times
To generate the times, use the class PathwayMatcherSpeedTest inside PathwayMatcher.jar. To execute the class, use the command:
java -cp PathwayMatcher.jar no.uib.pap.pathwaymatcher.PathwayMatcherSpeedTest <parameters_file>
The parameters file specifies the number of repetitions and the sample sizes for each data type. The parameters file used is located at <PathwayMatcher_home>/resources/input/tests/. An example is:
REPETITIONS 3
SAMPLE_SETS 30
WARMUP_OFFSET 1
ALL_PEPTIDES resources/input/Peptides/AllPeptides.csv
ALL_PROTEINS resources/input/Proteins/UniProt/uniprot-all.list
ALL_PROTEOFORMS resources/input/ReactomeAllProteoformsSimple.csv
ALL_SNPS extra/SampleDatasets/GeneticVariants/MoBa.csv
PROTEIN_SIZES 1 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
PROTEOFORM_SIZES 1 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
PEPTIDE_SIZES 1 20000 40000 60000 80000 100000 120000 140000 160000 180000 200000
SNPS_SIZES 200000 600000 100000 140000 1800000
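The layout of the parameters file, one key per line followed by whitespace-separated values, can be read with a few lines of Python. This is a minimal sketch for inspecting such a file, not the parser used by PathwayMatcherSpeedTest. The function name `parse_parameters` is hypothetical, and the typing rules (keys ending in `_SIZES` hold integer lists, keys starting with `ALL_` hold a path, everything else a single integer) are assumptions inferred from the example.

```python
def parse_parameters(text):
    """Parse 'KEY value [value ...]' lines into a dictionary."""
    params = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        key, values = tokens[0], tokens[1:]
        if key.endswith("_SIZES"):
            # List of query sizes for one data type.
            params[key] = [int(value) for value in values]
        elif key.startswith("ALL_"):
            # Path to the reference list used for sampling.
            params[key] = " ".join(values)
        elif values:
            # Single integer setting such as REPETITIONS.
            params[key] = int(values[0])
    return params


if __name__ == "__main__":
    example = """REPETITIONS 3
SAMPLE_SETS 30
ALL_PROTEINS resources/input/Proteins/UniProt/uniprot-all.list
PROTEIN_SIZES 1 2000 4000
"""
    print(parse_parameters(example))
```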
PathwayMatcherSpeedTest will generate a file called times.csv. It looks like this:
Type,Sample,Size,ms,Repetition
UNIPROT,0,1,2166.983,1
UNIPROT,0,1,1197.578,2
UNIPROT,0,1,1536.749,3
...
PROTEOFORMS,9,20000,1783.052,2
PROTEOFORMS,9,20000,1722.111,3
PROTEOFORMS,9,20000,1797.966,4
PROTEOFORMS,9,20000,1713.655,5
...
PEPTIDES,18,200000,27280.150,4
PEPTIDES,18,200000,26605.677,5
PEPTIDES,19,1,22840.945,1
PEPTIDES,19,1,22743.989,2
...
RSIDS,0,1800000,55313.708,3
RSIDS,1,200000,22924.123,1
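The repository's makePerformancePlots.R handles the plotting; the Python sketch below only illustrates how the mean and a 95% range per data type and query size could be derived from a file with the columns shown above. It assumes that the "95% range" means the 2.5th to 97.5th percentiles, which the source does not state explicitly, and the function name `summarize_times` is hypothetical. The `...` lines in the excerpt are elision marks and must not be present in the input.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_times(csv_text):
    """Return {(type, size): (mean_ms, low_ms, high_ms)} from times.csv content."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["Type"], int(row["Size"]))].append(float(row["ms"]))

    summary = {}
    for key, times in sorted(groups.items()):
        mean = statistics.mean(times)
        if len(times) > 1:
            # n=40 yields cut points every 2.5%; the first and last ones
            # bound the central 95% of the measurements.
            cuts = statistics.quantiles(times, n=40, method="inclusive")
            low, high = cuts[0], cuts[-1]
        else:
            low = high = times[0]
        summary[key] = (mean, low, high)
    return summary


if __name__ == "__main__":
    sample = """Type,Sample,Size,ms,Repetition
UNIPROT,0,1,2166.983,1
UNIPROT,0,1,1197.578,2
UNIPROT,0,1,1536.749,3
"""
    for (data_type, size), (mean, low, high) in summarize_times(sample).items():
        # Convert milliseconds to minutes to match the axis of the figure.
        print(data_type, size, mean / 60000, low / 60000, high / 60000)
```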