Performance - PathwayAnalysisPlatform/PathwayMatcher GitHub Wiki

PathwayMatcher was benchmarked against different reference datasets covering multiple types of omics data.

Response time

In all cases, response time increases with query size until it reaches a plateau, which indicates near-complete coverage of the pathways for the given input type. As expected, protein identifiers give the lowest response time, completing in under a minute. Mapping peptides and genetic variants to proteins adds computational complexity, resulting in response times of approximately one and two minutes, respectively. Finally, proteoform matching, the most computationally demanding task, shows a response time that increases linearly until it reaches 3.5 min.

Response time of PathwayMatcher using (A) proteins in blue, (B) proteoforms in green, (C) peptides in yellow, and (D) genetic variants in red. Response time in minutes is plotted against query size. The mean is displayed as a solid line and the 95% range as a ribbon.

Datasets

Proteins

  • The human protein set from UniProt/Swiss-Prot, which is manually annotated and reviewed (release 2017_10).
  • The list of all annotated proteins in Reactome version 63. Query for Neo4j:
MATCH (pe:PhysicalEntity)-[:referenceEntity]->(re:ReferenceEntity)
WHERE pe.speciesName = "Homo sapiens" AND re.databaseName = "UniProt"
RETURN DISTINCT (CASE WHEN size(re.variantIdentifier) > 0 THEN re.variantIdentifier ELSE re.identifier END) AS proteinAccession
ORDER BY proteinAccession

Proteoforms

The list of all annotated proteoforms in Reactome. Query for Neo4j:

MATCH (pe:PhysicalEntity{speciesName:'Homo sapiens'})-[:referenceEntity]->(re:ReferenceEntity{databaseName:'UniProt'})
WITH DISTINCT pe, re
OPTIONAL MATCH (pe)-[:hasModifiedResidue]->(tm:TranslationalModification)-[:psiMod]->(mod:PsiMod)
WITH DISTINCT pe.stId AS physicalEntity,
              re.identifier AS protein,
              re.variantIdentifier AS isoform,
              tm.coordinate AS coordinate,
              mod.identifier AS type
ORDER BY type, coordinate
WITH DISTINCT physicalEntity,
              CASE WHEN isoform IS NOT NULL THEN isoform ELSE protein END AS UniProtAcc,
              COLLECT(type + ":" + CASE WHEN coordinate IS NOT NULL THEN coordinate ELSE "null" END) AS ptms
RETURN DISTINCT UniProtAcc, ptms
ORDER BY UniProtAcc

Peptides

  • Proteotypic Peptide Set from ProteomeTools, available from the ProteomeXchange Consortium via the PRIDE repository (PXD004732, release date 01/23/2017). It includes 139,797 non-redundant peptides.

  • 'Missing Gene' Set, a collection of 141,601 non-redundant peptides that includes all unique tryptic peptides between 7 and 30 amino acids in length for canonical gene products lacking confident protein-level identification in ProteomicsDB.org. The set comprises all the files designated as “TUM_second_pool” with the ".zip" type.

  • 'SRMAtlas' Set, the SRMAtlas collection of 81,497 non-redundant peptides. The set comprises all the files designated as “SRMAtlas” with the ".zip" type.

Each compressed file contains a text file, peptides.txt, with a list of peptides. The utility class used to gather the peptides from all the files is no.uib.pathwaymatcher.tools.ProteomeTools_PTPListExtractor. Download all the files locally and specify their location in the class.

In total there are 333,784 non-redundant peptides in the reference list used for sampling.
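The gathering step described above can be sketched in a few lines of Python. This is not the ProteomeTools_PTPListExtractor class itself, only an illustration of the idea: open every downloaded archive, read its peptides.txt, and keep each peptide once. The function name collect_peptides is hypothetical, and the sketch assumes that peptides.txt holds one peptide sequence per line; if the file is a table, the sequence column would have to be selected first.

```python
import zipfile
from pathlib import Path


def collect_peptides(archive_dir, member_name="peptides.txt"):
    """Return the set of distinct peptides found in all .zip files of a folder.

    Assumption: each peptides.txt lists one peptide sequence per line.
    """
    peptides = set()
    for archive_path in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            # Compare base names so archives with a top-level folder also match.
            members = [name for name in archive.namelist()
                       if name.rsplit("/", 1)[-1] == member_name]
            for name in members:
                with archive.open(name) as handle:
                    for raw_line in handle:
                        peptide = raw_line.decode("utf-8").strip()
                        if peptide:  # skip blank lines
                            peptides.add(peptide)
    return peptides
```

Calling collect_peptides on the download folder returns a Python set, so a peptide that appears in several archives is counted once, which is what makes the combined reference list non-redundant.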

Genetic Variants

Variants from the human assembly GRCh37.p13.

Files

Generate times

To generate the times, use the class PathwayMatcherSpeedTest inside PathwayMatcher.jar. Run it with the command:

java -cp PathwayMatcher.jar no.uib.pap.pathwaymatcher.PathwayMatcherSpeedTest <parameters_file>

The parameters file specifies the number of repetitions and the query sizes for each data type. The parameters file used is located at <PathwayMatcher_home>/resources/input/tests/. An example is:

REPETITIONS	3
SAMPLE_SETS	30
WARMUP_OFFSET	1
ALL_PEPTIDES	resources/input/Peptides/AllPeptides.csv
ALL_PROTEINS	resources/input/Proteins/UniProt/uniprot-all.list
ALL_PROTEOFORMS	resources/input/ReactomeAllProteoformsSimple.csv
ALL_SNPS	extra/SampleDatasets/GeneticVariants/MoBa.csv
PROTEIN_SIZES	1	2000	4000	6000	8000	10000	12000	14000	16000	18000	20000
PROTEOFORM_SIZES	1	2000	4000	6000	8000	10000	12000	14000	16000	18000	20000
PEPTIDE_SIZES	1	20000	40000	60000	80000	100000	120000	140000	160000	180000	200000
SNPS_SIZES	200000	600000	100000	140000	1800000
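To check a parameters file before starting a long benchmark, it can help to load it into a dictionary. The following is a minimal sketch and not the parser used by PathwayMatcherSpeedTest; the function name parse_parameters is hypothetical. It assumes that each line is a key followed by whitespace-separated values, that keys ending in _SIZES hold lists of integers, that keys starting with ALL_ hold a file path, and that every other key holds a single integer.

```python
def parse_parameters(text):
    """Parse 'KEY value [value ...]' lines into a dictionary.

    Assumptions (not taken from the PathwayMatcher source code):
    *_SIZES keys are integer lists, ALL_* keys are paths, the rest are integers.
    """
    parameters = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if not parts:
            continue  # skip empty lines
        key = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if key.endswith("_SIZES"):
            # Splitting on any whitespace accepts tabs and spaces alike.
            parameters[key] = [int(value) for value in rest.split()]
        elif key.startswith("ALL_"):
            parameters[key] = rest
        else:
            parameters[key] = int(rest)
    return parameters
```

For the example above, parse_parameters would give an integer for REPETITIONS, a path string for ALL_PROTEINS, and a list of eleven integers for PROTEIN_SIZES.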

The class generates a file called times.csv. It looks like this:

Type,Sample,Size,ms,Repetition
UNIPROT,0,1,2166.983,1
UNIPROT,0,1,1197.578,2
UNIPROT,0,1,1536.749,3
...
PROTEOFORMS,9,20000,1783.052,2
PROTEOFORMS,9,20000,1722.111,3
PROTEOFORMS,9,20000,1797.966,4
PROTEOFORMS,9,20000,1713.655,5
...
PEPTIDES,18,200000,27280.150,4
PEPTIDES,18,200000,26605.677,5
PEPTIDES,19,1,22840.945,1
PEPTIDES,19,1,22743.989,2
...
RSIDS,0,1800000,55313.708,3
RSIDS,1,200000,22924.123,1
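The mean and 95% range shown in the response time figure can be recomputed from times.csv. The sketch below is one possible way to do so and is not necessarily how the published figure was produced: it groups the rows by Type and Size, converts milliseconds to minutes, and takes the 2.5th and 97.5th percentiles with linear interpolation as the 95% range. The function names percentile and summarise_times are hypothetical, and the input is assumed to be the complete file, without the "..." lines used above to shorten the listing.

```python
import csv
from collections import defaultdict
from statistics import mean


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of a sorted, non-empty list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarise_times(lines):
    """Map (Type, Size) to (mean, 2.5th percentile, 97.5th percentile) in minutes."""
    groups = defaultdict(list)
    for row in csv.DictReader(lines):
        minutes = float(row["ms"]) / 60000.0  # milliseconds to minutes
        groups[(row["Type"], int(row["Size"]))].append(minutes)
    summary = {}
    for key, values in groups.items():
        values.sort()
        summary[key] = (mean(values),
                        percentile(values, 0.025),
                        percentile(values, 0.975))
    return summary
```

The function accepts any iterable of lines, so it can be called with a file opened via open("times.csv", newline=""). Each entry of the result then gives one point of the mean line and the lower and upper edge of the ribbon for that data type and query size.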