Proteoform matching
Proteoform matching is the process of deciding whether two proteoforms are equivalent to each other.
Matching types
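The matching types defined for PathwayMatcher are:
- strict
- superset
- superset_no_types
- subset
- subset_no_types
- one
- one_no_types
- accession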
The matching type is specified using the command line argument -m or --matchType with the desired matching type:
java -jar PathwayMatcher.jar match-proteoforms -m superset -i myFile.txt
java -jar PathwayMatcher.jar match-proteoforms -m strict -i myFile.txt
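When the tool is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of PathwayMatcher: the function name build_command is hypothetical, and the -r (range) and -o (output) flags are taken from the use case commands further down this page.

```python
from typing import List, Optional

# Match types accepted by -m, as listed on this page.
MATCH_TYPES = {
    "strict", "superset", "superset_no_types", "subset",
    "subset_no_types", "one", "one_no_types", "accession",
}


def build_command(input_file: str, match_type: str,
                  output_dir: Optional[str] = None,
                  margin: Optional[int] = None,
                  jar: str = "PathwayMatcher.jar") -> List[str]:
    """Assemble the argument list for a match-proteoforms run (hypothetical helper)."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"unknown match type: {match_type}")
    cmd = ["java", "-jar", jar, "match-proteoforms", "-m", match_type, "-i", input_file]
    if margin is not None:
        # The margin must be an unsigned integer.
        if margin < 0:
            raise ValueError("margin must not be negative")
        cmd += ["-r", str(margin)]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Building a list rather than a single string avoids shell quoting problems with file names.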
Superset
The set of input PTMs is a superset of the set of reference PTMs.
Command line argument: -m superset or -m superset_no_types
- The UniProt Accession is the same
- The Isoform is the same; either:
  - Both have an isoform specified. Ex: P31749-3
  - Both refer to the default one. Ex: P31749
- The PTMs:
  - The input contains ALL the reference PTMs or more (the input is a superset of, or equal to, the reference). Each reference PTM must have a matching input PTM. Some input PTMs might not have a matching reference PTM.
  - A PTM matches if these two requirements are both true:
    - The types match:
      - With superset, the types must be equal.
      - With superset_no_types, the type is not considered.
    - The coordinates match, which happens in any of these cases:
      - Both are known (positive integer) coordinates and are the same.
      - Both are known (positive integer) coordinates and they differ, but the absolute difference between the two coordinates is less than or equal to a user-defined margin ('range' option in command line).
      - At least one of the coordinates is unknown ("null", empty, "?", "-1").
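The superset rules above can be sketched as follows. This is an illustration under stated assumptions, not PathwayMatcher's actual implementation: the names Ptm, ptm_matches and superset_match are hypothetical, unknown coordinates are represented as None, and the sketch does not enforce a one-to-one pairing between input and reference PTMs because the rules above do not state one.

```python
from typing import NamedTuple, Optional, Sequence


class Ptm(NamedTuple):
    mod_type: str              # PSI-MOD identifier, e.g. "00048"
    coordinate: Optional[int]  # None stands for an unknown site


def ptm_matches(inp: Ptm, ref: Ptm, margin: int = 0, use_types: bool = True) -> bool:
    """Apply the type rule and then the coordinate rule to one pair of PTMs."""
    if use_types and inp.mod_type != ref.mod_type:
        return False
    if inp.coordinate is None or ref.coordinate is None:
        return True  # an unknown coordinate matches anything
    return abs(inp.coordinate - ref.coordinate) <= margin


def superset_match(input_ptms: Sequence[Ptm], reference_ptms: Sequence[Ptm],
                   margin: int = 0, use_types: bool = True) -> bool:
    """Every reference PTM needs a matching input PTM; extra input PTMs are allowed."""
    return all(
        any(ptm_matches(i, r, margin, use_types) for i in input_ptms)
        for r in reference_ptms
    )


fyn_input = [Ptm("00048", 420)]
superset_match(fyn_input, [])                                    # True: no reference PTM to cover
superset_match(fyn_input, [Ptm("00048", 420)])                   # True
superset_match(fyn_input, [Ptm("00048", 420), Ptm("00068", 2)])  # False: 00068:2 has no input match
```

Passing use_types=False corresponds to superset_no_types, and margin corresponds to the 'range' option.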
Subset
The set of input PTMs is a subset of the set of reference PTMs.
Command line argument: -m subset or -m subset_no_types
- The UniProt Accession is the same
- The Isoform is the same; either:
  - Both have an isoform specified. Ex: P31749-3
  - Both refer to the default one. Ex: P31749
- The PTMs:
  - Each input PTM must have a matching reference PTM. Some reference PTMs might not have a matching input PTM.
  - A PTM matches if these two requirements are both true:
    - The types match:
      - With subset, the types must be equal.
      - With subset_no_types, the type is not considered.
    - The coordinates match, which happens in any of these cases:
      - Both are known (positive integer) coordinates and are the same.
      - Both are known (positive integer) coordinates and they differ, but the absolute difference between the two coordinates is less than or equal to a user-defined margin ('range' option in command line).
      - At least one of the coordinates is unknown ("null", empty, "?", "-1").
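A minimal sketch of the subset rule, assuming PTMs are represented as (type, coordinate) tuples with None for an unknown coordinate. The function names are hypothetical and the code only illustrates the rules listed above. Note that the quantifiers are swapped compared with superset: here every input PTM must be covered by the reference.

```python
from typing import Optional, Sequence, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate or None when unknown)


def ptm_matches(inp: Ptm, ref: Ptm, margin: int = 0, use_types: bool = True) -> bool:
    """Type rule followed by coordinate rule for one pair of PTMs."""
    (in_type, in_coord), (ref_type, ref_coord) = inp, ref
    if use_types and in_type != ref_type:
        return False
    if in_coord is None or ref_coord is None:
        return True
    return abs(in_coord - ref_coord) <= margin


def subset_match(input_ptms: Sequence[Ptm], reference_ptms: Sequence[Ptm],
                 margin: int = 0, use_types: bool = True) -> bool:
    """Every input PTM needs a matching reference PTM; extra reference PTMs are allowed."""
    return all(
        any(ptm_matches(i, r, margin, use_types) for r in reference_ptms)
        for i in input_ptms
    )


query = [("00039", None), ("00162", None)]
subset_match(query, [("00038", None), ("00039", None), ("00162", None)])  # True
subset_match(query, [("00039", None)])                                    # False: 00162 is not covered
subset_match([], [("00048", 531)])                                        # True: an empty input is a subset of anything
```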
One
At least one input PTM matches a reference PTM.
Command line argument: -m one or -m one_no_types
- The UniProt Accession is the same
- The Isoform is the same; either:
  - Both have an isoform specified. Ex: P31749-3
  - Both refer to the default one. Ex: P31749
- The PTMs:
  - At least one input PTM must have a matching reference PTM.
  - A PTM matches if these two requirements are both true:
    - The types match:
      - With one, the types must be equal.
      - With one_no_types, the type is not considered.
    - The coordinates match, which happens in any of these cases:
      - Both are known (positive integer) coordinates and are the same.
      - Both are known (positive integer) coordinates and they differ, but the absolute difference between the two coordinates is less than or equal to a user-defined margin ('range' option in command line).
      - At least one of the coordinates is unknown ("null", empty, "?", "-1").
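The "one" rule reduces to a single existence check over all input and reference pairs. The sketch below is illustrative only, with hypothetical names and PTMs represented as (type, coordinate) tuples. It implements the written rule literally, so it returns False for a reference proteoform without PTMs, whereas the use case outputs on this page also list the unmodified reference proteoform for -m one.

```python
from typing import Optional, Sequence, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate or None when unknown)


def ptm_matches(inp: Ptm, ref: Ptm, margin: int = 0, use_types: bool = True) -> bool:
    """Type rule followed by coordinate rule for one pair of PTMs."""
    if use_types and inp[0] != ref[0]:
        return False
    if inp[1] is None or ref[1] is None:
        return True
    return abs(inp[1] - ref[1]) <= margin


def one_match(input_ptms: Sequence[Ptm], reference_ptms: Sequence[Ptm],
              margin: int = 0, use_types: bool = True) -> bool:
    """True as soon as any input PTM matches any reference PTM."""
    return any(
        ptm_matches(i, r, margin, use_types)
        for i in input_ptms
        for r in reference_ptms
    )


query = [("00046", 21), ("00048", 420), ("00048", 531)]
one_match(query, [("00048", 531)])                             # True
one_match(query, [("00068", 2), ("00115", 3), ("00115", 6)])   # False: no type in common
```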
Strict
Proteoforms must match exactly in all the attributes.
Command line argument: -m strict
- The UniProt Accession is the same
- The Isoform is the same; either:
  - Both have an isoform specified. Ex: P31749-3
  - Both refer to the default one. Ex: P31749
- The PTMs have the same elements:
  - The reference PTM set and the input PTM set have the same size.
  - Each reference PTM has a matching input PTM.
  - A PTM matches if:
    - Types are the same.
    - Coordinates are the same:
      - If they are numbers, they should be equal.
      - If one is null, then both should be null.
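The strict PTM comparison can be sketched as below, assuming (type, coordinate) tuples with None for a null coordinate. The function name is hypothetical. The sketch follows the two listed conditions literally (equal size, every reference PTM found in the input) and uses neither a margin nor the unknown-coordinate wildcard of the other matching types.

```python
from typing import Optional, Sequence, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate or None when null)


def strict_match(input_ptms: Sequence[Ptm], reference_ptms: Sequence[Ptm]) -> bool:
    """Same number of PTMs, and each reference PTM equals some input PTM."""
    if len(input_ptms) != len(reference_ptms):
        return False
    # Tuple equality covers both rules: equal types, and coordinates that are
    # either equal numbers or both None.
    return all(any(i == r for i in input_ptms) for r in reference_ptms)


strict_match([("00048", 420), ("00068", 2)], [("00068", 2), ("00048", 420)])  # True: order is irrelevant
strict_match([("00048", None)], [("00048", 420)])                            # False: null only equals null
strict_match([], [])                                                          # True: both unmodified
```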
Accession
Proteoforms only need to share the UniProt accession. This is conceptually equivalent to a pathway search at the protein level with the command match-uniprot, but in this case the results are displayed as participant proteoforms.
Command line argument: -m accession
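As an illustration, accession matching only compares the part of the identifier before any isoform suffix or PTM list. The sketch below assumes the proteoform notation used in the examples on this page (accession, optional "-isoform", then ";" and the PTMs); the function names are hypothetical.

```python
def accession_of(proteoform: str) -> str:
    """Return the bare UniProt accession of a proteoform string."""
    protein = proteoform.strip().split(";", 1)[0]  # drop the PTM list
    return protein.split("-", 1)[0]                # drop the isoform suffix


def accession_match(input_proteoform: str, reference_proteoform: str) -> bool:
    """Isoform and PTMs are ignored; only the accession is compared."""
    return accession_of(input_proteoform) == accession_of(reference_proteoform)


references = ["P06241;", "P06241;00048:531", "P06241-1;00068:2", "P20908;00037:null"]
selected = [r for r in references if accession_match("P06241;00048:420", r)]
# selected keeps the three P06241 entries, including the isoform P06241-1
```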
Extra considerations:
- Negative values, zero or floating-point numbers are invalid as sequence coordinates in the input.
- We accept only PSI-MOD ontology modification types.
- The margin to compare the coordinates should be set as an unsigned integer.
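These input constraints can be checked before matching. The following is a hedged sketch with hypothetical helper names. It assumes that an unknown input coordinate is written as "?", "null" or left empty (as in Table 1), and that modification types are written as five-digit PSI-MOD identifiers such as 00048, which is how they appear in the examples on this page.

```python
import re
from typing import Optional

PSI_MOD_ID = re.compile(r"\d{5}")    # assumption: five digits, e.g. "00048"
UNKNOWN_INPUT = {"", "?", "null"}    # unknown-coordinate tokens for the input


def _is_unsigned_int(value: str) -> bool:
    return value.isascii() and value.isdigit()


def parse_input_coordinate(text: str) -> Optional[int]:
    """Return a positive integer, or None for an unknown coordinate."""
    value = text.strip()
    if value in UNKNOWN_INPUT:
        return None
    # Rejects negatives ("-1"), floating-point numbers ("3.5") and zero.
    if not _is_unsigned_int(value) or int(value) == 0:
        raise ValueError(f"invalid input coordinate: {text!r}")
    return int(value)


def parse_margin(text: str) -> int:
    """The margin must be an unsigned integer; zero is allowed."""
    value = text.strip()
    if not _is_unsigned_int(value):
        raise ValueError(f"invalid margin: {text!r}")
    return int(value)


def check_mod_type(text: str) -> str:
    """Accept only identifiers that look like PSI-MOD ids (format check only)."""
    value = text.strip()
    if not PSI_MOD_ID.fullmatch(value):
        raise ValueError(f"not a PSI-MOD identifier: {text!r}")
    return value
```

The type check only validates the format; it does not confirm that the identifier exists in the PSI-MOD ontology.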
Table 1 shows examples of PTM coordinate matching. Each row compares the coordinate of an input PTM with the coordinate of a reference PTM. The letter k represents any positive integer.
Table 1
Input | Reference | Margin | Matched | Comment |
---|---|---|---|---|
17 | 17 | 0 | Yes | Equal |
16 | 17 | 0 | No | Out of margin |
18 | 17 | 0 | No | Out of margin |
7 | 13 | 5 | No | Out of margin |
8 | 13 | 5 | Yes | In margin |
9 | 13 | 5 | Yes | In margin |
17 | 13 | 5 | Yes | In margin |
18 | 13 | 5 | Yes | In margin |
19 | 13 | 5 | No | Out of margin |
0 | 2 | 5 | No | Input in margin but not valid |
-1 | 2 | 5 | No | Input in margin but negative |
?, empty, null | Positive integer | k | Yes | Input is less specific |
Positive integer | ?, empty, null, -1 | k | Yes | Input is more specific |
?, empty, null | ?, empty, null, -1 | k | Yes | Equally unspecific |
Negative int, zero | Any | k | No | Negative or zero inputs are invalid |
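The rows of Table 1 can be reproduced with a small function. This is a sketch under the assumptions visible in the table: "-1" counts as unknown only on the reference side, an input of zero or below never matches, and reference coordinates are assumed to be well-formed. The name coordinates_match is hypothetical.

```python
INPUT_UNKNOWN = {"", "?", "null"}
REFERENCE_UNKNOWN = {"", "?", "null", "-1"}


def coordinates_match(input_coord: str, reference_coord: str, margin: int = 0) -> bool:
    """Compare an input PTM coordinate with a reference PTM coordinate."""
    inp = input_coord.strip()
    ref = reference_coord.strip()
    if inp not in INPUT_UNKNOWN:
        # A known input coordinate must be a positive integer.
        if not (inp.isascii() and inp.isdigit()) or int(inp) == 0:
            return False
    if inp in INPUT_UNKNOWN or ref in REFERENCE_UNKNOWN:
        return True  # an unknown coordinate on either side matches
    return abs(int(inp) - int(ref)) <= margin


coordinates_match("8", "13", margin=5)   # True: in margin
coordinates_match("7", "13", margin=5)   # False: out of margin
coordinates_match("-1", "2", margin=5)   # False: negative input
coordinates_match("17", "-1", margin=0)  # True: reference is unknown
```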
Use case examples
Example P20908 - COL5A1:
The protein Collagen alpha-1(V) chain has 12 proteoforms annotated in Reactome:
P20908;
P20908;00037:null
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
P20908;00039:null
P20908;00039:null,00162:null
P20908;00039:null,01914:null
P20908;00162:null
P20908;01914:null
For most of these proteoforms, the exact site of the modified residue is not annotated. It is still possible to select groups of proteoforms by modification type using the different matching criteria of PathwayMatcher, and then to search for reactions and pathways with the selected proteoforms.
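The proteoform lines listed above follow a simple pattern: accession, optional "-isoform", a ";" and then comma-separated type:coordinate pairs. A minimal parsing sketch is shown below. The names Proteoform and parse_proteoform are hypothetical, and any coordinate that is not a plain run of digits (for example null) is stored as None without further validation.

```python
from typing import NamedTuple, Optional, Tuple


class Proteoform(NamedTuple):
    accession: str
    isoform: Optional[str]                     # None for the default isoform
    ptms: Tuple[Tuple[str, Optional[int]], ...]


def parse_proteoform(line: str) -> Proteoform:
    """Split a line such as 'P06241-1;00048:420,00068:2' into its parts."""
    protein, _, ptm_part = line.strip().partition(";")
    accession, _, isoform = protein.partition("-")
    ptms = []
    for item in filter(None, ptm_part.split(",")):  # skip empty PTM lists
        mod_type, _, coord = item.partition(":")
        coord = coord.strip()
        known = coord.isascii() and coord.isdigit()
        ptms.append((mod_type.strip(), int(coord) if known else None))
    return Proteoform(accession, isoform or None, tuple(ptms))


parse_proteoform("P20908;00038:null,00039:null")
# Proteoform(accession='P20908', isoform=None, ptms=(('00038', None), ('00039', None)))
```

Both "P20908" and "P20908;" parse to the same unmodified proteoform, which agrees with Strict Example 2 below.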
Strict
Example 1:
INPUT:
P20908;00038:null,00039:null,00162:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P20908/Strict/example1.txt
Selected proteoforms:
P20908;00038:null,00039:null,00162:null
Example 2:
INPUT:
P20908
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P20908/Strict/example2.txt
Selected proteoforms:
P20908;
Superset
Example 1:
INPUT:
P20908;00038:null,00039:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m superset -i resources/input/UseCases/P20908/Superset/example1.txt -o output/
Selected proteoforms:
P20908;
P20908;00039:null
P20908;00038:null,00039:null
Example 2:
INPUT:
P20908;00038:null,00039:null,01914:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m superset -i resources/input/UseCases/P20908/Superset/example2.txt -o output/
Selected proteoforms:
P20908;
P20908;00039:null
P20908;01914:null
P20908;00038:null,00039:null
P20908;00039:null,01914:null
P20908;00038:null,00039:null,01914:null
Subset
Example 1:
INPUT:
P20908;00037:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m subset -i resources/input/UseCases/P20908/Subset/example1.txt -o output/
Selected proteoforms:
P20908;00037:null
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
Example 2:
INPUT:
P20908;00039:null,00162:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m subset -i resources/input/UseCases/P20908/Subset/example2.txt -o output/
Selected proteoforms:
P20908;00038:null,00039:null,00162:null
P20908;00039:null,00162:null
One
Example 1:
INPUT:
P20908;00038:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P20908/One/example1.txt -o output/
Selected proteoforms:
P20908;
P20908;00037:null,00038:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
Example 2:
INPUT:
P20908;00039:null,01914:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P20908/One/example2.txt -o output/
Selected proteoforms:
P20908;
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
P20908;00039:null
P20908;00039:null,00162:null
P20908;00039:null,01914:null
P20908;01914:null
Accession
Example 1:
INPUT:
P20908;00039:null,01914:null
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m accession -i resources/input/UseCases/P20908/Accession/example1.txt -o output/
Selected proteoforms:
P20908;
P20908;00037:null
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
P20908;00039:null
P20908;00039:null,00162:null
P20908;00039:null,01914:null
P20908;00162:null
P20908;01914:null
Example P06241 - FYN:
The Tyrosine-protein kinase Fyn has 8 proteoforms annotated in Reactome:
P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
P06241;00068:2,00115:3,00115:6
P06241-1;00048:420,00068:2
P06241-1;00068:2
We can select different sets of proteoforms by choosing the different possible matching criteria of PathwayMatcher:
Strict
Example 1:
INPUT:
P06241-1;00048:420,00068:2
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P06241/Strict/example1.txt -o output/
Selected proteoforms:
P06241-1;00048:420,00068:2
Example 2:
INPUT:
P06241;00048:531
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P06241/Strict/example2.txt -o output/
Selected proteoforms:
P06241;00048:531
Superset
Example 1:
INPUT:
P06241-1;00048:420,00068:2
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m superset -i resources/input/UseCases/P06241/Superset/example1.txt -o output/
Selected proteoforms:
P06241-1;00048:420,00068:2
P06241-1;00068:2
Example 2:
INPUT:
P06241;00048:420
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m superset -r 5 -i resources/input/UseCases/P06241/Superset/example2.txt -o output/
Selected proteoforms:
P06241;
P06241;00048:420
Subset
Example 1:
INPUT:
P06241;00048:420
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m subset -r 5 -i resources/input/UseCases/P06241/Subset/example1.txt -o output/
Selected proteoforms:
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
Example 2:
INPUT:
P06241;
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m subset -i resources/input/UseCases/P06241/Subset/example2.txt -o output/
Selected proteoforms:
P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
P06241;00068:2,00115:3,00115:6
One
Example 1:
INPUT:
P06241;00048:420
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P06241/One/example1.txt -o output/
Selected proteoforms:
P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
Example 2:
INPUT:
P06241;00046:21,00048:420,00048:531
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P06241/One/example2.txt -o output/
Selected proteoforms:
P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
Accession
Example 1:
INPUT:
P06241;00048:420
COMMAND:
java -jar PathwayMatcher.jar match-proteoforms -m accession -i resources/input/UseCases/P06241/Accession/example1.txt -o output/
Selected proteoforms:
P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
P06241;00068:2,00115:3,00115:6
P06241-1;00048:420,00068:2
P06241-1;00068:2