Proteoform matching - PathwayAnalysisPlatform/PathwayMatcher GitHub Wiki

Proteoform matching is the process of deciding whether two proteoforms are equivalent to each other.

Matching types

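The matching types defined for PathwayMatcher are:

  • superset and superset_no_types
  • subset and subset_no_types
  • one and one_no_types
  • strict
  • accession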

The matching type is selected with the command line argument -m or --matchType followed by the desired type:

java -jar PathwayMatcher.jar match-proteoforms -m superset -i myFile.txt
java -jar PathwayMatcher.jar match-proteoforms -m strict -i myFile.txt

Superset

The set of input PTMs is a superset of the reference PTM set.

Command line argument: -m superset or -m superset_no_types

  • The UniProt Accession is the same
  • The Isoform is the same; either:
    • Both have an isoform specified. Ex: P31749-3
    • Both refer to the default one. Ex: P31749
  • The PTMs:
    • The input contains ALL the reference PTMs or more (Input is superset or equal). Each reference PTM must have a matching input PTM. Some input PTMs might not have a matching reference PTM.
  • A PTM matches if both of these requirements are true:
    • The types match:
      • With superset, the types must be equal.
      • With superset_no_types, the type is not considered.
    • The coordinates match if any of the following holds:
      • Both are known (positive integer) coordinates and are the same.
      • Both are known (positive integer) coordinates and differ, but the absolute difference between them is less than or equal to a user-defined margin (the 'range' option on the command line).
      • One of the coordinates is unknown ("null", empty, "?", "-1").
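
The superset rules above can be sketched in Python as follows. This is an illustrative reading of the bullet list, not PathwayMatcher's actual code: the names coordinates_match, ptm_matches and superset_match are hypothetical, a PTM is modelled as a (type, coordinate) tuple with None standing for an unknown coordinate, and it is assumed that one input PTM may satisfy more than one reference PTM.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical model: (PSI-MOD type, coordinate); None = unknown coordinate.
Ptm = Tuple[str, Optional[int]]


def coordinates_match(a: Optional[int], b: Optional[int], margin: int = 0) -> bool:
    """True if either coordinate is unknown or both lie within the margin."""
    if a is None or b is None:
        return True
    return abs(a - b) <= margin


def ptm_matches(inp: Ptm, ref: Ptm, margin: int = 0, use_types: bool = True) -> bool:
    """Apply the two requirements: type (skipped for *_no_types) and coordinate."""
    if use_types and inp[0] != ref[0]:
        return False
    return coordinates_match(inp[1], ref[1], margin)


def superset_match(input_ptms: Sequence[Ptm], reference_ptms: Sequence[Ptm],
                   margin: int = 0, use_types: bool = True) -> bool:
    """Every reference PTM needs at least one matching input PTM.

    Assumption: the same input PTM may be reused for several reference PTMs.
    """
    return all(
        any(ptm_matches(i, r, margin, use_types) for i in input_ptms)
        for r in reference_ptms
    )
```

Under this reading, a reference proteoform without PTMs is matched by any input with the same accession and isoform, because there is no reference PTM left unmatched.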

Subset

The set of input PTMs is a subset of the reference PTM set.

Command line argument: -m subset or -m subset_no_types

  • The UniProt Accession is the same
  • The Isoform is the same; either:
    • Both have an isoform specified. Ex: P31749-3
    • Both refer to the default one. Ex: P31749
  • The PTMs:
    • Each input PTM must have a matching reference PTM. Some reference PTMs might not have a matching input PTM.
  • A PTM matches if both of these requirements are true:
    • The types match:
      • With subset, the types must be equal.
      • With subset_no_types, the type is not considered.
    • The coordinates match if any of the following holds:
      • Both are known (positive integer) coordinates and are the same.
      • Both are known (positive integer) coordinates and differ, but the absolute difference between them is less than or equal to a user-defined margin (the 'range' option on the command line).
      • One of the coordinates is unknown ("null", empty, "?", "-1").
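
A minimal Python sketch of the subset rule, under the same reading of the bullets above. The function names and the (type, coordinate) tuple model are assumptions made for illustration; they are not taken from the PathwayMatcher source.

```python
from typing import List, Optional, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate); None = unknown


def ptm_matches(inp: Ptm, ref: Ptm, margin: int = 0, use_types: bool = True) -> bool:
    """Type check (skipped for subset_no_types) followed by the coordinate check."""
    if use_types and inp[0] != ref[0]:
        return False
    if inp[1] is None or ref[1] is None:
        return True  # an unknown coordinate matches anything
    return abs(inp[1] - ref[1]) <= margin


def subset_match(input_ptms: List[Ptm], reference_ptms: List[Ptm],
                 margin: int = 0, use_types: bool = True) -> bool:
    """Every input PTM needs at least one matching reference PTM."""
    return all(
        any(ptm_matches(i, r, margin, use_types) for r in reference_ptms)
        for i in input_ptms
    )


# An input with one PTM is a subset of a reference that carries two.
print(subset_match([("00048", 420)], [("00046", 21), ("00048", 420)]))  # True
# An input PTM that the reference lacks breaks the match.
print(subset_match([("00048", 531)], [("00048", 420)]))  # False
```

The only difference from the superset rule is the direction of the quantifiers: here the loop runs over the input PTMs and looks for a partner in the reference.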

One

At least one input PTM matches a reference PTM.

Command line argument: -m one or -m one_no_types

  • The UniProt Accession is the same
  • The Isoform is the same; either:
    • Both have an isoform specified. Ex: P31749-3
    • Both refer to the default one. Ex: P31749
  • The PTMs:
    • At least one input PTM must have a matching reference PTM
  • A PTM matches if both of these requirements are true:
    • The types match:
      • With one, the types must be equal.
      • With one_no_types, the type is not considered.
    • The coordinates match if any of the following holds:
      • Both are known (positive integer) coordinates and are the same.
      • Both are known (positive integer) coordinates and differ, but the absolute difference between them is less than or equal to a user-defined margin (the 'range' option on the command line).
      • One of the coordinates is unknown ("null", empty, "?", "-1").
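
The "one" rule, taken literally from the bullets above, reduces to a single existence check over all input/reference PTM pairs. The sketch below is hypothetical (the names and the tuple model are not from PathwayMatcher) and only encodes the PTM condition as written; accession and isoform checks are left out.

```python
from itertools import product
from typing import Iterable, Optional, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate); None = unknown


def ptm_matches(inp: Ptm, ref: Ptm, margin: int = 0, use_types: bool = True) -> bool:
    """Type check (skipped for one_no_types) followed by the coordinate check."""
    if use_types and inp[0] != ref[0]:
        return False
    if inp[1] is None or ref[1] is None:
        return True
    return abs(inp[1] - ref[1]) <= margin


def one_match(input_ptms: Iterable[Ptm], reference_ptms: Iterable[Ptm],
              margin: int = 0, use_types: bool = True) -> bool:
    """True as soon as any (input PTM, reference PTM) pair matches."""
    return any(
        ptm_matches(i, r, margin, use_types)
        for i, r in product(input_ptms, reference_ptms)
    )
```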

Strict

Proteoforms must match exactly in all the attributes.

Command line argument: -m strict

  • The UniProt Accession is the same
  • The Isoform is the same; either:
    • Both have an isoform specified. Ex: P31749-3
    • Both refer to the default one. Ex: P31749
  • The PTMs have the same elements:
    • The reference PTM set and the input PTM set have the same size.
    • Each reference PTM has a matching input PTM.
  • A PTM matches if:
    • Types are the same.
    • Coordinates are the same:
      • If they are numbers, they must be equal.
      • If they are null, both must be null.
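
As a rough illustration of the strict rule, the sketch below compares two proteoforms attribute by attribute. The Proteoform dataclass and strict_match are hypothetical names introduced here; None is assumed to stand for both the default isoform and a null coordinate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate); None = null


@dataclass(frozen=True)
class Proteoform:
    accession: str
    isoform: Optional[str] = None      # None means the default isoform
    ptms: Tuple[Ptm, ...] = ()


def strict_match(inp: Proteoform, ref: Proteoform) -> bool:
    """Same accession, same isoform, same number of PTMs, and every
    reference PTM found in the input with identical type and coordinate."""
    if inp.accession != ref.accession or inp.isoform != ref.isoform:
        return False
    if len(inp.ptms) != len(ref.ptms):
        return False
    # Tuple equality treats None == None as equal and None != 420 as different,
    # which mirrors "if they are null, both must be null".
    return all(r in inp.ptms for r in ref.ptms)
```

No margin is applied in this sketch, since the strict rule as described asks for equal coordinates.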

Accession

Proteoforms only need to share the UniProt accession. This is conceptually equivalent to a pathway search at the protein level with the command match-uniprot, except that the results are displayed as participant proteoforms.

Command line argument: -m accession

Extra considerations:

  • Negative values, zero or floating-point numbers are invalid as sequence coordinates in the input.
  • We accept only PSI-MOD ontology modification types.
  • The margin to compare the coordinates should be set as an unsigned integer.

Table 1 shows examples of PTM coordinate matching. Each row compares the coordinate of an input PTM with the coordinate of a reference PTM. The letter k represents any positive integer.

Table 1

Input | Reference | Margin | Matched | Comment
17 | 17 | 0 | Yes | Equal
16 | 17 | 0 | No | Out of margin
18 | 17 | 0 | No | Out of margin
7 | 13 | 5 | No | Out of margin
8 | 13 | 5 | Yes | In margin
9 | 13 | 5 | Yes | In margin
17 | 13 | 5 | Yes | In margin
18 | 13 | 5 | Yes | In margin
19 | 13 | 5 | No | Out of margin
0 | 2 | 5 | No | Input in margin but not valid
-1 | 2 | 5 | No | Input in margin but negative
?, empty, null | Positive integer | k | Yes | Input is less specific
Positive integer | ?, empty, null, -1 | k | Yes | Input is more specific
?, empty, null | ?, empty, null, -1 | k | Yes | Equally unspecific
Negative int, zero | Any | k | No | Negative or zero input are invalid
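
The rows of Table 1 can be reproduced with a small Python function. This is a sketch built only from the table: the names read_coordinate and coordinate_matched are hypothetical, and following the table, "-1" is treated as unknown on the reference side only, while a negative or zero input is rejected as invalid.

```python
import re
from typing import Optional, Set, Tuple

# Tokens read as "unknown coordinate", per Table 1 (assumed spelling of the tokens).
UNKNOWN_INPUT = {"", "?", "null"}
UNKNOWN_REFERENCE = {"", "?", "null", "-1"}


def read_coordinate(raw: str, unknown_tokens: Set[str]) -> Tuple[bool, Optional[int]]:
    """Return (is_valid, coordinate); coordinate is None when unknown."""
    token = raw.strip().lower()
    if token in unknown_tokens:
        return True, None
    # Only positive integers are valid: no sign, no decimals, not zero.
    if re.fullmatch(r"[0-9]+", token) and int(token) > 0:
        return True, int(token)
    return False, None


def coordinate_matched(input_raw: str, reference_raw: str, margin: int) -> bool:
    """Decide the 'Matched' column for one row of Table 1."""
    input_ok, inp = read_coordinate(input_raw, UNKNOWN_INPUT)
    reference_ok, ref = read_coordinate(reference_raw, UNKNOWN_REFERENCE)
    if not (input_ok and reference_ok):
        return False
    if inp is None or ref is None:
        return True
    return abs(inp - ref) <= margin


# (input, reference, margin) taken from Table 1, with 13 and 5 standing in for k.
for row in [("17", "17", 0), ("16", "17", 0), ("8", "13", 5), ("19", "13", 5),
            ("0", "2", 5), ("-1", "2", 5), ("?", "13", 5), ("13", "-1", 5)]:
    print(row, "Yes" if coordinate_matched(*row) else "No")
```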

Use case examples

Example P20908 - COL5A1:

The protein Collagen alpha-1(V) chain has 12 proteoforms annotated in Reactome:

P20908;
P20908;00037:null
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
P20908;00039:null
P20908;00039:null,00162:null
P20908;00039:null,01914:null
P20908;00162:null
P20908;01914:null

Most of these proteoforms carry post-translational modifications whose exact residue position is not annotated. It is still possible to select groups of proteoforms by modification type using the different matching criteria of PathwayMatcher, and then to search for reactions and pathways with the selected proteoforms.
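
The proteoform lines listed above appear to follow the pattern accession, optional "-isoform", a semicolon, then comma-separated type:coordinate pairs. The parser below is a sketch under that assumption, inferred only from the examples on this page; parse_proteoform is a hypothetical helper, not part of PathwayMatcher.

```python
from typing import List, Optional, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, coordinate); None = unknown


def parse_proteoform(line: str) -> Tuple[str, Optional[str], List[Ptm]]:
    """Split e.g. 'P06241-1;00048:420,00068:2' into accession, isoform and PTMs."""
    head, _, tail = line.strip().partition(";")
    accession, _, isoform = head.partition("-")
    ptms: List[Ptm] = []
    for item in filter(None, tail.split(",")):  # skip empty pieces, e.g. after 'P20908;'
        mod_type, _, coord = item.partition(":")
        unknown = coord.strip().lower() in ("", "?", "null")
        ptms.append((mod_type, None if unknown else int(coord)))
    return accession, isoform or None, ptms


print(parse_proteoform("P20908;00038:null,00039:null"))
print(parse_proteoform("P06241-1;00048:420,00068:2"))
```

Returning None for both the default isoform and an unknown coordinate keeps the parsed values directly comparable when applying the matching rules.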

Strict

Example 1:

INPUT:

P20908;00038:null,00039:null,00162:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P20908/Strict/example1.txt 

Selected proteoforms:

P20908;00038:null,00039:null,00162:null

Example 2:

INPUT:

P20908

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P20908/Strict/example2.txt 

Selected proteoforms:

P20908;

Superset

Example 1:

INPUT:

P20908;00038:null,00039:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m superset -i resources/input/UseCases/P20908/Superset/example1.txt -o output/

Selected proteoforms:

P20908;
P20908;00039:null
P20908;00038:null,00039:null

Example 2:

INPUT:

P20908;00038:null,00039:null,01914:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m superset -i resources/input/UseCases/P20908/Superset/example2.txt -o output/

Selected proteoforms:

P20908;
P20908;00039:null
P20908;01914:null
P20908;00038:null,00039:null
P20908;00039:null,01914:null
P20908;00038:null,00039:null,01914:null

Subset

Example 1:

INPUT:

P20908;00037:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m subset -i resources/input/UseCases/P20908/Subset/example1.txt -o output/

Selected proteoforms:

P20908;00037:null
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null

Example 2:

INPUT:

P20908;00039:null,00162:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m subset -i resources/input/UseCases/P20908/Subset/example2.txt -o output/

Selected proteoforms:

P20908;00038:null,00039:null,00162:null
P20908;00039:null,00162:null

One

Example 1:

INPUT:

P20908;00038:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P20908/One/example1.txt -o output/

Selected proteoforms:

P20908;
P20908;00037:null,00038:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null

Example 2:

INPUT:

P20908;00039:null,01914:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P20908/One/example2.txt -o output/

Selected proteoforms:

P20908;
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
P20908;00039:null
P20908;00039:null,00162:null
P20908;00039:null,01914:null
P20908;01914:null

Accession

Example 1:

INPUT:

P20908;00039:null,01914:null

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m accession -i resources/input/UseCases/P20908/Accession/example1.txt -o output/

Selected proteoforms:

P20908;
P20908;00037:null
P20908;00037:null,00038:null,00039:null
P20908;00037:null,00039:null
P20908;00038:null,00039:null
P20908;00038:null,00039:null,00162:null
P20908;00038:null,00039:null,01914:null
P20908;00039:null
P20908;00039:null,00162:null
P20908;00039:null,01914:null
P20908;00162:null
P20908;01914:null

Example P06241 - FYN:

The Tyrosine-protein kinase Fyn has 8 proteoforms annotated in Reactome:

P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
P06241;00068:2,00115:3,00115:6
P06241-1;00048:420,00068:2
P06241-1;00068:2

We can select different sets of proteoforms by choosing the different possible matching criteria of PathwayMatcher:

Strict

Example 1:

INPUT:

P06241-1;00048:420,00068:2

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P06241/Strict/example1.txt -o output/

Selected proteoforms:

P06241-1;00048:420,00068:2

Example 2:

INPUT:

P06241;00048:531

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m strict -i resources/input/UseCases/P06241/Strict/example2.txt -o output/

Selected proteoforms:

P06241;00048:531

Superset

Example 1:

INPUT:

P06241-1;00048:420,00068:2

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m superset -i resources/input/UseCases/P06241/Superset/example1.txt -o output/

Selected proteoforms:

P06241-1;00048:420,00068:2
P06241-1;00068:2

Example 2:

INPUT:

P06241;00048:420

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m superset -r 5 -i resources/input/UseCases/P06241/Superset/example2.txt -o output/

Selected proteoforms:

P06241;
P06241;00048:420

Subset

Example 1:

INPUT:

P06241;00048:420

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m subset -r 5 -i resources/input/UseCases/P06241/Subset/example1.txt -o output/

Selected proteoforms:

P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2

Example 2:

INPUT:

P06241;

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m subset -i resources/input/UseCases/P06241/Subset/example2.txt -o output/

Selected proteoforms:

P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
P06241;00068:2,00115:3,00115:6

One

Example 1:

INPUT:

P06241;00048:420

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P06241/One/example1.txt -o output/

Selected proteoforms:

P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2

Example 2:

INPUT:

P06241;00046:21,00048:420,00048:531

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m one -i resources/input/UseCases/P06241/One/example2.txt -o output/

Selected proteoforms:

P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531

Accession

Example 1:

INPUT:

P06241;00048:420

COMMAND:

java -jar PathwayMatcher.jar match-proteoforms -m accession -i resources/input/UseCases/P06241/Accession/example1.txt -o output/

Selected proteoforms:

P06241;
P06241;00046:21,00048:420
P06241;00048:420
P06241;00048:420,00068:2
P06241;00048:531
P06241;00068:2,00115:3,00115:6
P06241-1;00048:420,00068:2
P06241-1;00068:2