3.1 Protein Sequence Status and Sequence Panel - PRIDE-Archive/pride-inspector GitHub Wiki

In MS-based proteomics experiments, potentially identified proteins are reported using the searched database’s proprietary identifiers. These identifiers are unstable: they can change or even be deleted over time. Deletion happens if, for instance, hypothetical proteins are removed when gene prediction algorithms are updated or new biological evidence becomes available.

A few years ago we investigated the impact of changing protein identifiers on stored proteomics data over time. We found that in several cases 10-20% of the reported identifiers were no longer valid only a year after the experimental results had been published. To highlight this problem to the user and to keep the reported data usable, the PRIDE Inspector Toolsuite can automatically check the status of reported protein identifications. To do this, we integrated components that access the source database of each identification and retrieve the current identifier status.

If the identifier was only updated, the new accession is automatically displayed in the protein table and the updated sequence is retrieved. In some cases, even though a protein’s identifier did not change, its underlying sequence was altered in the protein sequence database. Therefore, PRIDE Inspector automatically fetches a protein’s current sequence and checks whether the reported peptides still fit this identification.
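The peptide check described above can be sketched as a simple substring test. This is only an illustration, not PRIDE Inspector's actual implementation: the function name `find_unmatched_peptides`, the case-insensitive comparison, and the example sequence and peptides are all assumptions made up for this sketch.

```python
def find_unmatched_peptides(peptides, current_sequence):
    """Return the reported peptides that do not occur in the current sequence.

    Hypothetical helper: a peptide "fits" here if it is an exact,
    case-insensitive substring of the protein sequence.
    """
    sequence = current_sequence.upper()
    return [p for p in peptides if p.upper() not in sequence]


# Made-up example data, not taken from any real database entry.
current = "MKWVTFISLLLLFSSAYSRGVFRR"
reported = ["WVTFISLLLLFSSAYSR", "GVFRR", "DTHKSEIAHR"]

unmatched = find_unmatched_peptides(reported, current)
print(unmatched)  # ['DTHKSEIAHR']
```

An empty result would mean that every reported peptide still fits the current sequence; any peptide returned points to an identification that may need to be reviewed.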

Protein Sequence Status and Update

When using the Obtain Protein Details feature in PRIDE Inspector, the protein’s status in the original database is downloaded along with the protein name and protein sequence. The status is one of the following:

 - Active: the protein still exists in the original database, and the details remain unchanged.
 - Unknown: the protein does not exist in the original database.
 - Deleted: the protein has been removed from the original database.
 - Merged: the protein has been merged with other proteins to form a new protein.
 - Demerged: the protein has been split into two or more proteins.
 - Changed: the protein has been changed, but the type of change is unknown.
 - Error: there is an error associated with this protein.

To summarize, there are three main results for a protein’s status: active, changed, and deleted. For UniProtKB (UniProt KnowledgeBase), changed identifiers are subdivided into merged and demerged identifiers. Identifiers are demerged mainly because new identifiers were created for each species in which a protein was identified, as well as for each of the genes a protein can come from. Identifiers are merged mainly when, based on new gene prediction algorithms, proteins that were previously believed to be distinct are considered to come from the same gene. The International Protein Index (IPI) database was discontinued in September 2011. Therefore, PRIDE Inspector can only report whether a given IPI identifier was still active in the last IPI release; it cannot report changed or deleted identifiers.
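The relationship between the detailed statuses and the three main results can be sketched as a small lookup. This is an illustrative sketch only: the names `ProteinStatus` and `main_result` are hypothetical and do not come from the PRIDE Inspector code base. Returning `None` for Unknown and Error is also an assumption, since the text does not assign those two statuses to a main result.

```python
from enum import Enum


class ProteinStatus(Enum):
    """The statuses listed above (hypothetical type, for illustration only)."""
    ACTIVE = "Active"
    UNKNOWN = "Unknown"
    DELETED = "Deleted"
    MERGED = "Merged"
    DEMERGED = "Demerged"
    CHANGED = "Changed"
    ERROR = "Error"


# Merged and Demerged are the UniProtKB subdivisions of "changed".
_MAIN_RESULT = {
    ProteinStatus.ACTIVE: "active",
    ProteinStatus.MERGED: "changed",
    ProteinStatus.DEMERGED: "changed",
    ProteinStatus.CHANGED: "changed",
    ProteinStatus.DELETED: "deleted",
}


def main_result(status):
    """Map a detailed status to its main result.

    Assumption: Unknown and Error fall outside the three main results,
    so this sketch returns None for them.
    """
    return _MAIN_RESULT.get(status)


print(main_result(ProteinStatus.DEMERGED))  # changed
print(main_result(ProteinStatus.ERROR))     # None
```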