FAQs - westlake-repl/ProTrek GitHub Wiki

Q1: What is the limitation of ProTrek?

As an AI model, ProTrek has its limitations:

  • Predictions of de novo designed proteins

The training data for ProTrek is exclusively derived from the UniProt database, which primarily catalogs naturally occurring proteins. Consequently, ProTrek may exhibit limitations when analyzing certain de novo designed proteins, particularly those with smaller sizes. Many de novo designed proteins comprise only a single domain or a solitary motif, so predicting their functions is challenging because they carry less contextual information than full-length proteins. Their functional ambiguity may also arise from heightened sensitivity to subtle structural variations, a limitation further compounded by the scarcity of experimental validation and evolutionary data.

  • Predictions of subtle sequence changes at the mutation level

While ProTrek can identify core protein functional categories, it struggles with predicting quantitative or numerical properties (measurable characteristics like emission wavelength, binding affinity, or catalytic efficiency) that are determined by subtle sequence changes at the mutation level. For example, it can recognize fluorescent proteins but often fails to accurately predict their specific emission wavelengths. This limitation occurs because the model lacks sufficient training data to capture how fine changes in amino acid configurations affect specific biophysical parameters.

These functional changes induced by mutations often vary significantly across different proteins. Accurate prediction would require collecting massive datasets documenting each mutation and its corresponding functional change across diverse proteins, which presents a tremendous challenge. This limitation is similar to that of AlphaFold2, which, while capable of predicting protein structures, often lacks sensitivity to structural changes at the mutation level. For such specialized predictions, an effective approach is to use ColabProTrek to fine-tune the model on the corresponding mutation dataset (see FLIP), as sketched below.
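
For illustration only, the sketch below shows what such fine-tuning might look like once mutant embeddings have already been extracted with ProTrek. The `MutationRegressor` head, the embedding dimension, and the training loop are illustrative stand-ins, not the actual ColabProTrek workflow.

```python
# Illustrative sketch: fit a small regression head on top of frozen ProTrek
# protein embeddings for a mutation-level property (e.g., a FLIP split).
# The embedding dimension and head architecture are assumptions; check the
# model you actually load.
import torch
import torch.nn as nn

EMB_DIM = 1024  # assumed embedding size

class MutationRegressor(nn.Module):
    """Small MLP that maps a protein embedding to a scalar property."""
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

def train(embeddings: torch.Tensor, targets: torch.Tensor, epochs: int = 50):
    """embeddings: (N, EMB_DIM) precomputed embeddings of mutant sequences.
    targets: (N,) measured values such as binding affinity or brightness."""
    model = MutationRegressor()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optim.zero_grad()
        loss = loss_fn(model(embeddings), targets)
        loss.backward()
        optim.step()
    return model
```

Keeping the encoder frozen and training only a small head keeps data requirements modest, which matters when a mutation dataset contains only a few thousand labeled variants.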

  • Predictions of mini- or small proteins (e.g., <100 aa) or incomplete protein fragments

We found that ProTrek may fail to accurately predict the functions of certain (not all) understudied miniproteins. There are several reasons for this. First, miniproteins are relatively underrepresented in protein databases, with only a small fraction carrying experimentally validated GO terms, resulting in limited high-quality training data. Second, the exceptional structural diversity and conformational flexibility characteristic of these proteins mean that some of them lack clearly identifiable domains or conserved motifs. Additionally, their functions often depend critically on a small number of key residues, rendering global sequence pattern recognition less reliable. Furthermore, many miniproteins rely on specific partners (membrane receptors, molecular chaperones, or metal ions) for correct folding or function, and single-sequence predictors like ProTrek are often unable to model these interactions.

Similarly, if your input consists only of a protein fragment that depends on other regions or molecular partners to execute its biological functions, ProTrek may assign a lower confidence score to its predictions.

  • Predictions of protein-protein or protein-molecule interactions

ProTrek's structural training data all come from AlphaFoldDB, which contains structures predicted by AlphaFold2. Therefore, similar to AlphaFold2, ProTrek cannot directly handle protein complexes, but users can specify a particular chain of a complex for prediction. We have noticed that ProTrek can predict some potential interacting proteins or molecular information, even though it has not been explicitly trained in this manner. For example, a quick test with the text query "Interacting proteins of tumor suppressor protein p53" returned Q8BK35 and Q1LZ89.

For more accurate predictions, you can collect a training dataset, such as protein-protein interaction pairs or a protein-molecule classification task (with the molecule as the classification label), and then train a model using ColabProTrek. After training, you can use your trained model for prediction. A possible dataset layout is sketched below.
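
As a rough illustration of how such a dataset might be organized before training (the column names, file format, and placeholder sequences are only an example, not a ColabProTrek requirement):

```python
# Illustrative layout for a protein-molecule classification dataset,
# where the interacting molecule serves as the classification label.
# Sequences below are placeholders, not real proteins.
import csv

examples = [
    ("MKT...", "ATP"),    # protein known to bind ATP
    ("MSE...", "heme"),   # protein known to bind heme
    ("MAA...", "none"),   # negative example with no known small-molecule partner
]

with open("protein_molecule_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sequence", "label"])
    writer.writerows(examples)
```

A protein-protein interaction dataset can follow the same pattern, with two sequence columns and a binary interaction label.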

Q2: Why does the online ProTrek server run slower than described in the paper?

Currently, our service runs on 4 rented servers that also run many other tasks. The speed of our retrieval system is severely constrained by hardware limitations, including available memory, number of CPU cores, network latency, and server storage capacity. For example, when storage becomes saturated, increased disk I/O delays and impaired caching further exacerbate the issue. We are seeking funding to purchase or lease more capable servers.

Q3: Are protein sequences in the databases accurate?

All protein sequences are from official databases, and their accuracy depends on the quality of these original databases. ProTrek is designed to efficiently search and retrieve proteins based on available data. Proteins retrieved from these databases may still fail to express successfully due to various factors, such as errors in metagenomic sequencing or assembly, unsuitable expression conditions (e.g., incorrect expression temperature), or inappropriate inducer concentrations, among others. Therefore, we recommend further validation before experimental use.

Q4: What do the different ProTrek “matching scores” mean and how should they be interpreted?

The ProTrek matching score represents the proximity of two objects in the embedding space. The protein-text score (covering both protein-to-text and text-to-protein directions) measures the semantic relevance between a protein and its functional description. Scores above 15 typically indicate good relevance, scores above 18 suggest strong relevance, and scores below 10 mostly indicate low relevance. The sequence-sequence score reflects the distance between two proteins in the sequence embedding space; however, even proteins with dissimilar sequences (e.g., <30% sequence identity) may still score highly if they share very similar structures or functions. This is because ProTrek's contrastive learning mechanism forges tight associations among sequence, structure, and function (SSF) by pulling genuine sample pairs (sequence-structure, structure-function, and sequence-function) together while pushing negative samples apart in the latent space. In our experience, a sequence-sequence score above 45 often indicates strong relevance, while a score below 20 usually indicates low relevance (for reference, the highest sequence-to-sequence score is about 54, while the lowest is about 10-15).
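
As a quick reference, a small helper that maps a raw score to the rough relevance bands described above might look like the following. The cutoffs simply restate the guidance in this answer and are not absolute.

```python
# Helper that maps a raw ProTrek matching score to the rough relevance bands
# described in this FAQ. These cutoffs are guidelines only; always check the
# result rankings as well.
def interpret_score(score: float, score_type: str = "protein-text") -> str:
    if score_type == "protein-text":
        if score > 18:
            return "strong relevance"
        if score > 15:
            return "good relevance"
        if score < 10:
            return "low relevance"
        return "uncertain -- inspect the ranking and neighboring hits"
    if score_type == "sequence-sequence":
        if score > 45:
            return "strong relevance"
        if score < 20:
            return "low relevance"
        return "uncertain -- inspect the ranking and neighboring hits"
    raise ValueError(f"unknown score type: {score_type}")

print(interpret_score(19.2))                     # strong relevance
print(interpret_score(30, "sequence-sequence"))  # uncertain band
```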

To understand the range and implications of the different score types, compare random pairs (used to estimate the lower threshold) with true pairs (used to estimate the upper threshold). For example, when performing a structure-to-sequence search, you can use known true matches to estimate the highest achievable score (our reference value: 27-37) and mismatched pairs to estimate the minimum baseline score (our reference value: 5-10). This can be done using the "Compute similarity score between two modalities" option at the bottom of the interface.
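
A minimal sketch of this calibration procedure is shown below; `compute_similarity` is a hypothetical stand-in for whatever scoring interface you use (for example, the option mentioned above), not an actual ProTrek function.

```python
# Illustrative calibration of score thresholds for one search direction,
# e.g. structure-to-sequence. `compute_similarity` is a hypothetical callable
# that returns ProTrek's matching score for a (query, target) pair.
from statistics import mean

def calibrate(true_pairs, random_pairs, compute_similarity):
    """true_pairs / random_pairs: lists of (query, target) tuples."""
    true_scores = [compute_similarity(q, t) for q, t in true_pairs]
    random_scores = [compute_similarity(q, t) for q, t in random_pairs]
    upper = mean(true_scores)    # rough ceiling from known matches
    lower = mean(random_scores)  # rough floor from mismatched pairs
    return lower, upper

# A new query scoring close to `upper` is likely a genuine match;
# one near `lower` is probably noise.
```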

Also note that these thresholds are not absolute and may depend on the specific scenario. We also recommend that users refer to the rankings of the results rather than only the raw scores. See also here, which describes a way to check how a matching score ranks within the database to judge whether it is good.

Q5: Why are the results of a query limited to 10,000 entries, and will this limit be lifted for larger analyses?

To balance server capacity and ensure fair access (so that other users do not have to wait longer), we have implemented a temporary restriction limiting queries to 10,000 entries. We plan to increase this threshold to 100,000 entries following paper acceptance, accompanied by a full release of all embedding weights. This will enable researchers to perform large-scale, batch searches locally on their own workstations.

Q6: Have ProTrek's results been validated by wet lab experiments?

Yes, ProTrek has been experimentally validated for two different enzymes using seq2seq and text2seq searches (full details will be available in the final version). For one enzyme, the top 9 ranked sequences from the OMG database (200 million proteins) all showed the expected activity, and the top-ranked sequence exhibited higher activity than the currently reported enzyme.

As an AI tool, ProTrek has made some important progress, though limitations remain. We hope it can help biologists generate valuable scientific hypotheses and accelerate research progress.

We welcome your feedback on ProTrek - successes, limitations, or suggestions for improvement. Please contact us at [email protected] or [email protected] to help guide future development and inform other researchers.