ProteinMPNN - TUM-CBR/pymol-plugins GitHub Wiki

ProteinMPNN

This tool uses ProteinMPNN to design protein sequences. The ProteinMPNN paper describes the algorithm in detail. This page explains how to use the ProteinMPNN tool provided by CBR tools to design proteins.

Describe what you wish to Design

For this example, we will start with the PDB structure 2GVI. Use PyMOL's fetch command to load the structure: fetch 2gvi.

Our objective is to design a sequence that eliminates the tryptophan at PDB position 142. To do so, we will use ProteinMPNN to redesign the two helices that contain this tryptophan. Let's start by selecting all the residues that will be part of our design using PyMOL's selection capabilities:

image

We can then give the selection a meaningful name, such as "positions_to_engineer", using PyMOL's rename functionality:

image

We are now ready to engineer our first sequences. Open the "ProteinMPNN" application and hit the "refresh" button. The user interface should now list the "positions_to_engineer" selection. Check the "include" box next to its name:

image

Generating the new Sequences

Now hit the "Run" button to get your sequences. Sampling is stochastic, so your output will differ, but in my case I got something like the following:

>2gvi, score=1.4689, global_score=1.5176, fixed_chains=[], designed_chains=['A'], model_name=v_48_020, git_hash=8907e6671bfbfc92303b5f79c4b5e6ce47cdef57, seed=323
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVRNSFMDELSTRASDFFRYRKQGYEPSEIPAGAIDPVLEWISSLEDEEIFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=1, score=0.8839, global_score=1.3547, seq_recovery=0.4600
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVKESFLEELKVKAAKYFARLAKGVPCKDIPDEDIDPVLEWIASKKDEDIFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=2, score=0.8945, global_score=1.3492, seq_recovery=0.4400
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHLKKEFLEELKKIAAEYFALLAAGVPCRDIPDEAIDPVLEWIASKKDEEMFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=3, score=0.8479, global_score=1.3460, seq_recovery=0.4800
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVKEEFLEELKKIGAAYFARLAAGTPCTEIPAEDIDPVLEWIASKEDEDIFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=4, score=0.8756, global_score=1.3510, seq_recovery=0.4600
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHLKKSFLEELKEIAKKYFELLAKGVPCKEIPDEYIDPVLEWIASKKDEDIFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=5, score=0.8405, global_score=1.3199, seq_recovery=0.4600
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVKEEFLKELKEKAAAYFARLAAGTPCRDIPDSDIDPVLEWIASKKDEEIFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
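
The header of each record carries the sampling metadata. As far as I understand the ProteinMPNN output, "score" is the average negative log-probability over the designed residues (lower is better), "global_score" is the same quantity over all residues, and "seq_recovery" is the fraction of designed positions that match the native sequence. The Python sketch below ranks the samples by score. The helper names parse_header and rank_by_score are made up for this illustration; they are not part of CBR tools or ProteinMPNN.

```python
import re

# Header fields that hold numbers; everything else is kept as text.
NUMERIC_KEYS = {"T", "sample", "score", "global_score", "seq_recovery", "seed"}


def parse_header(header):
    """Parse a ProteinMPNN FASTA header into a dict of its key=value fields."""
    fields = {}
    for key, value in re.findall(r"(\w+)=([^,]+)", header.lstrip(">")):
        value = value.strip()
        fields[key] = float(value) if key in NUMERIC_KEYS else value
    return fields


def rank_by_score(headers):
    """Return the parsed sample headers sorted by score, lowest first."""
    parsed = [parse_header(h) for h in headers]
    # The native record has no "sample" field, so it is skipped here.
    samples = [f for f in parsed if "sample" in f]
    return sorted(samples, key=lambda f: f["score"])


# The five sample headers from the run shown above.
HEADERS = [
    ">T=0.1, sample=1, score=0.8839, global_score=1.3547, seq_recovery=0.4600",
    ">T=0.1, sample=2, score=0.8945, global_score=1.3492, seq_recovery=0.4400",
    ">T=0.1, sample=3, score=0.8479, global_score=1.3460, seq_recovery=0.4800",
    ">T=0.1, sample=4, score=0.8756, global_score=1.3510, seq_recovery=0.4600",
    ">T=0.1, sample=5, score=0.8405, global_score=1.3199, seq_recovery=0.4600",
]

if __name__ == "__main__":
    for fields in rank_by_score(HEADERS):
        print(int(fields["sample"]), fields["score"], fields["seq_recovery"])
```

For this run, sample 5 has the lowest score (0.8405) and sample 2 the highest (0.8945). The simple regular expression splits values at commas, so a header listing several designed chains would need more careful parsing.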

Even though ProteinMPNN produced 5 novel sequences, all of them still contain the tryptophan at position 142. It seems that ProteinMPNN, presumably because of its training data, strongly favors conserving this tryptophan. Nevertheless, we can tell ProteinMPNN explicitly not to place a tryptophan at this position.
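
Checking a position by eye in sequences of this length is error-prone, so a small script can help. The following is a minimal Python sketch, not part of CBR tools: the function names are made up for illustration, and it assumes that residue numbering starts at 1 at the first residue of each FASTA record, which may not hold for every PDB file. In the output above, the tryptophan does sit at index 142 of each record.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def residue_at(sequence, position):
    """Return the residue at a 1-based sequence position."""
    return sequence[position - 1]


def records_with_residue(text, position, residue):
    """Headers of all records that carry `residue` at 1-based `position`."""
    return [
        header
        for header, sequence in read_fasta(text)
        if len(sequence) >= position and residue_at(sequence, position) == residue
    ]


# Toy records (not real ProteinMPNN output) to show the behaviour.
TOY_FASTA = """>native
ACDWE
>sample_1
ACDWK
>sample_2
ACDYK
"""

if __name__ == "__main__":
    print(records_with_residue(TOY_FASTA, 4, "W"))
```

To check a real run, paste the ProteinMPNN output into a string and call records_with_residue(output, 142, "W"); any header it returns belongs to a sequence that kept the tryptophan.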

Excluding Residues

It is possible to exclude specific amino acids from certain positions. First, let's create a new selection with the relevant positions. In our case, we select only the tryptophan at position 142. Make sure to deselect the "positions_to_engineer" selection we made earlier before proceeding.

image

As before, let's save it under a new name, "positions_to_constrain". We should now have two selections: "positions_to_engineer", which contains the region of the protein we wish to engineer, and "positions_to_constrain", which contains only the tryptophan at position 142.

Go back to the "ProteinMPNN" application and hit the "refresh" button so that both selections are listed. Then edit the "excluded_residues" column of the "positions_to_constrain" row, adding only the letter "W" (for tryptophan).
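
The same selection can also be created from PyMOL's Python API instead of by clicking. The sketch below is one possible way to do it, under two assumptions: the object is named "2gvi" (the name that fetch 2gvi gives it) and the residue is in chain A (the chain reported as designed in the output above). The helper names are made up for illustration.

```python
def residue_selection(obj, chain, resi):
    """Build a PyMOL selection expression for residue(s) of one chain.

    `resi` may be a single number (142) or a range string such as "10-20".
    """
    return f"{obj} and chain {chain} and resi {resi}"


def create_selection(name, expression):
    """Create a named selection in a running PyMOL session.

    Returns the number of atoms selected, as a quick sanity check.
    """
    # Imported here so the string helper above also works outside PyMOL.
    from pymol import cmd

    cmd.select(name, expression)
    return cmd.count_atoms(name)


# Assumed object name and chain; see the note above.
EXPRESSION = residue_selection("2gvi", "A", 142)

if __name__ == "__main__":
    print(EXPRESSION)
```

Inside PyMOL, create_selection("positions_to_constrain", EXPRESSION) should then create the selection; a result of 0 atoms would mean that the object name, chain or residue number does not match your session. Passing a range string as resi would build the larger "positions_to_engineer" selection in the same way.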

image

Hit the "Run" button, and you should now get sequences that no longer have a tryptophan at position 142. In my case, I got:

>2gvi, score=1.5358, global_score=1.5209, fixed_chains=[], designed_chains=['A'], model_name=v_48_020, git_hash=8907e6671bfbfc92303b5f79c4b5e6ce47cdef57, seed=587
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVRNSFMDELSTRASDFFRYRKQGYEPSEIPAGAIDPVLEWISSLEDEEIFEYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=1, score=0.9385, global_score=1.3738, seq_recovery=0.3922
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVKKEFLEELKKIAAAYFAALAAGTPCRDIPDEWIDPVLAYIASKKDEDIFSYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=2, score=0.9401, global_score=1.3616, seq_recovery=0.4510
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVKKEFLEELKKIAKKYFELLAKGVKPKDIPDSAINPVLEYIASKKDEEIFSYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=3, score=0.9187, global_score=1.3343, seq_recovery=0.4510
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHLKEEFLKELKEIAKEYFDLVKKGVPCEEIPDSAIDPVLEYIASKKDEDIFSYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=4, score=0.9456, global_score=1.3681, seq_recovery=0.4510
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHVKDSFLKELKEIAKKYFELLKKGVPCKDIPDEAINPVLEFIASKKDEDIFSYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG
>T=0.1, sample=5, score=0.9975, global_score=1.3745, seq_recovery=0.3725
MEKLNFGIPEWAFEFHGHKCPYMPMGYRAGSYALKIAGLEKEKDHRTYLLSEMSPEDMNGCFNDGAQAATGCTYGKGLFSLLGYGKLALILYRPGRKAIRVHLKPEFLEELKVVGAAYFALLAKGVPCTDIPDEAIDPVLEYIASKKDEDMFSYREIDGFTFEPVKKNGAKVRCDVCGEYTYEADAKLLNGKPVCKPDYYG

ProteinMPNN did introduce a new tryptophan at a different position in sample 1 (the second sequence in the output), but none of the other samples has a tryptophan in the section we designed.
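
To see exactly what changed, the native and designed sequences can be compared position by position. The following Python sketch uses made-up helper names and compares residues 131 to 150 of the native record with the same stretch of sample 1 from the run above. The numbering counts from the first residue of the FASTA record; that this matches the PDB numbering is an assumption.

```python
def substitutions(native, design, start=1):
    """List substitutions such as 'W142Y' between two aligned sequences.

    `start` is the residue number of the first character in both strings.
    """
    if len(native) != len(design):
        raise ValueError("sequences must have the same length")
    return [
        f"{n}{start + i}{d}"
        for i, (n, d) in enumerate(zip(native, design))
        if n != d
    ]


def gained(native, design, residue, start=1):
    """Positions where `design` has `residue` but `native` does not."""
    return [
        start + i
        for i, (n, d) in enumerate(zip(native, design))
        if d == residue and n != residue
    ]


# Residues 131-150 of the native record and of sample 1 (second run).
NATIVE_131_150 = "IPAGAIDPVLEWISSLEDEE"
SAMPLE1_131_150 = "IPDEWIDPVLAYIASKKDED"

if __name__ == "__main__":
    print(substitutions(NATIVE_131_150, SAMPLE1_131_150, start=131))
    print(gained(NATIVE_131_150, SAMPLE1_131_150, "W", start=131))
```

For this stretch the comparison reports W142Y, so the original tryptophan is gone, and it reports a newly gained tryptophan at position 135 (A135W).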