Protein structure search tutorial - glasgowlab/home GitHub Wiki
- This workshop follows a short lecture "Tools and databases to explore protein structure diversity" that contains useful background and examples, and describes limitations of the methods. Please refer to the presentation for a more complete review.
- Keep in mind that the tools may have been updated and some arguments deprecated by the time you read this. Refer to the GitHub page or other manual of each tool (see below).
- All the commands are executed on our (Glasgow lab) server.
In this workshop, we will explore how to fetch protein structures (experimentally determined ones or predicted models) using Foldseek, and structurally align them with US-align. Here are some useful links:
- Foldseek GitHub page
- US-align GitHub page
- Protein Universe Atlas - read more about it in their paper
- ESM Metagenomic Atlas - predicted protein models from metagenomics data. See paper
- InterPro - classifies a protein into domains and more
Here are some useful tools to start exploring phylogenetic relationships:
- Taxonomy browser - search by taxid for taxa (taxid can be reported by Foldseek).
- TimeTree - get divergence time, evolutionary timeline, and build a very simple species relationship (note that some phylogenies might be poorly resolved or outdated).
- Interactive Tree of Life - contains publicly available phylogenetic trees. Can visualize and annotate your own tree (although for annotations I use e.g. FigTree).
Pick input .pdb structure ──► Run Foldseek with desired database ──► Filter ──► Download .pdb files ──► Check manually in PyMOL ──► Run US-align
Note - I already installed Foldseek and some of the databases, so there is no need to run this:
- As a reference, here is how to install Foldseek (note - the most recent stable version is distributed through conda):
#already installed
conda install -c conda-forge -c bioconda foldseek
Foldseek is already installed in the dsenv conda environment. Run this to activate the environment:
conda activate dsenv
- This is how to install a database
#in /ifs/data/home/ds4316/
mkdir pdb_db
cd pdb_db
foldseek databases PDB pdb tmp
IMPORTANT NOTE At least in the current release of Foldseek, the downloaded PDB database must be named pdb (lowercase). If it is named any other way, databases will create broken symlinks in the database folder (on our cluster they appear in red; run ls -l and you will see the symlinks). As a result, --cluster-search 1 will not work and will give an error related to the database, although without this parameter easy-search runs without error for me.
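A quick way to check whether a database folder is affected is to list symlinks whose target does not exist. The following is a minimal sketch using only find and test; list_broken_symlinks is a hypothetical helper name, and the example path is the one used elsewhere in this tutorial.

```shell
#!/bin/bash
# List symbolic links under a directory whose target does not exist.
# list_broken_symlinks is a hypothetical helper, not part of Foldseek.
list_broken_symlinks() {
    local dir="$1"
    # -type l selects symlinks; "test -e" follows the link and fails if the target is missing
    find "$dir" -type l ! -exec test -e {} \; -print
}

# Example (path as used in this tutorial):
# list_broken_symlinks /ifs/share/foldseek_database/pdb_db
```

If the command prints nothing, no broken symlinks were found in that folder.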
Save your input structure as a .pdb (not .cif) - input.pdb. If you have a homo- or heteromultimeric protein, consider using each protomer separately (or just one for a homomeric protein).
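One possible way to pull a single protomer out of a multimeric .pdb file is to keep only the records of one chain. This is a rough sketch, assuming a standard fixed-column PDB file where the chain ID sits in column 22; extract_chain is a hypothetical helper name, and saving the chain from PyMOL works just as well.

```shell
#!/bin/bash
# Write only the ATOM/HETATM records of one chain to stdout, followed by END.
# extract_chain is a hypothetical helper; the chain ID is column 22 of a PDB record.
extract_chain() {
    local pdb_file="$1" chain="$2"
    awk -v c="$chain" '
        /^(ATOM|HETATM)/ && substr($0, 22, 1) == c { print }
        END { print "END" }
    ' "$pdb_file"
}

# Example: extract_chain multimer.pdb A > input.pdb
```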
- Below is one of the example flag combinations I used for one of my problems. Play around with the flags for your specific problem.
foldseek easy-search input.pdb /ifs/share/foldseek_database/pdb_db/pdb input.fsk.raw.txt tmp --tmscore-threshold 0.3 --exhaustive-search 1 --cluster-search 1 -c 0.5 --cov-mode 2 --format-output "query,target,fident,mismatch,gapopen,qstart,qend,tstart,tend,evalue,prob,rmsd,lddt,alntmscore,tlen,alnlen,taxid,taxname,theader"
This will create a file input.fsk.raw.txt with PDB structure names and the other information I specified with the --format-output flag.
Parameters:
- --tmscore-threshold 0.3 only keeps targets that have TM-score ≥ 0.3 when structurally aligned to the query (input.pdb). Note: TM-scores reported by Foldseek differ from those of US-align (TM-align); in the hits I browsed through they were mostly lower.
- --exhaustive-search 1 skips the prefilter and performs an all-vs-all alignment (more sensitive but slower).
- --cluster-search 1 reports all matches from a cluster. By default, Foldseek clusters the database and reports only a representative of the cluster as the hit, omitting the others. This results in e.g. reporting 1z15 but not 1z18, although both have TM-scores ≥ 0.3.
- -c 0.5 --cov-mode 2 only keeps targets that cover at least 50% (-c 0.5) of the query (--cov-mode 2).
- --format-output sets which fields to report (in this order):
  - query - name of query
  - target - name of target including chain and format
  - fident - fraction of sequence identity (identical matches)
  - mismatch - number of mismatches
  - gapopen - number of gap open events (note: this is NOT the number of gap characters)
  - qstart, qend - alignment start and end in query
  - tstart, tend - alignment start and end in target
  - evalue - reported e-value of the match
  - prob - probability for query and target to be homologous (e.g. being within the same SCOPe superfamily)
  - rmsd - RMSD
  - lddt - average LDDT-score of the alignment (more local than TM-score)
  - alntmscore - TM-score
  - tlen - amino acid length of the target hit
  - alnlen - length of alignment
  - taxid - universal taxonomic IDs (useful when making a phylogenetic tree downstream)
  - taxname - (lowest) taxon name (e.g. species name)
  - theader - name of entry from PDB
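Because the filtering commands further down address fields by position, it can help to look up the position of a field from the --format-output string instead of counting by hand. A small sketch; col_index is a hypothetical helper name, and the field list is copied from the search command above.

```shell
#!/bin/bash
# Fields passed to --format-output in the PDB search above
fields="query,target,fident,mismatch,gapopen,qstart,qend,tstart,tend,evalue,prob,rmsd,lddt,alntmscore,tlen,alnlen,taxid,taxname,theader"

# Print the 1-based column number of a field name (col_index is a hypothetical helper)
col_index() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -v name="$2" '$0 == name { print NR }'
}

col_index "$fields" prob        # prints 11
col_index "$fields" alntmscore  # prints 14
```

If you change the field list, rerun col_index and update the $ column ids in the awk commands accordingly.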
- Check the probabilities that the hits fall in the same SCOPe superfamily as the query (higher = more likely). Note that these commands filter by column id - if you output different columns, or the same columns in a different order, you have to change the column ids (denoted by $):
awk '$11 {print $0}' input.fsk.raw.txt
- Check TM-score range:
awk '{print $14}' input.fsk.raw.txt | sort -n -u | sed -n '1p;$p'
- You can filter to keep only high-probability hits (I used 0.95 here as a cutoff) and save them as input.fsk.txt:
awk '$11 >= 0.95 {print $0}' input.fsk.raw.txt > input.fsk.txt
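The same filter can be combined with a TM-score cutoff. The sketch below splits on tabs explicitly, since Foldseek output is tab-separated and the taxname and theader fields can contain spaces. filter_hits is a hypothetical helper name, the 0.5 TM-score cutoff in the example is only an illustration, and columns 11 and 14 assume the --format-output order used in this tutorial.

```shell
#!/bin/bash
# Keep hits with prob >= min_prob AND alntmscore >= min_tm.
# filter_hits is a hypothetical helper; columns 11 (prob) and 14 (alntmscore)
# assume the --format-output order used in this tutorial.
filter_hits() {
    local infile="$1" min_prob="$2" min_tm="$3"
    # "+ 0" forces a numeric comparison; -F '\t' keeps fields with spaces intact
    awk -F '\t' -v p="$min_prob" -v t="$min_tm" \
        '$11 + 0 >= p + 0 && $14 + 0 >= t + 0' "$infile"
}

# Example: filter_hits input.fsk.raw.txt 0.95 0.5 > input.fsk.txt
```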
- Get the .pdb names and save them in pdb_list.txt. Also remember to add your input.pdb to the list of structures used for structural alignment later (you can just echo it in):
#for PDB download needs to be ',' delimited
awk '{print $2}' input.fsk.txt | sort -u | tr '\n' ',' > pdb_list.txt
#remove last ","
sed -i 's/,$//' pdb_list.txt
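The target column holds the name of the target including chain and format, while the batch download expects plain PDB IDs. One way to reduce the names to IDs is sketched below. It rests on the assumption that each target name starts with the 4-character PDB ID (for example 1z15-assembly1.cif.gz_A, a made-up illustration), so check a few lines of your own output first. targets_to_pdb_ids is a hypothetical helper name.

```shell
#!/bin/bash
# Turn Foldseek target names into a comma-separated list of unique PDB IDs.
# ASSUMPTION: the target name (column 2) starts with the 4-character PDB ID,
# e.g. 1z15-assembly1.cif.gz_A. targets_to_pdb_ids is a hypothetical helper.
targets_to_pdb_ids() {
    # paste -sd, joins the lines with commas and leaves no trailing comma
    awk -F '\t' '{ print substr($2, 1, 4) }' "$1" | sort -u | paste -sd, -
}

# Example: targets_to_pdb_ids input.fsk.txt > pdb_list.txt
```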
- Use the batch script (batch_download.sh) from https://www.rcsb.org/docs/programmatic-access/batch-downloads-with-shell-script to batch download the structure files (save the script in your folder). Make it executable if needed - run chmod +x batch_download.sh:
#.cif files (-c)
./batch_download.sh -f pdb_list.txt -c
#decompress only the downloaded .gz archives
gunzip *.gz
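After a batch download it is worth checking that every requested ID actually produced a file. A minimal sketch, assuming the files are named after the ID followed by an extension (as in 1abc.cif); missing_downloads is a hypothetical helper name.

```shell
#!/bin/bash
# Print IDs from a comma-separated list that have no matching file in a folder.
# missing_downloads is a hypothetical helper; it assumes files are named <id>.<extension>.
missing_downloads() {
    local list_file="$1" dir="$2" id
    tr ',' '\n' < "$list_file" | while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue      # skip empty entries (e.g. from a trailing comma)
        ls "$dir"/"$id".* >/dev/null 2>&1 || printf '%s\n' "$id"
    done
}

# Example: missing_downloads pdb_list.txt .
```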
Check if the structures make sense
ESM Atlas contains hundreds of millions of predicted protein models from metagenomics data, which is a great resource to build up the number of sequence-diverse structures. Since these are predicted from shotgun sequencing, they might contain truncations and also contamination from unrelated organisms (e.g. the host organism if sequencing the microbiome of an animal). Also keep in mind that these models are predicted as monomers and there is no information on stoichiometry.
#!/bin/bash
#SBATCH --partition=glab
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=24G
#use --tmscore-threshold 0.65
#-a 1 should allow converting the hits to sequence alignments, but it had no visible effect here, maybe because of --format-output
foldseek easy-search input.pdb /ifs/share/foldseek_database/ESMAtlas30_db/ESMAtlas30_db/esm input.esm.raw.txt tmp --tmscore-threshold 0.65 -c 0.5 --cov-mode 2 -a 1 --format-output "query,target,fident,mismatch,gapopen,qstart,qend,tstart,tend,evalue,prob,rmsd,lddt,alntmscore,tlen,alnlen,theader"
The output is saved in input.esm.raw.txt.
Look at the probability values (each distinct value is listed once, together with its count):
awk '{print $11}' input.esm.raw.txt | sort | uniq -c
TM-score range:
awk '{print $14}' input.esm.raw.txt | sort -n -u | sed -n '1p;$p'
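For large result files, the range can also be computed in a single pass without sorting. A sketch with a hypothetical helper name (column_range); column 14 corresponds to alntmscore in the --format-output order used here.

```shell
#!/bin/bash
# Print the smallest and largest value of one tab-separated column.
# column_range is a hypothetical helper; pass the file and the column number.
column_range() {
    awk -F '\t' -v col="$2" '
        NR == 1 { min = max = $col + 0 }
        { v = $col + 0; if (v < min) min = v; if (v > max) max = v }
        END { if (NR > 0) print min, max }
    ' "$1"
}

# Example: column_range input.esm.raw.txt 14
```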
- Get the .pdb names (save them in ids.txt) and also add "input.pdb" at the end (you can just echo it):
awk -F '\t' '{print $2}' input.esm.raw.txt| awk -F '.' '{print $1}' > ids.txt
- Download the .pdb structures. The script esmdownload.sh downloads the .pdb files with the ESM API and saves them:
#!/bin/bash
#entry names are saved in ids.txt
file="ids.txt"
#Download with ESM API
while IFS= read -r ids || [[ -n "$ids" ]]; do
wget -O "$ids.pdb" "https://api.esmatlas.com/fetchPredictedStructure/$ids" --no-check-certificate
done < "$file"
To make the script executable, run chmod +x esmdownload.sh. Run the script with ./esmdownload.sh. I then moved the downloaded files to the folder ./pdbs.
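A failed request can leave behind an empty file or a file that holds an error message instead of coordinates. The sketch below flags downloaded files without any ATOM record; files_without_atoms is a hypothetical helper name, and ./pdbs is the folder mentioned above.

```shell
#!/bin/bash
# Print downloaded .pdb files that contain no ATOM records (likely failed downloads).
# files_without_atoms is a hypothetical helper name.
files_without_atoms() {
    local dir="$1" f
    for f in "$dir"/*.pdb; do
        [ -e "$f" ] || continue       # the folder has no .pdb files at all
        grep -q '^ATOM' "$f" || printf '%s\n' "$f"
    done
}

# Example: files_without_atoms ./pdbs
```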
- Filter
Open in PyMOL and go through them. Remove suspicious ones (e.g. truncated structures). Can also remove the structures of different protein families (e.g. share the aligned domain but not the rest).
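Before opening everything in PyMOL, a rough residue count per file can point to obviously truncated models. This is only a sketch: it counts C-alpha atoms in ATOM records (atom name in columns 13-16 of the PDB format), does not handle alternate locations, and does not replace the manual check. ca_counts is a hypothetical helper name.

```shell
#!/bin/bash
# Print "<file name> <number of C-alpha atoms>" for every .pdb file in a folder.
# ca_counts is a hypothetical helper; atom names occupy columns 13-16 of an ATOM record.
ca_counts() {
    local dir="$1" f
    for f in "$dir"/*.pdb; do
        [ -e "$f" ] || continue
        awk -v name="$(basename "$f")" '
            substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " { n++ }
            END { print name, n + 0 }
        ' "$f"
    done
}

# Example (shortest structures first): ca_counts ./pdbs | sort -k2,2n | head
```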
You can run structural alignment with US-align in two ways:
- Each entry is aligned only to the reference (e.g. your initial input structure); or
- Global all-vs-all alignment (will produce one global alignment; takes long for many structures)
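The first option can be sketched as a loop of pairwise runs against the reference. The assumptions here are that the USALIGN variable points to the US-align binary (the default below is the path used in this tutorial) and that the installed version supports -outfmt 2 for tabular output; align_to_reference is a hypothetical helper name, so check USalign -h before relying on it.

```shell
#!/bin/bash
# Align every structure in a list to one reference, one pairwise run per entry.
# ASSUMPTIONS: USALIGN points to the US-align binary, and -outfmt 2 (tabular
# output) is available in the installed version.
USALIGN="${USALIGN:-/ifs/share/USalign}"

align_to_reference() {
    local ref="$1" names="$2" dir="$3" name
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue                    # skip blank lines
        [ "$dir/$name" = "$ref" ] && continue         # skip the reference itself
        # drop the "#..." header line that each tabular run prints
        "$USALIGN" "$ref" "$dir/$name" -outfmt 2 | grep -v '^#'
    done < "$names"
}

# Example: align_to_reference ./pdbs/input.pdb names.txt ./pdbs > to_reference.tsv
```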
- Make a list with the .pdb names of the downloaded models from ESMAtlas, including your input, and save it as names.txt. An example looks like this:
input.pdb
entry1.pdb
1abc.pdb
1dea_A.pdb
Note: For the purpose of this tutorial, don't make a list with all of them - just make it with 10 as an example. The all-vs-all alignment is time-consuming for many structures, and the run time does not scale linearly with the input size (a medium-size set of 500 structures can take 3-10 hours).
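A short list like this can be generated rather than typed. The sketch below writes the reference first and then fills up to N names from a folder; make_names_list is a hypothetical helper name, and input.pdb is the query name used in this tutorial.

```shell
#!/bin/bash
# Print N .pdb file names from a folder, with the reference structure first.
# make_names_list is a hypothetical helper; the reference defaults to input.pdb.
make_names_list() {
    local dir="$1" n="$2" ref="${3:-input.pdb}"
    printf '%s\n' "$ref"
    # list .pdb files, drop the reference (already printed), keep N-1 more
    ls "$dir" | grep '\.pdb$' | grep -vxF "$ref" | head -n "$((n - 1))"
}

# Example: make_names_list ./pdbs 10 > names.txt
```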
- Run US-align all-vs-all.
#!/bin/bash
#SBATCH --partition=glab
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=40G
#the folder passed to -dir needs a trailing slash
/ifs/share/USalign -dir /folder/with/pdbs/ /folder/with/names.txt -mm 4 -fast > alignment.txt
- Reformat:
awk 'NR >= 14' alignment.txt | sed '$d' > alignment.format.txt
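Since the number of header lines to strip may differ between US-align versions, a quick sanity check of the reformatted file can catch an off-by-one before loading it into a viewer. The sketch assumes the result is FASTA-like (header lines start with ">"); alignment_summary is a hypothetical helper name. In a proper alignment all sequences have the same aligned length, so distinct_lengths should be 1.

```shell
#!/bin/bash
# Report the number of sequences and the number of distinct aligned lengths.
# ASSUMPTION: the file is FASTA-like; headers start with ">" and a sequence
# may span several lines. alignment_summary is a hypothetical helper.
alignment_summary() {
    awk '
        /^>/ { if (seen) lens[len]++; seen = 1; n++; len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (seen) lens[len]++
            k = 0; for (l in lens) k++
            print "sequences:", n + 0, "distinct_lengths:", k
        }
    ' "$1"
}

# Example: alignment_summary alignment.format.txt
```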
You can now visualize the alignment in your favorite desktop alignment software (a good one is Jalview).