Protein structure search tutorial - glasgowlab/home GitHub Wiki

Protein structure search and alignment

Computational workshop

Daniel Sultanov

June 30 2025

Before we start

  • This workshop follows a short lecture "Tools and databases to explore protein structure diversity" that contains useful background and examples, and describes limitations of the methods. Please refer to the presentation for a more complete review.
  • Keep in mind there might be potential updates and deprecated arguments (if you see this in the future). Refer to a github page or other manual of the tools (see below).
  • All the commands are executed on our (Glasgow lab) server.

Background

In this workshop, we will explore how to fetch protein structures (experimentally determined ones or predicted models) using Foldseek, and structurally align them with US-align. Here are some useful links:

Here are some useful tools to start exploring phylogenetic relationships:

  • Taxonomy browser - search by taxid for taxons (taxid can be reported by Foldseek).
  • TimeTree - get divergence time, evolutionary timeline, and build a very simple species relationship (note that some phylogenies might be poorly resolved or outdated).
  • Interactive Tree of Life - contains publicly available phylogenetic trees. Can vizualize and annotate your own tree (although for annotations I use e.g. FigTree).

Pipeline

 Pick input .pdb structure  ──► Run Foldseek with desired database ──► Filter ──► Download .pdb files ──► Check manually in PyMOL ──► Run US-align 

0. Install Foldseek and databases

Note - I already installed Foldeek and some of the databases - no need to run this:

  1. As a reference, here is how to install Foldseek (note - they put the most recent stable version through conda):
#already installed
conda install -c conda-forge -c bioconda foldseek

Foldseek is already installed in the dsenv conda environment. Run this to activate he environment:

conda activate dsenv
  1. This is how to install a database
#in /ifs/data/home/ds4316/
mkdir pdb_db
cd pdb_db
foldseek databases PDB pdb tmp

IMPORTANT NOTE At least in the current release of Foldseek, the downloaded PDB database must be named pdb (lowercase). If named any other way, databases will create broken simlinks in the database (on our cluster they appear in red; ls -l and you will see simlinks). As a result, --cluster-search 1 will not work and will give an error related to the database, although without this parameter easy-search runs without error for me.

I. Prepare query file

Save your input structure as a .pdb (not .cif) - input.pdb. I you have a homo or heteromultimeric protein, consider using each protomer separately (or just one for a homomeric protein).

IIA. Run Foldseek with PDB

  1. Below is one of the expample flag combinations I was using for one of my problems. Play around with the flags for your specific problem.
foldseek easy-search input.pdb /ifs/share/foldseek_database/pdb_db/pdb input.fsk.raw.txt tmp --tmscore-threshold 0.3 --exhaustive-search 1 --cluster-search 1 -c 0.5 --cov-mode 2 --format-output "query,target,fident,mismatch,gapopen,qstart,qend,tstart,tend,evalue,prob,rmsd,lddt,alntmscore,tlen,alnlen,taxid,taxname,theader"

This will create a file input.fsk.raw.txt with pdb structure names and other information I specified with --format-output flags.

Parameters:

  • --tmscore-threshold 0.3 only keeps targets that have TM-score ≥ 0.3 when structurally aligned to query (input.pdb). Note: TM-scores reported by Foldseek are different (mostly lower from what I just browsed through) from US-align (TM-align).
  • --exhaustive-search 1 skips prefilter and performs an all-vs-all alignment (more sensitive but slower). Note
  • --cluster-search 1 reports all matches from a cluster. By default, Foldseek clusters the database and reports only a representative of the clusteras the hit, ommiting others. This results in e.g. reporting 1z15 but not 1z18 altought both their TM-scores ≥ 0.3.
  • -c 0.5 --cov-mode 2 only keeps targets that cover at leat 50% (-c 0.5) of query (--cov-mode 2)
  • --format-output what fields to report (in this order):
    1. query - name of query
    2. target - name of target including chain and format
    3. fident - fraction of sequence identity (identical matches)
    4. mismatch - number of mismatches
    5. gapopen - Number of gap open events (note: this is NOT the number of gap characters)
    6. qstart,qend - alignment start and end in query
    7. tstart,tend - alignment start and end in target
    8. evalue - reported e-value of the match
    9. prob - probability for query and target to be homologous (e.g. being within the same SCOPe superfamily)
    10. rmsd - RMSD
    11. lddt - average LDDT-score of the alignment (more local than TM-score)
    12. alntmscore - TM-score
    13. tlen - aminoacid length of the target hit
    14. alnlen - length of alignment
    15. taxid - universal taxonomic id's (useful when make a phylogenetic tree downstream)
    16. taxname - (lowest) taxon name (e.g. species name)
    17. theader - name of entry from PDB
  1. Check that probabilities of the hits to be in the same SCOP (higher = more likely). Note that these commands filter by column id - if you output different columns or in different sequence you have to change the columd ids (denoted by $):
awk '$11 {print $0}' input.fsk.raw.txt
  1. Check TM-score range:
awk {'print $14'} input.raw.txt | sort -u | sed -n '1p;$p'
  1. You can filter to keep only high-probability hits (I used 0.95 here as a cutoff) and save as input.fsk.txt:
awk '$11 >= 0.95 {print $0}' input.fsk.raw.txt > input.fsk.txt

IIIA. Download .pdb files from PDB

  1. Get .pdb names (save in ids.txt) and also add your input.pdb in the end don't forget - for structural alignment later (can just echo):
#for PDB download needs to be ',' delimited
awk '{print $2}' | sort -u | tr '\n' ',' > pdb_list.txt
#remove last ","
  1. Use the batch script (batch_download.sh) to batch download .pdb files from https://www.rcsb.org/docs/programmatic-access/batch-downloads-with-shell-script (save it in your folder). Make it executable if needed - run chmod +x batch_download.sh:
#.cif files
./batch_download.sh -f pdb_list.txt -c
gunzip *

IVA. Look at structures manually in PyMOL

Check if the structures make sense

IIB. Run Foldseek with ESMAtlas

ESM Atlas contains hundreds of millions of predicted protein models from metagenomics data which is a great resource to build up the number of sequence-diverse structures. Since these are predicted frmo shotgun sequencing, they might contain truncations and also contaminations from unrelated organisms (e.g. host organism if sequencing microbiome from an e.g. an animal). Also keep in mind these models will be predicted as monomers and there is no information on stoichiometry.

#!/bin/bash
#SBATCH --partition=glab
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12 
#SBATCH --mem=24G

#use --tmscore-threshold 0.65
#-a 1 allows to convert to seq but it didn't do anything maybe bec of --format-output
foldseek easy-search input.pdb /ifs/share/foldseek_database/ESMAtlas30_db/ESMAtlas30_db/esm input.esm.raw.txt tmp --tmscore-threshold 0.65 -c 0.5 --cov-mode 2 -a 1 --format-output "query,target,fident,mismatch,gapopen,qstart,qend,tstart,tend,evalue,prob,rmsd,lddt,alntmscore,tlen,alnlen,theader"

The output is saved in input.esm.raw.txt.

Look at probability values (will not report redundant ones):

awk '{print $11}' input.esm.raw.txt | sort | uniq -c

TM-score range:

awk {'print $14'} input.esm.raw.txt | sort -u | sed -n '1p;$p'

IIIB. Download the .pdb files

  1. Get .pdb names (save in ids.txt) and also add "input.pdb" in the end (can just echo):
awk -F '\t' '{print $2}' input.esm.raw.txt| awk -F '.' '{print $1}' > ids.txt
  1. Download the pdb structures: Script esmdownload.sh to dowload .pdb files with ESM API and save them:
#!/bin/bash

#entry names are saved in ids.txt
file="ids.txt"

#Download with ESM API
while IFS= read -r ids || [[ -n "$ids" ]]; do
    wget -O "$ids.pdb" "https://api.esmatlas.com/fetchPredictedStructure/$ids" --no-check-certificate
done < "$file"

to make executable: chmod +x esmdownload.sh. Run script: ./esmdownload.sh. Moved them to folder ./pdbs.

IVB. Look at structures manually in PyMOL

  1. Filter
    Open in PyMOL and go through them. Remove suspicious ones (e.g. truncated structures). Can also remove the structures of different protein families (e.g. share the aligned domain but not the rest).

V. Run US-align

You can run structural alignment with US-align these ways:

  • Each entry in aligned only to the reference (e.g. your initial input strucutre); or
  • Global all-vs-all alignment (will produce one global alignment; takes long for many structures)
  1. Make a list with .pdb names of the downloaded models from ESMAtlas including your input and save as names.txt . An example looks like this:
input.pdb
entry1.pdb
1abc.pdb
1dea_A.pdb

Note: For the purpose of this tutorial, don't make a list with all of them - just make it with 10 as an example. The all-vs-all alignment is time-consuming for many structures and the time is not linear and depends on the input size (medium-size 500 structures can take from 3-10 hours).

  1. Run USAlign All-vs-all.
#!/bin/bash
#SBATCH --partition=glab
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=40G
/ifs/share/USalign -dir /folder/with/pdbs /folder/with/names.txt -mm 4 -fast > alignment.txt
  1. Reformat:
awk 'NR >= 14' alignment.txt | sed '$d' > alignment.format.txt

You can now visualize the alignment in your favorite desktop alignment software (a good one is Jalview)

⚠️ **GitHub.com Fallback** ⚠️