Folding - polizzilab/software-wiki GitHub Wiki
ColabFold takes a single FASTA file containing every sequence you want to fold. Each record is predicted separately, even though they share one .fasta file:
>seq1
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
>seq2
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
To predict chains together as a complex, put them in one record and separate the chain sequences with a colon (:):
>complex-1
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE:SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
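The FASTA layout above can be written from Python with a few lines. This is a minimal sketch: the helper name `write_colabfold_fasta` and the dictionary layout are illustrative assumptions, not part of ColabFold.

```python
from pathlib import Path


def write_colabfold_fasta(entries, fasta_path):
    """Write a ColabFold input FASTA (hypothetical helper, not part of ColabFold).

    entries maps a record name to a list of chain sequences. A one-chain
    list is folded as a monomer; several chains are joined with ':' so
    they are folded together as a complex.
    """
    lines = []
    for name, chains in entries.items():
        if not chains:
            raise ValueError(f"{name} has no sequences")
        lines.append(f">{name}")
        lines.append(":".join(chains))
    fasta_path = Path(fasta_path)
    fasta_path.write_text("\n".join(lines) + "\n")
    return fasta_path
```

For example, `write_colabfold_fasta({"seq1": [seq], "complex-1": [seq, seq]}, "designs.fasta")` would produce one monomer record and one two-chain complex record.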
import os
import multiprocessing
import subprocess

COLABFOLD_BATCH = "/nfs/sbgrid/programs/x86_64-linux/colabfold/1.5.2/bin/colabfold_batch"


def submit_colabfold(path_to_fasta, output_dir, device=0):
    """
    Runs AF2 in single sequence mode.
    Device is the index of the desired device as it appears in nvidia-smi
    """
    gpu_map = {"CUDA_DEVICE_ORDER": "PCI_BUS_ID", "CUDA_VISIBLE_DEVICES": str(device)}
    # Layer the GPU settings over the current environment so PATH etc. are preserved.
    env = {**os.environ, **gpu_map}
    # An argument list (rather than str.split) keeps paths containing spaces intact.
    cmd = [
        COLABFOLD_BATCH, path_to_fasta, output_dir,
        "--msa-mode", "single_sequence",
        "--overwrite-existing-results",
    ]
    subprocess.run(cmd, env=env)


def worker_submit_colabfold(tup):
    fasta_path, output_dir, idx = tup
    os.makedirs(output_dir, exist_ok=True)
    submit_colabfold(fasta_path, output_dir, idx)


if __name__ == '__main__':
    path_to_designs = '...'

    # Assumes number of design fasta files is less than or equal to the number of GPUs.
    all_fasta_paths = []
    for path in os.listdir(path_to_designs):
        fasta_path = os.path.join(path_to_designs, path)
        if os.path.isfile(fasta_path) and fasta_path.endswith('.fasta'):
            all_fasta_paths.append(fasta_path)

    # Submits one file chunk per GPU.
    jobs = [
        (fasta_path, os.path.join(path_to_designs, f'chunk_{idx}'), idx)
        for idx, fasta_path in enumerate(all_fasta_paths)
    ]
    with multiprocessing.Pool(len(all_fasta_paths)) as p:
        for _ in p.imap(worker_submit_colabfold, jobs):
            pass

AF3 can only run on GPUs with a recent CUDA compute capability (for us, this means only npl1 and np-gpu-2). To run AF3, create an input directory containing the .json files that describe the inputs you wish to fold, and choose an output directory where AF3 will write the results.
Here is the link to the AF3 prediction instructions
Note
The name field must be unique for each input.json file passed to the submit_af3.sh script below.
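Since duplicate names are easy to introduce when generating many inputs, it can help to check the input directory before submitting. The sketch below is an illustration only; `find_duplicate_names` is a hypothetical helper and assumes every .json file in the directory has a top-level "name" field.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicate_names(input_dir):
    """Return {name: [file names]} for every "name" used by more than one .json file."""
    seen = defaultdict(list)
    for json_path in sorted(Path(input_dir).glob("*.json")):
        with open(json_path) as handle:
            spec = json.load(handle)
        seen[spec["name"]].append(json_path.name)
    return {name: files for name, files in seen.items() if len(files) > 1}
```

An empty dictionary means all names in the directory are unique.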
{
  "name": "design_0",
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "unpairedMsa": "",
        "pairedMsa": "",
        "templates": []
      }
    },
    {
      "ligand": {
        "id": "B",
        "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}

To fold a homodimer, give one protein entry a list of chain IDs:

{
  "name": "homodimer-folding",
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "unpairedMsa": "",
        "pairedMsa": "",
        "templates": []
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}

To have AF3 generate an MSA, delete the fields that set the MSA to an empty string. By default it uses JACKHMMER for this, which is really slow. You can pass in a custom MSA by copying the contents of the A3M file into the pairedMsa and unpairedMsa strings. To see the MSA format AF3 produces, run structure prediction with automatic MSA generation and check the output directory for a .json file that contains your input specification (as described here) with the full MSA and any template CIF structures injected into the JSON data.
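The custom-MSA route described above can be scripted. This is a minimal sketch under stated assumptions: `inject_msa` is a hypothetical helper, and it fills both MSA fields with the same A3M text because that is what the paragraph above describes; check the AF3 input documentation for whether that is appropriate for your case.

```python
import json
from pathlib import Path


def inject_msa(json_path, a3m_path, chain_id, out_path):
    """Copy an A3M file's contents into the MSA fields of one protein chain.

    Assumption: both "unpairedMsa" and "pairedMsa" receive the same text,
    following the wiki description rather than any AF3 requirement.
    """
    spec = json.loads(Path(json_path).read_text())
    a3m_text = Path(a3m_path).read_text()
    matched = False
    for entry in spec["sequences"]:
        protein = entry.get("protein")
        if protein is None:
            continue  # skip ligands and other non-protein entries
        ids = protein["id"] if isinstance(protein["id"], list) else [protein["id"]]
        if chain_id in ids:
            protein["unpairedMsa"] = a3m_text
            protein["pairedMsa"] = a3m_text
            matched = True
    if not matched:
        raise KeyError(f"no protein chain with id {chain_id!r}")
    Path(out_path).write_text(json.dumps(spec, indent=2))
    return spec
```

Writing to a separate `out_path` keeps the original input file untouched, which makes it easy to compare runs with and without the custom MSA.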
{
  "name": "msa-folding",
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "templates": []
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}

AF3 only supports templates in CIF format, and only for single chains. You can try converting from PDB to CIF with PyMOL or ProDy, though the converted file may be missing header fields that AF3 requires; compare your converted CIF file to an AF3 output and follow the error messages until AF3 accepts the template. You cannot template a complex without connecting the chains with some kind of linker to make AF3 think it's a single chain.
To use templates with AF3, you have to manually specify the mapping from residue indices in the template to the sequence you're predicting. The queryIndices are the (0-indexed) positions in the sequence you're predicting. The templateIndices are the (0-indexed) positions in the template structure.
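The two index lists can be derived from a pairwise alignment of the query and template sequences. The sketch below is one way to do that; `template_mapping` is a hypothetical helper, and it assumes the template string lists residues in the same order AF3 indexes them in the CIF file.

```python
def template_mapping(aligned_query, aligned_template, gap="-"):
    """Derive 0-indexed queryIndices/templateIndices from a pairwise alignment.

    Both strings must have equal length, with `gap` marking unaligned
    positions. Only columns where both sequences have a residue are mapped.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    query_indices, template_indices = [], []
    q = t = 0  # running positions in the ungapped query and template
    for q_res, t_res in zip(aligned_query, aligned_template):
        if q_res != gap and t_res != gap:
            query_indices.append(q)
            template_indices.append(t)
        if q_res != gap:
            q += 1
        if t_res != gap:
            t += 1
    return {"queryIndices": query_indices, "templateIndices": template_indices}


mapping = template_mapping("SEQ-WEN", "SEQAW-N")
# mapping == {"queryIndices": [0, 1, 2, 3, 5], "templateIndices": [0, 1, 2, 4, 5]}
```

The returned dictionary can be merged into a template entry alongside its mmcifPath.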
{
  "name": "template-folding",
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "unpairedMsa": "",
        "pairedMsa": "",
        "templates": [
          {
            "mmcifPath": "/path/to/cif/file.cif",
            "queryIndices": [0, 1, 2, 4, 5, 6],
            "templateIndices": [0, 1, 2, 3, 4, 8]
          }
        ]
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}

You can submit an AF3 job by making a copy of the following script, which I call submit_af3.sh.
Note
The number of diffusion samples, recycles, and diffusion steps used during the predictions of all the .json input files is controlled in the last line of the script. See the AF3 GitHub for all of the options and how to change them.
#!/bin/bash
# Ensure the script exits on any command failure
set -e

# Usage message
usage() {
    echo "Usage: $0 -i INPUT_DIR -o PROJECT_DIR -d CUDA_VISIBLE_DEVICES"
    exit 1
}

# Parse input arguments
while getopts "i:o:d:" opt; do
    case "${opt}" in
        i)
            INPUT_DIR=${OPTARG}
            ;;
        o)
            PROJECT_DIR=${OPTARG}
            ;;
        d)
            CUDA_VISIBLE_DEVICES=${OPTARG}
            ;;
        *)
            usage
            ;;
    esac
done

# Ensure all required arguments are specified
if [ -z "${INPUT_DIR}" ] || [ -z "${PROJECT_DIR}" ] || [ -z "${CUDA_VISIBLE_DEVICES}" ]; then
    usage
fi

# Export environment variables
export ALPHAFOLD_X=3.0.1
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
# Memory settings used for folding up to 5,120 tokens on A100 80 GB.
export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.6
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}

MODEL_DIR=/nfs/polizzi/shared/programs/structpred/alphafold3/weights/
DATABASE_DIR=/programs/local/alphafold/AF3/

# Run the AlphaFold command (paths are quoted so directories with spaces work)
time run_alphafold.py --input_dir="${INPUT_DIR}" --output_dir="${PROJECT_DIR}" --db_dir="${DATABASE_DIR}" --model_dir="${MODEL_DIR}" --num_recycles=10 --num_diffusion_samples=5 --flash_attention_implementation=xla

Boltz-2 does not work on np-gpu-1.
To run Boltz on npl1 or np-gpu-2, use the executable below. The --use_potentials flag fixes (some) stereochemistry and clashing issues in exchange for slightly longer runtimes. Boltz also has a --devices flag that takes a number, e.g. --devices 8, to automatically parallelize the predictions over multiple (8) GPUs.
Note
If you wish to generate an MSA, you can ping the ColabFold server by deleting the msa: empty line in the .yaml file and running the predict command below with an additional --use_msa_server flag.
/nfs/polizzi/bfry/miniforge3/envs/boltz2_retry/bin/boltz predict {INPUT_DIRECTORY_OR_YAML} --use_potentials --output_format pdb
Warning
Boltz writes a lot of intermediate helper files while performing predictions. If your prediction fails or you stop the inference procedure halfway through and change the input .yaml file or prediction flags, delete any directories generated by Boltz to ensure it doesn't try to reuse those previous predictions/intermediate files.
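Cleaning up stale output can be scripted so it is not forgotten between runs. The following is a sketch only: `clear_boltz_outputs` is a hypothetical helper, and the `boltz_results_` directory prefix is an assumption about how Boltz names its output directories, so confirm it against your own runs before deleting anything.

```python
import shutil
from pathlib import Path


def clear_boltz_outputs(out_dir, prefix="boltz_results_", dry_run=True):
    """List, and optionally delete, Boltz result directories under out_dir.

    Assumption: Boltz output directories start with `prefix`. With
    dry_run=True (the default) nothing is removed; the matching
    directories are only returned for inspection.
    """
    stale = sorted(
        path for path in Path(out_dir).iterdir()
        if path.is_dir() and path.name.startswith(prefix)
    )
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Running it once with the default `dry_run=True` shows what would be deleted; passing `dry_run=False` then removes those directories.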
See the Boltz Prediction Instructions for more info.
Boltz inputs can be encoded as either YAML or FASTA files. The latter has more limited functionality but works like this:
>A|protein|empty
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
>B|smiles
COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N
The equivalent YAML input:

version: 1
sequences:
  - protein:
      id: A
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty
  - ligand:
      id: B
      smiles: 'COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N'

Adding a properties section requests an affinity prediction with chain B as the binder:

version: 1
sequences:
  - protein:
      id: A
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty
  - ligand:
      id: B
      smiles: 'COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N'
properties:
  - affinity:
      binder: B

To fold a homodimer, list both chain IDs:

version: 1
sequences:
  - protein:
      id: [A, B]
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty

For templates, chain_id selects which chain ID from the YAML the template applies to. There is also a cif input option that works the same way as pdb. Unlike AF3, Boltz automatically maps residues between the template and the input sequence. Templates also have a force option:
For any template you provide, you can also specify a force flag which will use a potential to enforce that the backbone does not deviate excessively from the template during the prediction. When using force one must specify also the threshold field which controls the distance (in Angstroms) that the prediction can deviate from the template.
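A forced template can be written into the input programmatically. This is a minimal sketch: `boltz_template_yaml` is a hypothetical helper, the `force` and `threshold` key names come from the description above, and their exact placement in the YAML is an assumption to verify against the Boltz prediction instructions.

```python
def boltz_template_yaml(sequence, template_path, chain_id="A", threshold=None):
    """Build a single-chain Boltz YAML input with one template.

    If threshold (in Angstroms) is given, the template is forced.
    Assumption: `force` and `threshold` sit beside `chain_id` in the
    template entry; check the Boltz docs before relying on this layout.
    """
    key = "cif" if str(template_path).endswith(".cif") else "pdb"
    lines = [
        "version: 1",
        "sequences:",
        "  - protein:",
        f"      id: {chain_id}",
        f"      sequence: {sequence}",
        "      msa: empty",
        "templates:",
        f"  - {key}: {template_path}",
        f"    chain_id: {chain_id}",
    ]
    if threshold is not None:
        lines += ["    force: true", f"    threshold: {threshold}"]
    return "\n".join(lines) + "\n"
```

Calling it without a threshold gives an unforced template entry.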
version: 1
sequences:
  - protein:
      id: A
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty
templates:
  - pdb: /path/to/pdb/file.pdb
    chain_id: A

RF3 does not work on np-gpu-1.
I made a wrapper script to submit RF3 jobs at /nfs/polizzi/bfry/test_rf3/submit_run_rf3.py. Like AF3, RF3 has CLI arguments that change the prediction settings, such as the number of diffusion steps and recycles:
#!/usr/bin/env python3
"""
Usage: python3 submit_run_rf3.py <input_dir> <output_dir> [device_ordinal]
"""
import os
import subprocess
import sys
from pathlib import Path


def main(input_, output_, device):
    input_ = Path(input_)
    output_ = Path(output_)
    output_.mkdir(parents=True, exist_ok=True)

    # Run from the modelforge directory; cwd= replaces pushd/popd, which are bash-only builtins.
    subprocess.run(
        [
            '/nfs/polizzi/bfry/miniforge3/envs/rf3/bin/rf3', 'fold',
            f'inputs={input_.absolute()}',
            f'out_dir={output_.absolute()}',
            'num_steps=50',
            'early_stopping_plddt_threshold=null',
            'diffusion_batch_size=1',
        ],
        cwd='/nfs/polizzi/bfry/programs/modelforge',
        env={**os.environ, 'CUDA_VISIBLE_DEVICES': str(device)},
    )


if __name__ == '__main__':
    if len(sys.argv) < 3:
        sys.exit(__doc__)
    # The device ordinal is optional and defaults to GPU 0.
    device = sys.argv[3] if len(sys.argv) > 3 else '0'
    main(sys.argv[1], sys.argv[2], device)

Like the AF2/ColabFold interface, RF3 lets you encode multiple inputs in one file. This is a .json-formatted list of dictionaries, each formatted as below. Like AF3, all inputs must have unique names.
Here is the link to the RF3 input documentation
[
  {
    "name": "test_1",
    "components": [
      {
        "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "chain_id": "A"
      },
      {
        "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N",
        "chain_id": "B"
      }
    ]
  },
  {
    "name": "test_2",
    "components": [
      {
        "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "chain_id": "A"
      },
      {
        "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N",
        "chain_id": "B"
      }
    ]
  }
]

To use an MSA, give the path to an .a3m file containing the MSA, specified as below.
[
  {
    "name": "test_1",
    "components": [
      {
        "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "msa_path": "/path/to/msa.a3m",
        "chain_id": "A"
      },
      {
        "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N",
        "chain_id": "B"
      }
    ]
  }
]

To fold a homodimer, give the component a list of chain IDs:

[
  {
    "name": "test_1",
    "components": [
      {
        "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
        "chain_id": ["A", "B"]
      }
    ]
  }
]

One interesting feature of RF3 is that you can fold a protein around a fixed ligand conformer. This might be useful for a challenging ligand, such as one whose stereochemistry is sometimes violated by other structure prediction methods.
[
  {
    "name": "iter_001_design_chunk_00_seq_0000",
    "components": [
      {
        "seq": "MSPESKKQKVEDLLSAIVKGDTAAIQSLLSPNARGEDLNTGTRLNSAQEIVDDLKSTVDTYSIESEILSVEVEGNEVTVVTLGRVTASDGSVEVLRVEHVFEFNDDGKINSIRYLELPLG",
        "chain_id": "A"
      },
      {
        "chain_id": "B",
        "path": "/nfs/polizzi/bfry/programs/NISE/design_campaign_exa_ntf2/test-exa.sdf"
      }
    ],
    "ground_truth_conformer_selection": [
      "B"
    ],
    "template_selection": [
      "B"
    ]
  }
]

RF3 uses a custom syntax for describing which regions to template. You pass the template .cif in the same way as the .sdf file in the example above. See the link here for more information.
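The RF3 input list described above can also be generated from Python. The sketch below is illustrative only: `rf3_entry` and `write_rf3_inputs` are hypothetical helpers, they use only the keys that appear in the examples on this page, and they assume a protein on chain A with an optional ligand on chain B.

```python
import json


def rf3_entry(name, sequence, ligand_smiles=None, ligand_path=None, msa_path=None):
    """Build one RF3 input dictionary: protein on chain A, optional ligand on chain B.

    Passing ligand_path (e.g. an .sdf file) fixes the ligand conformer via the
    "ground_truth_conformer_selection" and "template_selection" keys.
    """
    if ligand_smiles and ligand_path:
        raise ValueError("give either ligand_smiles or ligand_path, not both")
    protein = {"seq": sequence, "chain_id": "A"}
    if msa_path:
        protein["msa_path"] = str(msa_path)
    entry = {"name": name, "components": [protein]}
    if ligand_smiles:
        entry["components"].append({"smiles": ligand_smiles, "chain_id": "B"})
    elif ligand_path:
        entry["components"].append({"chain_id": "B", "path": str(ligand_path)})
        entry["ground_truth_conformer_selection"] = ["B"]
        entry["template_selection"] = ["B"]
    return entry


def write_rf3_inputs(entries, json_path):
    """Write a list of entries to one .json file, rejecting duplicate names."""
    names = [entry["name"] for entry in entries]
    if len(names) != len(set(names)):
        raise ValueError("RF3 input names must be unique")
    with open(json_path, "w") as handle:
        json.dump(entries, handle, indent=2)
```

Building the list in code makes it straightforward to give each design a unique name, for example by including its index in the name string.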