AlphaFold2

ColabFold Input Format

ColabFold takes a single FASTA file containing every sequence you want to predict a structure for. Each record is predicted separately, even though the records share one .fasta file:

>seq1
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
>seq2
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE

To predict chains together as one complex, put them in a single record and separate the chain sequences with a colon (:):

>complex-1
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE:SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
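
If you are generating many designs, a small helper can write this format for you. The following is a minimal sketch, not part of ColabFold; the function names fasta_record and build_fasta are hypothetical. It only encodes the two rules above: one record per prediction, and a colon between chains that should be folded together.

```python
def fasta_record(name, chains):
    """Return one ColabFold FASTA record; multiple chains are joined with ':'."""
    if isinstance(chains, str):
        chains = [chains]
    return f">{name}\n{':'.join(chains)}\n"


def build_fasta(jobs):
    """jobs maps a record name to one sequence (monomer) or a list of sequences (complex)."""
    return "".join(fasta_record(name, chains) for name, chains in jobs.items())


if __name__ == "__main__":
    seq = "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE"
    # seq1 is folded alone; complex-1 folds two chains together.
    print(build_fasta({"seq1": seq, "complex-1": [seq, seq]}), end="")
```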

Running ColabFold w/o MSAs

import os
import multiprocessing
import subprocess

def submit_colabfold(path_to_fasta, output_dir, device=0):
    """
    Runs AF2 in single sequence mode.
    Device is the index of the desired device as it appears in nvidia-smi
    """
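    # Extend (rather than replace) the parent environment so PATH, HOME, etc. stay available.
    gpu_map = {**os.environ, "CUDA_DEVICE_ORDER": "PCI_BUS_ID", "CUDA_VISIBLE_DEVICES": str(device)}
    # Pass the arguments as a list so that paths containing spaces are not split.
    subprocess.run(
        [
            "/nfs/sbgrid/programs/x86_64-linux/colabfold/1.5.2/bin/colabfold_batch",
            str(path_to_fasta),
            str(output_dir),
            "--msa-mode", "single_sequence",
            "--overwrite-existing-results",
        ],
        env=gpu_map,
    )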

def worker_submit_colabfold(tup):
    fasta_path, output_dir, idx = tup
    os.makedirs(output_dir, exist_ok=True)
    submit_colabfold(fasta_path, output_dir, idx)

if __name__ == '__main__':
    path_to_designs = '...'
    all_fasta_paths = []

    # Assumes number of design fasta files is less than or equal to the number of GPUs.
    for path in os.listdir(path_to_designs):
        fasta_path = os.path.join(path_to_designs, path)
        if os.path.isfile(fasta_path) and fasta_path.endswith('.fasta'):
            all_fasta_paths.append(fasta_path)

    # Submits one file chunk per GPU.
    with multiprocessing.Pool(len(all_fasta_paths)) as p:
        for _ in p.imap(worker_submit_colabfold, [(fasta_path, os.path.join(path_to_designs, f'chunk_{idx}'), idx) for idx, fasta_path in enumerate(all_fasta_paths)]):
            pass
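
The script above assumes there are no more FASTA files than GPUs. If you instead start from one large multi-record FASTA, one possible way to prepare a chunk per GPU is sketched below. The helper names (read_fasta, chunk_records, format_chunk) are hypothetical and the round-robin split is just one reasonable choice; writing each chunk to its own .fasta file is left to the caller.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        else:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def chunk_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty chunks."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def format_chunk(records):
    """Render a chunk back into FASTA text, one record per prediction."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Each chunk can then be written to its own file and handed to worker_submit_colabfold with a distinct device index.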

AlphaFold3

AF3 only runs on GPUs with a sufficiently recent CUDA compute capability (for us, this means only npl1 and np-gpu-2). To run AF3, create an input directory containing the .json files that describe the inputs you want to fold, and choose an output directory where AF3 will write the results.

Here is the link to the AF3 prediction instructions

AF3 Input Format

Note

The name field must be unique for each input.json file passed to the submit_af3.sh script below.
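
Before submitting a directory of inputs, it can be worth checking the uniqueness requirement up front. The sketch below is a hypothetical helper (find_duplicate_names is not part of AF3) and assumes every .json file in the directory is a single AF3 job with a top-level "name" field, as in the examples in this section.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicate_names(input_dir):
    """Map each repeated "name" value to the sorted list of JSON files that use it."""
    files_by_name = defaultdict(list)
    for json_path in sorted(Path(input_dir).glob("*.json")):
        with open(json_path) as handle:
            name = json.load(handle)["name"]
        files_by_name[name].append(json_path.name)
    # Keep only the names that appear in more than one file.
    return {name: files for name, files in files_by_name.items() if len(files) > 1}
```

An empty dictionary means every input file has its own name.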

Protein Ligand Example (No MSA)

{
	"name": "design_0",
	"sequences": [
		{ "protein": {
			"id": "A",
			"sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
			"unpairedMsa": "",
			"pairedMsa": "",
			"templates": []
		}}, 
		{ "ligand": {
			"id": "B", 
			"smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N"
		}}
	],
	"modelSeeds": [1],
	"dialect": "alphafold3",
	"version": 1
}

Fold Two Copies of the Same Sequence (No MSA)

{
	"name": "homodimer-folding",
	"sequences": [
		{ "protein": {
			"id": ["A", "B"],
			"sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
			"unpairedMsa": "",
			"pairedMsa": "",
			"templates": []
		}}
	],
	"modelSeeds": [1],
	"dialect": "alphafold3",
	"version": 1
}
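
When folding many designs, these JSON files are easier to generate than to write by hand. The following is a minimal sketch that reproduces the two no-MSA layouts above; the function name af3_input and its parameters are hypothetical, and only the fields shown in the examples are emitted.

```python
import json


def af3_input(name, protein_seq, chain_ids=("A",), ligand_smiles=None, ligand_id="B", seed=1):
    """Build a no-MSA, no-template AF3 job dictionary like the examples above."""
    if ligand_smiles is not None and ligand_id in chain_ids:
        raise ValueError(f"ligand id {ligand_id!r} collides with a protein chain id")
    protein = {
        # One id folds a single copy; a list of ids folds one copy per id.
        "id": chain_ids[0] if len(chain_ids) == 1 else list(chain_ids),
        "sequence": protein_seq,
        "unpairedMsa": "",
        "pairedMsa": "",
        "templates": [],
    }
    sequences = [{"protein": protein}]
    if ligand_smiles is not None:
        sequences.append({"ligand": {"id": ligand_id, "smiles": ligand_smiles}})
    return {
        "name": name,
        "sequences": sequences,
        "modelSeeds": [seed],
        "dialect": "alphafold3",
        "version": 1,
    }


if __name__ == "__main__":
    job = af3_input("homodimer-folding", "SEQWENCESEQWENCE", chain_ids=("A", "B"))
    print(json.dumps(job, indent=2))
```

Writing each dictionary with json.dump to its own file, named after the job, gives one input file per design.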

Predict a Structure Using an MSA

To have AF3 generate an MSA, delete the fields that set the MSAs to empty strings. By default AF3 builds the MSA with jackhmmer, which is very slow. To supply a custom MSA instead, copy the contents of the A3M file into the pairedMsa and unpairedMsa strings. To see the MSA format AF3 produces, run a prediction with automatic MSA generation and look in the output directory for a .json file: it contains your input specification, as described here, with the full MSA and any template CIF structures injected into the JSON data.

{
	"name": "msa-folding",
	"sequences": [
		{ "protein": {
			"id": "A",
			"sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
			"templates": []
		}}
	],
	"modelSeeds": [1],
	"dialect": "alphafold3",
	"version": 1
}
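
Pasting an A3M file into JSON by hand is error-prone, so here is a minimal sketch of doing it programmatically. The helper name with_custom_msa is hypothetical; it simply fills the pairedMsa and unpairedMsa fields of one protein chain with whatever A3M text you pass in and leaves the original job dictionary untouched.

```python
import copy


def with_custom_msa(af3_job, chain_id, unpaired_a3m, paired_a3m):
    """Return a copy of af3_job with A3M text set on the protein that has chain_id."""
    job = copy.deepcopy(af3_job)
    for entry in job["sequences"]:
        protein = entry.get("protein")
        if protein is None:
            continue  # skip ligands and other non-protein entries
        ids = protein["id"] if isinstance(protein["id"], list) else [protein["id"]]
        if chain_id in ids:
            protein["unpairedMsa"] = unpaired_a3m
            protein["pairedMsa"] = paired_a3m
            return job
    raise KeyError(f"no protein with chain id {chain_id!r} in job {job.get('name')!r}")
```

The A3M text would typically come from reading the file, for example with pathlib.Path("/path/to/msa.a3m").read_text().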

Predict a Structure Using no MSAs with Templates

AF3 only supports templates in CIF format, and only for single chains. You can try converting a PDB file to CIF with PyMOL or ProDy, but the converted file may be missing header fields that AF3 requires; compare your converted CIF file to an AF3 output and follow the error messages until AF3 accepts the template. You cannot template a complex unless you connect the chains with some kind of linker so that AF3 treats it as a single chain.

To use templates with AF3, you have to manually specify the mapping from the residue indices of the template to the sequence you're predicting. The queryIndices are the (0-indexed) indices into the sequence you're predicting. The templateIndices are the (0-indexed) indices into the template structure. The two lists must have the same length, and entry i of one list is paired with entry i of the other.

{
	"name": "template-folding",
	"sequences": [
		{ "protein": {
			"id": "A",
			"sequence": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
			"unpairedMsa": "",
			"pairedMsa": "",
			"templates": [
				{
					"mmcifPath": "/path/to/cif/file.cif",
					"queryIndices": [0, 1, 2, 4, 5, 6],
					"templateIndices": [0, 1, 2, 3, 4, 8]
				}
			]
		}}
	],
	"modelSeeds": [1],
	"dialect": "alphafold3",
	"version": 1
}
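
The index lists are tedious to write by hand, but they fall out of a pairwise alignment between the query and template sequences. The sketch below assumes you already have the two aligned strings (same length, '-' for gaps) from whatever alignment tool you prefer, and that the ungapped template string matches the residues of the template chain in order. The function name template_index_mapping is hypothetical.

```python
def template_index_mapping(query_aln, template_aln, gap="-"):
    """Return (queryIndices, templateIndices) for columns where both sequences have a residue."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    query_indices, template_indices = [], []
    q = t = 0  # 0-indexed positions in the ungapped query and template
    for q_char, t_char in zip(query_aln, template_aln):
        q_is_residue = q_char != gap
        t_is_residue = t_char != gap
        if q_is_residue and t_is_residue:
            query_indices.append(q)
            template_indices.append(t)
        # Advance each counter only when its sequence has a residue in this column.
        q += q_is_residue
        t += t_is_residue
    return query_indices, template_indices


if __name__ == "__main__":
    # Query residue 3 has no template partner; template residues 5-7 have no query partner.
    print(template_index_mapping("ABCDEF---G", "ABC-EFXYZG"))
```

With the toy alignment in the usage example, the result is the same pair of lists used in the JSON above.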

Submitting AF3 Jobs

You can submit an AF3 job by making a copy of the following script, which I call submit_af3.sh.

Note

The number of diffusion samples, recycles, and diffusion steps used during the predictions of all the .json input files is controlled in the last line of the script. See the AF3 GitHub for all of the options and how to change them.

#!/bin/bash

# Ensure the script exits on any command failure
set -e

# Usage message
usage() {
    echo "Usage: $0 -i INPUT_DIR -o PROJECT_DIR -d CUDA_VISIBLE_DEVICES"
    exit 1
}

# Parse input arguments
while getopts "i:o:d:" opt; do
    case "${opt}" in
        i)
            INPUT_DIR=${OPTARG}
            ;;
        o)
            PROJECT_DIR=${OPTARG}
            ;;
        d)
            CUDA_VISIBLE_DEVICES=${OPTARG}
            ;;
        *)
            usage
            ;;
    esac
done

# Ensure all required arguments are specified
if [ -z "${INPUT_DIR}" ] || [ -z "${PROJECT_DIR}" ] || [ -z "${CUDA_VISIBLE_DEVICES}" ]; then
    usage
fi

# Export environment variables
export ALPHAFOLD_X=3.0.1
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
# Memory settings used for folding up to 5,120 tokens on A100 80 GB.
export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.6
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}

MODEL_DIR=/nfs/polizzi/shared/programs/structpred/alphafold3/weights/
DATABASE_DIR=/programs/local/alphafold/AF3/

# Run the AlphaFold command
time run_alphafold.py --input_dir="${INPUT_DIR}" --output_dir="${PROJECT_DIR}" --db_dir="${DATABASE_DIR}" --model_dir="${MODEL_DIR}" --num_recycles=10 --num_diffusion_samples=5 --flash_attention_implementation=xla

Boltz-2x

Boltz-2 does not work on np-gpu-1.

Running Boltz Structure Prediction

To run Boltz on npl1 or np-gpu-2, use the executable below. The --use_potentials flag fixes (some) stereochemistry and clashing issues in exchange for slightly longer runtimes. Boltz also has a --devices flag that takes a number of GPUs; for example, --devices 8 automatically parallelizes the predictions over 8 GPUs.

Note

If you want to generate an MSA, you can query the ColabFold server by deleting the msa: empty line in the .yaml file and running the predict command below with an additional --use_msa_server flag.

/nfs/polizzi/bfry/miniforge3/envs/boltz2_retry/bin/boltz predict {INPUT_DIRECTORY_OR_YAML} --use_potentials --output_format pdb

Warning

Boltz writes a lot of intermediate helper files while performing predictions. If your prediction fails or you stop the inference procedure halfway through and change the input .yaml file or prediction flags, delete any directories generated by Boltz to ensure it doesn't try to reuse those previous predictions/intermediate files.

Boltz Input Formats

See the Boltz Prediction Instructions for more info.

Boltz inputs can be encoded as either YAML or FASTA files. The latter has more limited functionality but works like this:

FASTA: Protein-Ligand Input

>A|protein|empty
SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
>B|smiles
COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N

YAML: Basic Protein-Ligand Prediction w/o MSAs

version: 1
sequences:
  - protein:
      id: A
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty
  - ligand:
      id: B
      smiles: 'COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N'

YAML: Basic Protein-Ligand Prediction w/o MSAs with Affinity Prediction

version: 1
sequences:
  - protein:
      id: A
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty
  - ligand:
      id: B
      smiles: 'COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N'
properties:
  - affinity:
      binder: B

YAML: Homodimer Prediction w/o MSAs

version: 1
sequences:
  - protein:
      id: [A, B]
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty

YAML: Prediction with Templates

The chain_id field of a template names the chain ID, as specified in the YAML sequences section, that the template applies to. There is also a .cif input option that works the same way. Unlike AF3, Boltz automatically maps the residues between the template and the input sequence. There is a force option for the templates as well:

For any template you provide, you can also specify a force flag which will use a potential to enforce that the backbone does not deviate excessively from the template during the prediction. When using force one must specify also the threshold field which controls the distance (in Angstroms) that the prediction can deviate from the template.

version: 1
sequences:
  - protein:
      id: A
      sequence: SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE
      msa: empty
templates:
  - pdb: /path/to/pdb/file.pdb
    chain_id: A

RoseTTAFold-3

RF3 does not work on np-gpu-1.

Running RF3

I made a wrapper script for submitting RF3 jobs at /nfs/polizzi/bfry/test_rf3/submit_run_rf3.py. As with AF3, CLI arguments control prediction settings such as the number of diffusion steps and recycles:

#!/usr/bin/env python3

"""
Usage: python3 submit_run_rf3.py <input_dir> <output_dir> [device_ordinal]
"""

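import os
import sys
import subprocess
from pathlib import Path

def main(input_, output_, device):

    input_ = Path(input_)
    output_ = Path(output_)

    # Create the output directory (and any missing parents) if needed.
    output_.mkdir(parents=True, exist_ok=True)

    # Run from the modelforge checkout via cwd= instead of pushd/popd, which are
    # bash builtins and are not available in every /bin/sh used by shell=True.
    subprocess.run(
        [
            '/nfs/polizzi/bfry/miniforge3/envs/rf3/bin/rf3', 'fold',
            f'inputs={input_.absolute()}',
            f'out_dir={output_.absolute()}',
            'num_steps=50',
            'early_stopping_plddt_threshold=null',
            'diffusion_batch_size=1',
        ],
        cwd='/nfs/polizzi/bfry/programs/modelforge',
        env={**os.environ, 'CUDA_VISIBLE_DEVICES': str(device)},
    )

if __name__ == '__main__':
    # The device ordinal is optional and defaults to GPU 0.
    device = sys.argv[3] if len(sys.argv) > 3 else '0'

    main(sys.argv[1], sys.argv[2], device)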

RF3 Inputs

Like the AF2/ColabFold interface, RF3 lets you encode multiple inputs in one file. The file is a .json-formatted list of dictionaries, each formatted as shown below. As with AF3, all inputs must have unique names.

Here is the link to the RF3 input documentation

Protein-Ligand Predictions, w/o MSAs.

[
    {
        "name": "test_1",
        "components": [
            {
                "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
                "chain_id": "A"
            },
            {
                "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N",
                "chain_id": "B"
            }
        ]
    },{
        "name": "test_2",
        "components": [
            {
                "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
                "chain_id": "A"
            },
            {
                "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N",
                "chain_id": "B"
            }
        ]
    }
]

Protein-Ligand Predictions, w/ MSAs.

To use an MSA, give the path to an .a3m file containing the MSA in the msa_path field, as shown below.

[
    {
        "name": "test_1",
        "components": [
            {
                "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
                "msa_path": "/path/to/msa.a3m",
                "chain_id": "A"
            },
            {
                "smiles": "COC1=CC=C(C=C1)N2C3=C(CCN(C3=O)C4=CC=C(C=C4)N5CCCCC5=O)C(=N2)C(=O)N",
                "chain_id": "B"
            }
        ]
    }
]

Homodimer Predictions, w/o MSAs.

[
    {
        "name": "test_1",
        "components": [
            {
                "seq": "SEQWENCESEQWENCESEQWENCESEQWENCESEQWENCE",
                "chain_id": ["A", "B"]
            }
        ]
    }
]
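
Since one RF3 file can hold many jobs, it is convenient to build the list in Python and serialize it with the json module, which also avoids syntax slips such as trailing commas. This is a minimal sketch using only the fields shown in the examples above; rf3_job and rf3_job_list are hypothetical helper names.

```python
import json


def rf3_job(name, protein_seq, chain_ids=("A",), ligand_smiles=None, ligand_chain="B", msa_path=None):
    """Build one RF3 job dictionary with a protein and an optional SMILES ligand."""
    protein = {"seq": protein_seq}
    if msa_path is not None:
        protein["msa_path"] = str(msa_path)
    # One chain id folds a single copy; a list folds one copy per id.
    protein["chain_id"] = chain_ids[0] if len(chain_ids) == 1 else list(chain_ids)
    components = [protein]
    if ligand_smiles is not None:
        components.append({"smiles": ligand_smiles, "chain_id": ligand_chain})
    return {"name": name, "components": components}


def rf3_job_list(jobs):
    """Serialize jobs to JSON text after checking that every name is unique."""
    names = [job["name"] for job in jobs]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate job names: {duplicates}")
    return json.dumps(jobs, indent=4)


if __name__ == "__main__":
    jobs = [
        rf3_job("test_1", "SEQWENCESEQWENCE", ligand_smiles="CCO"),
        rf3_job("test_2", "SEQWENCESEQWENCE", chain_ids=("A", "B")),
    ]
    print(rf3_job_list(jobs))
```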

Ligand Conformer Fixed Inference

One interesting feature of RF3 is that it can fold a protein around a fixed ligand conformer. This may be useful for challenging ligands, such as those whose stereochemistry is sometimes violated by other structure prediction methods.

[
    {
        "name": "iter_001_design_chunk_00_seq_0000",
        "components": [
            {
                "seq": "MSPESKKQKVEDLLSAIVKGDTAAIQSLLSPNARGEDLNTGTRLNSAQEIVDDLKSTVDTYSIESEILSVEVEGNEVTVVTLGRVTASDGSVEVLRVEHVFEFNDDGKINSIRYLELPLG",
                "chain_id": "A"
            },
            {
                "chain_id": "B",
                "path": "/nfs/polizzi/bfry/programs/NISE/design_campaign_exa_ntf2/test-exa.sdf"
            }
        ],
        "ground_truth_conformer_selection": [
            "B"
        ],
        "template_selection": [
            "B"
        ]
    }
]

Regular protein structure templating

RF3 uses a custom syntax for describing which regions to template. Pass the template .cif file in the same way as the .sdf file in the example above. See the link here for more information.

Folding - polizzilab/software-wiki GitHub Wiki
