Options and Usages - labgem/ASMC GitHub Wiki

Command line options

usage: run_asmc.py [-h] [-o] [-t] [-l] [--end {pocket,modeling,alignment,clustering,logo}] [-r] [-p] [--chain] (-s | -m | -M | -a ) [--id] [-n] [-e] [--min-samples] [--test {0,1}] [-w] [--prefix] [--format] [--resolution] [--units]

Help message:

options:
  -h, --help            show this help message and exit
  -o , --outdir         output directory [default: ./]
  -t , --threads        number of cpu threads [default: 6]
  -l , --log            log file path, if it's not provided the logs are display in the stdout
  --end {pocket,modeling,alignment,clustering,logo}
                        indicates at which step to stop [default: logo]

Reference Structures options:
  -r , --ref            file containing paths to all references
  -p , --pocket         file indicating for each reference, the chain and the active site positions. If no file is provided, P2RANK is run to detect pockets
  --chain               specifies chains for pocket search, separated by ',' only used if --pocket isn't provided [default: all]

Targets options:
  If --seqs is given, homology modeling is performed. If --models is given, homology modeling is not performed and if --active-sites or --msa is given just the clustering is performed

  -s , --seqs           multi fasta file
  -m , --models         file containing paths to all models and for each model, his reference
  -M , --msa            file indicating active site positions for each reference, identity_targets_refs path and the path of a MSA
  -a , --active-sites   active sites alignment in fasta format, can be used to create subgroup
  --id                  percent identity cutoff between target and reference to build a model of the target, only used with -s, --seqs [default: 30.0]
  -n , --nb-models      number of target models generated by MODELLER

Clustering options:
  -e , --eps            maximum distance between two samples for them to be considered neighbours [0,1] [default: auto]
  --min-samples         the number of samples in a neighbourhood for a point to be considered as a core point [default: auto]
  --test {0,1}          0: use the --eps value, 1: test different values [default: 0]
  -w , --weighted-pos   pocket position with more weight for clustering, positions are numbered from 1 to the total number of positions. To give several positions, separate them with commas, e.g: 1,6,12

Sequence logo options:
  --prefix              prefix for logo title before the cluster id [default: G]
  --format              file format for output logos, 'eps' or 'png' [default: 'png']
  --resolution          image resolution (png only), 150, 300 or 600 dpi [default: 300]
  --units               The units used for the y-axis [default: bits]: bits, nats, probability, kT, kJ/mol or kcal/mol
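
As an illustration of the --weighted-pos format described above, the sketch below parses a comma-separated string such as 1,6,12 into zero-based indices and rejects positions outside the pocket. The function name parse_weighted_positions and its n_positions parameter are hypothetical, and this is not ASMC's own parsing code.

```python
def parse_weighted_positions(spec: str, n_positions: int) -> list[int]:
    """Turn a 1-based, comma-separated position string into 0-based indices.

    Hypothetical helper for illustration; not part of ASMC.
    """
    indices = []
    for token in spec.split(","):
        position = int(token.strip())  # raises ValueError on non-numeric input
        if not 1 <= position <= n_positions:
            raise ValueError(f"position {position} is outside 1..{n_positions}")
        indices.append(position - 1)  # convert to a 0-based index
    return indices


if __name__ == "__main__":
    # Example string from the help message, for a pocket of 12 positions.
    print(parse_weighted_positions("1,6,12", n_positions=12))
```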

Usages

ASMC aims to highlight the amino acid diversity within the active site of a given homologous protein family. To achieve this, ASMC requires at least one reference protein structure; as we shall see later, increasing the number of reference structures yields more relevant groups. Reference structures should be selected carefully according to the biological context (open/closed state, monomer, ligands, enzymatic results...), and priority should be given to high-resolution holo structures, namely protein-ligand complexes, since their active sites are often better characterized.

The ASMC pipeline is designed as a user-friendly automated framework for modeling and clustering the active sites of homologous proteins. ASMC can also be executed in several ways, depending on the user's objective:

  • ASMC default
  • ASMC with user-refined pocket(s) - RECOMMENDED
  • Pocket Search
  • Homology Modeling
  • Structural Alignment
  • Clustering
    • MSA Clustering
    • Re-Clustering

Output Summary

| Output | ASMC default | ASMC with user-refined pocket(s) | Pocket Search | Homology Modeling | Structural Alignment | MSA Clustering | Re-Clustering |
|---|---|---|---|---|---|---|---|
| pocket.csv | :white_check_mark: | | :white_check_mark: | * | * | | |
| prank_outputs | :white_check_mark: | | :white_check_mark: | * | * | | |
| models/ | :white_check_mark: | :white_check_mark: | | :white_check_mark: | * | | |
| models.txt | :white_check_mark: | :white_check_mark: | | :white_check_mark: | * | | |
| identity_targets_ref.tsv | :white_check_mark: | :white_check_mark: | | :white_check_mark: | * | | |
| pairwise/ | :white_check_mark: | :white_check_mark: | | | :white_check_mark: | | |
| superposition/ | :white_check_mark: | :white_check_mark: | | | :white_check_mark: | | |
| active_site_alignment.fasta | :white_check_mark: | :white_check_mark: | | | :white_check_mark: | | |
| groups_x_min_y.tsv | :white_check_mark: | :white_check_mark: | | | | :white_check_mark: | :white_check_mark: |
| groups_logo.png | :white_check_mark: | :white_check_mark: | | | | :white_check_mark: | :white_check_mark: |
| fasta file for each group | :white_check_mark: | :white_check_mark: | | | | :white_check_mark: | :white_check_mark: |

(*): this output will be created depending on whether the user provides specific inputs. See sections below for more details.
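
To see quickly which of these outputs a run produced, a small script can check the output directory. The sketch below is an illustration only. It assumes that every output sits directly under the directory given to -o, and that x and y in groups_x_min_y.tsv are placeholders filled in at run time, which is why that file is matched with a glob pattern. The function name list_outputs is hypothetical.

```python
from pathlib import Path

# File and directory names taken from the Output Summary above.
FIXED_OUTPUTS = [
    "pocket.csv",
    "prank_outputs",
    "models",
    "models.txt",
    "identity_targets_ref.tsv",
    "pairwise",
    "superposition",
    "active_site_alignment.fasta",
    "groups_logo.png",
]
# Assumption: x and y are placeholders, so the file is matched by pattern.
GROUPS_PATTERN = "groups_*_min_*.tsv"


def list_outputs(outdir: str) -> dict[str, bool]:
    """Report which summary outputs exist directly under outdir."""
    root = Path(outdir)
    found = {name: (root / name).exists() for name in FIXED_OUTPUTS}
    found[GROUPS_PATTERN] = any(root.glob(GROUPS_PATTERN))
    return found


if __name__ == "__main__":
    for name, present in list_outputs("./").items():
        print(f"{'found  ' if present else 'missing'}  {name}")
```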

ASMC default

Run ASMC in a blind way (unknown active site) including Pocket Search, Homology Modeling, Structural Alignment and Clustering.

User must provide a reference file relating to protein reference(s) and a set of homologous protein fasta sequences.

python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -s sequences.fasta

The algorithm will set the best-ranked P2RANK pocket as the pocket reference to be used for both the structural alignment and clustering steps.

However, we recommend that users carefully define their active site by following the Pocket Search step before running ASMC with the protocol presented hereafter.

ASMC with user-refined pocket(s) - RECOMMENDED

Skip Pocket Search and run Homology Modeling, Structural Alignment and Clustering. It is advisable to manually define the active site positions, based on the literature and/or your own expertise.

User must provide a reference file relating to protein reference(s), a pocket csv file and a set of homologous protein fasta sequences.

python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -s sequences.fasta

ASMC with specific objectives

The ASMC workflow allows each step to be performed independently, depending on the user's objective:

  • identify protein pockets (Pocket Search)
  • generate 3D models (Homology Modeling)
  • align 3D models on reference structure(s) (Structural Alignment)
  • cluster structure- or MSA-based active sites (Clustering).
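
Each of these stop points corresponds to a value of the --end option listed in the help message. When ASMC is driven from a script, the command line can be assembled programmatically. The sketch below uses only options documented on this page. The helper build_asmc_command is hypothetical and not part of ASMC, and the ASMC/run_asmc.py path simply mirrors the example commands.

```python
import shlex

# Values accepted by --end according to the help message.
END_STEPS = ("pocket", "modeling", "alignment", "clustering", "logo")


def build_asmc_command(ref, seqs=None, models=None, pocket=None,
                       end=None, log="run_asmc.log", threads=6):
    """Assemble a run_asmc.py argument list from documented options only.

    Hypothetical helper for illustration; not part of ASMC.
    """
    if (seqs is None) == (models is None):
        raise ValueError("give exactly one of seqs (-s) or models (-m)")
    if end is not None and end not in END_STEPS:
        raise ValueError(f"--end must be one of {END_STEPS}")

    cmd = ["python", "ASMC/run_asmc.py", "--log", log,
           "--threads", str(threads), "-r", ref]
    if pocket is not None:
        cmd += ["-p", pocket]
    cmd += ["-s", seqs] if seqs is not None else ["-m", models]
    if end is not None:
        cmd += ["--end", end]
    return cmd


if __name__ == "__main__":
    cmd = build_asmc_command("reference_file", models="models.txt",
                             pocket="pocket.csv", end="alignment")
    print(shlex.join(cmd))
```

The returned list can be passed as-is to subprocess.run(cmd, check=True), which avoids shell quoting problems with file names.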

Pocket Search

Stop ASMC after the Pocket Search - option --end pocket.

User must provide a reference file relating to protein reference(s) and a set of homologous protein fasta sequences.

python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -s sequences.fasta --end pocket

Homology Modeling

Stop ASMC after the Homology Modeling - option --end modeling.

User must provide a reference file relating to protein reference(s), a pocket csv file and a set of homologous protein fasta sequences.

python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -s sequences.fasta --end modeling 

Structural Alignment

Stop ASMC after the Structural Alignment - option --end alignment.

User must provide a reference file relating to protein reference(s), a pocket csv file and a model file relating to 3D models obtained with MODELLER, AlphaFold or another method (PDB format).

python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -m models.txt --end alignment

Clustering

Run the Clustering step starting from either a list of models, a Multiple Sequence Alignment (MSA) or an ASMC output group.

List of Models

User wants to run ASMC clustering starting from existing 3D models and structural references with a known pocket.

User must provide the pocket and the model files with the options -p and -m, respectively.

python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -m models.txt

MSA Clustering

First, user must calculate the identity percentage between targets and reference(s). This step is optional if there is only one reference and mandatory if there are several.

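User must provide an MSA in fasta format (ASMC does not build the alignment) and a specific file, passed with the --msa option, that indicates the active site positions for each reference, the identity_targets_refs path and the path of the MSA.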

python ASMC/run_asmc.py --log run_asmc.log --threads 6 --msa msa.txt
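
The identity percentage between a target and a reference can be computed directly from two rows of the MSA. The sketch below shows one common definition: identical residues divided by the number of columns where neither sequence has a gap. This definition is an assumption, the function percent_identity is hypothetical, and ASMC's own procedure may use a different denominator, so the project's instructions take precedence.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of the same alignment.

    Assumed definition: identical residues / columns without a gap in
    either sequence. Hypothetical helper; not part of ASMC.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Made-up aligned sequences: 5 gap-free columns, 4 of them identical.
    print(percent_identity("MKT-LV", "MRTALV"))
```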

Re-Clustering

Warning: user must provide a directory name with the -o option when using these command lines in order to avoid erasing the previous files. If the directory doesn't exist, run_asmc.py will create it on the fly.

- with different DBSCAN parameters

User must provide the active site alignment file generated by the ASMC pipeline, using the -a option, and values for the --eps and --min-samples options (e.g., 0.1 and 15, respectively).

python ASMC/run_asmc.py -o output_directory --log run_asmc.log --threads 6 -a active_sites_alignment.fasta --eps 0.1 --min-samples 15
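
To build intuition for how --eps and --min-samples interact, the sketch below flags which active-site sequences would count as DBSCAN core points. It is an illustration under two stated assumptions: the distance is the fraction of mismatching positions (the distance ASMC actually uses is not described on this page), and a sample counts itself among its neighbours, following the scikit-learn convention. The function names and the example sequences are made up.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length sequences.

    Assumed distance for illustration; ASMC may use a different measure.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def core_points(sequences: dict[str, str], eps: float, min_samples: int) -> list[str]:
    """Names of sequences with at least min_samples neighbours within eps."""
    core = []
    for name, seq in sequences.items():
        # The sequence itself is at distance 0, so it is counted as well.
        neighbours = sum(
            mismatch_fraction(seq, other) <= eps for other in sequences.values()
        )
        if neighbours >= min_samples:
            core.append(name)
    return core


if __name__ == "__main__":
    # Made-up five-residue active sites.
    sites = {"t1": "GDSHA", "t2": "GDSHA", "t3": "GDSHG", "t4": "AAAAA"}
    print(core_points(sites, eps=0.2, min_samples=3))
```

Lowering --eps or raising --min-samples makes core points rarer, which tends to produce smaller groups and more unclustered sequences.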

- with an existing ASMC group

User must provide the active site alignment file generated by the ASMC pipeline for the queried ASMC group, using the -a option (e.g., for the G2 group).

python ASMC/run_asmc.py -o group_split --log run_asmc.log --threads 6 -a G2.fasta
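
Before re-clustering a group, it can help to confirm that the group file is a valid alignment, that is, that all active-site sequences have the same length. The sketch below does this with the standard library only. The helpers read_fasta and alignment_length are hypothetical and not part of ASMC, and the sequences in the demo are made up.

```python
import os
import tempfile


def read_fasta(path: str) -> dict[str, str]:
    """Read a FASTA file into {header: sequence}; later duplicates overwrite."""
    records: dict[str, str] = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                records[name] = ""
            elif name is not None:
                records[name] += line  # sequences may span several lines
    return records


def alignment_length(records: dict[str, str]) -> int:
    """Return the common sequence length, or raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"expected one sequence length, found {sorted(lengths)}")
    return lengths.pop()


if __name__ == "__main__":
    # Demo on a throwaway file with made-up sequences.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "G2.fasta")
        with open(path, "w") as handle:
            handle.write(">t1\nGDSHA\n>t2\nGDSHG\n")
        records = read_fasta(path)
        print(len(records), "sequences of length", alignment_length(records))
```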

How to deal with ASMC outputs

When the ASMC process is complete, you should get two important output files: groups_logo.png, displaying the sequence logos for each ASMC group in a column, and groups_x_min_y.tsv from which you can obtain some interesting information.

  • search for specific amino acids at a certain active site position (extract_aa.py).
  • remove duplicated active site sequences (stats.py).
  • compare with another groups_x_min_y.tsv file, e.g., between structure- and MSA-based clustering (compare_active_site.py).
  • format as CSV to facilitate manual investigation in a spreadsheet program (groups_tsv_to_csv).
  • visualize with Pymol a specific target superimposed on its reference structure, by following the steps described here.
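
For the CSV conversion mentioned above, the repository's groups_tsv_to_csv script is the documented route. As a generic equivalent, the sketch below converts any tab-separated file to CSV with the standard csv module. It makes no assumption about the columns of groups_x_min_y.tsv and is not the script's implementation. The function name tsv_to_csv is hypothetical.

```python
import csv


def tsv_to_csv(tsv_path: str, csv_path: str) -> int:
    """Copy a tab-separated file to a comma-separated one; return the row count."""
    rows = 0
    with open(tsv_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # quotes fields containing commas when needed
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

A call such as tsv_to_csv("groups_x_min_y.tsv", "groups.csv") would then write a file that opens directly in a spreadsheet program.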