Options and Usages - labgem/ASMC GitHub Wiki
Command line options
usage: run_asmc.py [-h] [-o] [-t] [-l] [--end {pocket,modeling,alignment,clustering,logo}] [-r] [-p] [--chain] (-s | -m | -M | -a ) [--id] [-n] [-e] [--min-samples] [--test {0,1}] [-w] [--prefix] [--format] [--resolution] [--units]
Help message:
options:
-h, --help show this help message and exit
-o , --outdir output directory [default: ./]
-t , --threads number of cpu threads [default: 6]
-l , --log log file path, if it's not provided the logs are display in the stdout
--end {pocket,modeling,alignment,clustering,logo}
indicates at which step to stop [default: logo]
Reference Structures options:
-r , --ref file containing paths to all references
-p , --pocket file indicating for each reference, the chain and the active site positions. If no file is provided, P2RANK is run to detect pockets
--chain specifies chains for pocket search, separated by ',' only used if --pocket isn't provided [default: all]
Targets options:
If --seqs is given, homology modeling is performed. If --models is given, homology modeling is not performed and if --active-sites or --msa is given just the clustering is performed
-s , --seqs multi fasta file
-m , --models file containing paths to all models and for each model, his reference
-M , --msa file indicating active site positions for each reference, identity_targets_refs path and the path of a MSA
-a , --active-sites active sites alignment in fasta format, can be used to create subgroup
--id percent identity cutoff between target and reference to build a model of the target, only used with -s, --seqs [default: 30.0]
-n , --nb-models number of target models generated by MODELLER
Clustering options:
-e , --eps maximum distance between two samples for them to be considered neighbours [0,1] [default: auto]
--min-samples the number of samples in a neighbourhood for a point to be considered as a core point [default: auto]
--test {0,1} 0: use the --eps value, 1: test different values [default: 0]
-w , --weighted-pos pocket position with more weight for clustering, positions are numbered from 1 to the total number of positions. To give several positions, separate them with commas, e.g: 1,6,12
Sequence logo options:
--prefix prefix for logo title before the cluster id [default: G]
--format file format for output logos, 'eps' or 'png' [default: 'png']
--resolution image resolution (png only), 150, 300 or 600 dpi [default: 300]
--units The units used for the y-axis [default: bits]: bits, nats, probability, kT, kJ/mol or kcal/mol
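The `(-s | -m | -M | -a )` group in the usage line means that exactly one target option must be given. As a rough illustration of that behaviour, here is a minimal `argparse` sketch covering a small subset of the options listed above. It is not the actual source of run_asmc.py. The helper `first_target_step` is a hypothetical name, and the step it returns is inferred from the "Targets options" description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for a subset of the documented options (not ASMC's code)."""
    parser = argparse.ArgumentParser(prog="run_asmc.py")
    parser.add_argument("-o", "--outdir", default="./")
    parser.add_argument("-t", "--threads", type=int, default=6)
    parser.add_argument(
        "--end",
        choices=["pocket", "modeling", "alignment", "clustering", "logo"],
        default="logo",
    )
    # Exactly one of the four target options is required, as in the usage line.
    targets = parser.add_mutually_exclusive_group(required=True)
    targets.add_argument("-s", "--seqs")
    targets.add_argument("-m", "--models")
    targets.add_argument("-M", "--msa")
    targets.add_argument("-a", "--active-sites")
    return parser


def first_target_step(args: argparse.Namespace) -> str:
    """Hypothetical helper: name the first step applied to the targets."""
    if args.seqs is not None:
        return "modeling"      # sequences given: homology modeling is performed
    if args.models is not None:
        return "alignment"     # models given: homology modeling is skipped
    return "clustering"        # --msa or --active-sites: only clustering


if __name__ == "__main__":
    parsed = build_parser().parse_args(["-s", "sequences.fasta"])
    print(first_target_step(parsed))
```

With this sketch, passing two target options at once (for example `-s` and `-m`) makes `argparse` exit with a usage error.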
Usages
ASMC aims to highlight the amino acid diversity of the active site within a given homologous protein family. To achieve this, ASMC requires at least one reference protein structure; as we shall see later, more relevant groups are obtained by increasing the number of reference protein structures. Reference structures should be carefully selected according to the biological context (open/closed state, monomer, ligands, enzymatic results...), and priority should be given to high-resolution holo structures, namely protein-ligand complexes, since their active sites are often better characterized.
The ASMC pipeline is designed as a user-friendly automated framework for modeling and clustering homologous protein active sites. It can also be run in several ways, depending on the user's objective:
- ASMC default
- ASMC with user-refined pocket(s) - RECOMMENDED
- Pocket Search
- Homology Modeling
- Structural Alignment
- Clustering
- MSA Clustering
- Re-Clustering
Output Summary
ASMC default | ASMC with user-refined pocket(s) | Pocket Search | Homology Modeling | Structural Alignment | MSA Clustering | Re-Clustering | |
---|---|---|---|---|---|---|---|
pocket.csv | :white_check_mark: | :white_check_mark: | * | * | |||
prank_outputs | :white_check_mark: | :white_check_mark: | * | * | |||
models/ | :white_check_mark: | :white_check_mark: | :white_check_mark: | * | |||
models.txt | :white_check_mark: | :white_check_mark: | :white_check_mark: | * | |||
identity_targets_ref.tsv | :white_check_mark: | :white_check_mark: | :white_check_mark: | * | |||
pairwise/ | :white_check_mark: | :white_check_mark: | :white_check_mark: | ||||
superposition/ | :white_check_mark: | :white_check_mark: | :white_check_mark: | ||||
active_site_alignment.fasta | :white_check_mark: | :white_check_mark: | :white_check_mark: | ||||
groups_x_min_y.tsv | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | |||
groups_logo.png | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | |||
fasta file for each group | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
(*): this output will be created depending on whether the user provides specific inputs. See sections below for more details.
ASMC default
Run ASMC in a blind way (unknown active site) including Pocket Search, Homology Modeling, Structural Alignment and Clustering.
User must provide a reference file relating to protein reference(s) and a set of homologous protein fasta sequences.
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -s sequences.fasta
The algorithm will set the best-ranked P2RANK pocket as the pocket reference to be used for both the structural alignment and clustering steps.
However, we recommend that users carefully define their active site by following the Pocket Search step before running ASMC with the protocol presented hereafter.
ASMC with user-refined pocket(s) - RECOMMENDED
Skip Pocket Search and run Homology Modeling, Structural Alignment and Clustering. It is advisable to manually define the active site positions, based on the literature and/or your own expertise.
User must provide a reference file relating to protein reference(s), a pocket csv file and a set of homologous protein fasta sequences.
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -s sequences.fasta
ASMC with specific objectives
The ASMC workflow allows each step to be performed independently, depending on the user's objective:
- identify protein pockets (Pocket Search)
- generate 3D models (Homology Modeling)
- align 3D models on reference structure(s) (Structural Alignment)
- cluster structure- or MSA-based active sites (Clustering).
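Since the per-step command lines differ only in a few flags, they can be assembled programmatically. The sketch below builds the argument list from the flags documented on this page (`-r`, `-p`, `-s`, `-m`, `--end`, `--log`, `--threads`). The function name `build_asmc_command` and its keyword names are hypothetical, and the script path `ASMC/run_asmc.py` is simply the one used in the examples here.

```python
from typing import List, Optional

# Values accepted by --end, as listed in the help message.
END_STEPS = ("pocket", "modeling", "alignment", "clustering", "logo")


def build_asmc_command(
    reference_file: str,
    *,
    seqs: Optional[str] = None,
    models: Optional[str] = None,
    pocket: Optional[str] = None,
    end: Optional[str] = None,
    threads: int = 6,
    log: str = "run_asmc.log",
    script: str = "ASMC/run_asmc.py",
) -> List[str]:
    """Hypothetical helper: assemble a run_asmc.py command as an argument list."""
    if (seqs is None) == (models is None):
        raise ValueError("give exactly one of 'seqs' or 'models'")
    if end is not None and end not in END_STEPS:
        raise ValueError(f"unknown step for --end: {end!r}")
    cmd = ["python", script, "--log", log, "--threads", str(threads),
           "-r", reference_file]
    if pocket is not None:
        cmd += ["-p", pocket]          # user-defined pocket: skips P2RANK
    cmd += ["-s", seqs] if seqs is not None else ["-m", models]
    if end is not None:
        cmd += ["--end", end]          # stop after the requested step
    return cmd


if __name__ == "__main__":
    # Reproduces the Pocket Search command line.
    print(" ".join(build_asmc_command("reference_file",
                                      seqs="sequences.fasta", end="pocket")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run from a script.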
Pocket Search
Stop ASMC after the Pocket Search step with the option --end pocket.
User must provide a reference file relating to protein reference(s) and a set of homologous protein fasta sequences.
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -s sequences.fasta --end pocket
Homology Modeling
Stop ASMC after the Homology Modeling step with the option --end modeling.
User must provide a reference file relating to protein reference(s), a pocket csv file and a set of homologous protein fasta sequences.
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -s sequences.fasta --end modeling
Structural Alignment
Stop ASMC after the Structural Alignment step with the option --end alignment.
User must provide a reference file relating to protein reference(s), a pocket csv file and a model file relating to 3D models obtained with MODELLER, AlphaFold or another method (PDB format).
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -m models.txt --end alignment
Clustering
Run the Clustering step starting from either a list of models, a Multiple Sequence Alignment (MSA) or an ASMC output group.
List of Models
Use this mode to run the ASMC clustering starting from existing 3D models and structural references with a known pocket.
User must provide the pocket and the model files with the options -p and -m, respectively.
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -m models.txt
MSA Clustering
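First, the user must calculate the identity percentage between targets and reference(s). This step is optional if there is only one reference and mandatory if there are several.
User must provide an MSA in fasta format (the alignment itself is not performed by ASMC) and a specific file, passed with the --msa option, that indicates the active site positions for each reference, the identity_targets_refs path and the path of the MSA.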
python ASMC/run_asmc.py --log run_asmc.log --threads 6 --msa msa.txt
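The identity percentage between a target and a reference, mentioned at the start of this section, can be computed from two rows of the MSA. The sketch below is one possible way to do it and rests on an assumption: identity is counted over the columns where neither sequence has a gap. ASMC may use a different convention, and the layout of the identity_targets_refs file is not described here, so the sketch does not write it. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: only columns where both sequences carry a residue are
    compared; other conventions (e.g. dividing by the alignment length)
    give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a.upper() == res_b.upper():
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Toy aligned pair: 3 identical residues over 4 compared columns.
    print(percent_identity("ACDE-", "ACDFG"))
```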
Re-Clustering
Warning: user must provide a directory name with the -o option when using these command lines in order to avoid erasing the previous files. If the directory doesn't exist, run_asmc.py will create it on the fly.
- with different DBSCAN parameters
User must provide the active site alignment file generated by the ASMC pipeline, using the -a option, and values for the --eps and --min-samples options (e.g., 0.1 and 15, respectively).
python ASMC/run_asmc.py -o output_directory --log run_asmc.log --threads 6 -a active_sites_alignment.fasta --eps 0.1 --min-samples 15
- with an existing ASMC group
User must provide the active site alignment file generated by the ASMC pipeline for the queried ASMC group, using the -a option (e.g., for the G2 group).
python ASMC/run_asmc.py -o group_split --log run_asmc.log --threads 6 -a G2.fasta
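To get a feel for how `--eps` and `--min-samples` interact when re-clustering with different DBSCAN parameters, the sketch below marks which active site sequences would be core points. It relies on two assumptions: the distance is the fraction of differing positions between two active site sequences (the distance actually used by ASMC is not described on this page), and the neighbourhood count includes the point itself, as in scikit-learn's DBSCAN. The function names are hypothetical, and this is only the core-point test, not a full clustering.

```python
from typing import Dict, List


def mismatch_fraction(site_a: str, site_b: str) -> float:
    """Assumed distance in [0, 1]: fraction of positions that differ."""
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("active sites must be non-empty and of equal length")
    return sum(a != b for a, b in zip(site_a, site_b)) / len(site_a)


def core_points(sites: Dict[str, str], eps: float, min_samples: int) -> List[str]:
    """Return the ids whose eps-neighbourhood holds at least min_samples points."""
    cores = []
    for name, site in sites.items():
        # The point itself is at distance 0, so it is always counted.
        neighbours = sum(
            1 for other in sites.values()
            if mismatch_fraction(site, other) <= eps
        )
        if neighbours >= min_samples:
            cores.append(name)
    return cores


if __name__ == "__main__":
    toy_sites = {"t1": "HDSG", "t2": "HDSG", "t3": "HDSA", "t4": "KKKK"}
    print(core_points(toy_sites, eps=0.25, min_samples=3))
    print(core_points(toy_sites, eps=0.1, min_samples=3))
```

In this toy example, lowering `eps` shrinks every neighbourhood, so fewer sequences qualify as core points for the same `min_samples`.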
How to deal with ASMC outputs
When the ASMC process is complete, you should get two important output files: groups_logo.png, displaying the sequence logos for each ASMC group in a column, and groups_x_min_y.tsv, from which you can extract further information:
- search for specific amino acids at a certain active site position (extract_aa.py).
- remove duplicated active site sequences (stats.py).
- compare with another groups_x_min_y.tsv file, e.g., between structure- and MSA-based clustering (compare_active_site.py).
- format as CSV to facilitate manual investigation in a spreadsheet program (groups_tsv_to_csv).
- visualize with PyMOL a specific target superimposed on its reference structure, by following the steps described here.
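As an illustration of the first task above (searching for specific amino acids at an active site position), the sketch below filters a fasta-formatted active site alignment, such as a per-group fasta file. It is not the extract_aa.py script, and it reads fasta rather than groups_x_min_y.tsv because the TSV columns are not described on this page. Positions are numbered from 1, as for `--weighted-pos`. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse fasta lines into {sequence id: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def ids_with_residue(records: Dict[str, str], position: int,
                     residues: str) -> List[str]:
    """Ids whose active site carries one of 'residues' at a 1-based position."""
    if position < 1:
        raise ValueError("positions are numbered from 1")
    wanted = {res.upper() for res in residues}
    return [
        name for name, site in records.items()
        if position <= len(site) and site[position - 1].upper() in wanted
    ]


if __name__ == "__main__":
    toy_fasta = ">t1\nHDSG\n>t2\nHESG\n>t3\nKD\nSA\n"
    sites = read_fasta(toy_fasta.splitlines())
    print(ids_with_residue(sites, 2, "D"))
```

For a real file, the same functions can be applied to an open file handle, e.g. `read_fasta(open("G2.fasta"))`.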