Tutorial - labgem/ASMC GitHub Wiki

Requirement

Installation and Configuration steps must be completed before this part.

Preface

This tutorial describes how to run ASMC on a family of homologous proteins, the Amine Dehydrogenases (AmDHs), when the active site residues are known (cf. ASMC with user-refined pocket).

A directory named tutorial/ is available at ASMC/docs/ and contains the following input files:

ADH4.pdb : PDB ID 6G1M, chain B.
DH35.pdb : PDB ID 6IAU, chain B.
DHP6.pdb : PDB ID 6IAQ, chain A.
MATA.pdb : PDB ID 7ZBO, chain A.
pocket.csv : list of amino acid residues considered as part of the active site.
sequences.fasta : a set of 954 protein sequences in FASTA format (950 AmDHs + 4 reference AmDHs).

The last required file, reference_file, is not provided and must be written as follows, replacing <path_to_ASMC> with the path to where the ASMC repository was downloaded:

<path_to_ASMC>/ASMC/docs/tutorial/ADH4.pdb
<path_to_ASMC>/ASMC/docs/tutorial/DH35.pdb
<path_to_ASMC>/ASMC/docs/tutorial/DHP6.pdb
<path_to_ASMC>/ASMC/docs/tutorial/MATA.pdb
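
The file can also be generated with a few lines of Python. The following is a minimal sketch, assuming the four PDB files sit in the tutorial directory as listed above. The function name write_reference_file and its parameters are hypothetical and are not part of ASMC.

```python
from pathlib import Path

# Reference structures shipped with the tutorial (names taken from the list above).
REFERENCE_PDBS = ["ADH4.pdb", "DH35.pdb", "DHP6.pdb", "MATA.pdb"]


def write_reference_file(tutorial_dir, output_name="reference_file"):
    """Write one absolute PDB path per line and return the output path.

    Hypothetical helper (not part of ASMC): tutorial_dir is expected to be
    <path_to_ASMC>/ASMC/docs/tutorial.
    """
    tutorial_dir = Path(tutorial_dir).resolve()
    lines = [str(tutorial_dir / name) for name in REFERENCE_PDBS]
    output_path = tutorial_dir / output_name
    output_path.write_text("\n".join(lines) + "\n")
    return output_path
```

For example, write_reference_file("<path_to_ASMC>/ASMC/docs/tutorial") creates reference_file next to the PDB files, with <path_to_ASMC> replaced by the actual download location.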

NB: if the active site is unknown, please consider this section.

Usage

Change the working directory to <path_to_ASMC>/ASMC/docs/tutorial/ and run ASMC, passing reference_file, pocket.csv and sequences.fasta with the -r, -p and -s options, respectively.

cd <path_to_ASMC>/ASMC/docs/tutorial/
python ASMC/run_asmc.py --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -s sequences.fasta

The whole process can be verified by checking the file run_asmc.log.

Once completed, the following output files are available:

models.txt : list of models generated by MODELLER.
identity_targets_refs.tsv : identity percentage between each protein sequence and its reference.
active_site_alignment.fasta : active site sequences for each protein, in FASTA format.
groups_0.12_min_5.tsv : clustering computed by DBSCAN, here with eps=0.12 and min_samples=5. Both values are determined automatically if not provided by the user.
GX.fasta : one FASTA file per DBSCAN group, here with X ranging from -1 to 3.
groups_logo.png : sequence logos for all DBSCAN groups (Figure 1), with the number of sequences per group indicated in the bottom right-hand corner.
models/ : directory including all the 3D models listed in models.txt.
pairwise/ : directory including all the structural pairwise alignments, computed by US-align, in FASTA format.
superposition/ : directory including all the PDB models, each with 3D coordinates aligned on its reference.
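
A quick way to confirm that a run produced everything in the list above is to check for each name. This is a minimal sketch, assuming the outputs are written to the directory passed as outdir (the tutorial does not state the output location, so adjust it as needed). The function missing_outputs is a hypothetical helper, not part of ASMC.

```python
from pathlib import Path

# Output names taken from the list above.
EXPECTED_FILES = [
    "models.txt",
    "identity_targets_refs.tsv",
    "active_site_alignment.fasta",
    "groups_logo.png",
]
EXPECTED_DIRS = ["models", "pairwise", "superposition"]


def missing_outputs(outdir="."):
    """Return the expected outputs that are absent from outdir."""
    outdir = Path(outdir)
    missing = [name for name in EXPECTED_FILES if not (outdir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (outdir / name).is_dir()]
    # The clustering table and group FASTA files have run-dependent names,
    # so they are matched by pattern rather than by exact name.
    if not any(outdir.glob("groups_*_min_*.tsv")):
        missing.append("groups_<eps>_min_<min_samples>.tsv")
    if not any(outdir.glob("G*.fasta")):
        missing.append("G<X>.fasta")
    return missing
```

An empty list means every expected file and directory was found.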

Proteins belonging to group -1 (G-1.fasta) must be considered as "outliers", since DBSCAN was unable to assign them to a cluster of at least 5 members. This does not mean these proteins are not interesting, and users are advised to consider them.
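
To see how many sequences ended up in each group, including the outliers, the headers of the group FASTA files can be counted. This is a minimal sketch, assuming the files follow the GX.fasta naming described above and sit in the directory passed as outdir. The function group_sizes is a hypothetical helper, not part of ASMC.

```python
import re
from pathlib import Path


def group_sizes(outdir="."):
    """Return {group_label: number_of_sequences} for every G<X>.fasta file."""
    sizes = {}
    for fasta in Path(outdir).glob("G*.fasta"):
        match = re.fullmatch(r"G(-?\d+)\.fasta", fasta.name)
        if match is None:
            continue  # skip files that do not follow the G<X>.fasta pattern
        with fasta.open() as handle:
            # Each FASTA record starts with a '>' header line.
            count = sum(1 for line in handle if line.startswith(">"))
        sizes[int(match.group(1))] = count
    return dict(sorted(sizes.items()))
```

The entry with key -1 gives the number of outlier sequences.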

If a group is large enough, users can try to split it into sub-clusters using the Re-Clustering procedure, by adjusting the --eps parameter.

Several Python scripts were designed to further analyze ASMC clusters; see the section How to deal with ASMC outputs for more details.

Figure 1 (to be added): groups_logo.png, sequence logos for all DBSCAN groups.