Learn Accurate Models with Persistent Contrastive Divergence - soedinglab/CCMgen GitHub Wiki
Inferring Couplings with CCMpredPy and Persistent Contrastive Divergence
We can learn the coupling parameters of the Markov Random Field model using a more precise inference method that needs slightly longer run time than pseudo-likelihood maximization. However, it is recommended to use evolutionary couplings inferred with persistent contrastive divergence with CCMgen to generate new, similar protein sequences:
ccmpred data/1atzA.fas --ofn-cd --persistent \
-b data/1atzA.braw.gz
The output generated by CCMgenPy will look as follows:
βββΈβββΈββ³ββββββββββΈβΊβ³βββββ» β» version 1.0.0
β β ββββ£βββ£β³ββ£βΈ βββ£ββββ³β Vorberg, Seemayer and Soeding (2018)
βββΈβββΈβΉ βΉβΉ βΉββΈβββΈβΊβ»ββΉ βΉ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Number of effective sequences after simple reweighting (id-threshold=0.8, ignore_gaps=False): 1149.44.
Calculating AA Frequencies with 0.00087 percent pseudocounts (uniform_pseudocounts 1 1)
Will optimize 2482125 float64 variables wrt persistent contrastive divergence:
nr of sampled sequences=500 (0.163xN and 0.435xNeff and 6.667xL) Gibbs steps=1
and Lβ regularization (Ξ»single=10 Ξ»pairfactor=0.2 Ξ»pair=14.8)
Optimizer: Gradient descent optimization (alpha0=0.001)
convergence criteria: maxit=2000 early_stopping=True epsilon=1e-05 prev=5
decay: decay_type=sig decay_rate=5e-06 decay_start=0.1
[ removed the per-iteration statistics for brevity ]
Finished with code 0 -- Stopping condition (xnorm diff < 1e-05) successfull.
Writing msgpack-formatted potentials to data/1atzA.braw.gz
We now have one additional file in the data/
directory: 1atzA.braw.gz
, the raw coupling potentials stored in binary MessagePack format for use with CCMgen.
Visualize Contact Map from Raw Couplings
A contact map can also be generated directly from the binary file of raw couplings with the --braw-file
flag like this:
ccm_plot cmap --braw-file data/1atzA.braw.gz \
--alignment-file data/1atzA.fas \
--pdb-file data/1atzA.pdb \
--plot-file data/1atzA.pcd.apc.html \
--seq-sep 4 --contact-threshold 8 \
--apc
This command will generate the following contact map:
Inferring MRF Model with Reduced Number of Constraints
If one wishes to generate diverse alignments with CCMgen, it is helpful to reduce the number of constraints in the MRF model that would otherwise trap Gibbs sampler in local optima leading to alignments of low diversity. For that purpose, it is possible to specify a reference structure with a PDB file and a distance threshold, so that residue pairs separated by C_beta distances larger than the threshold will obtain zero-couplings.
ccmpred data/1atzA.fas --ofn-cd --persistent \
-b data/1atzA.constrained.braw.gz \
--pdb-file data/1atzA.pdb --contact-threshold 12
Working with Raw Coupling Potential Files
It is possible to initialise potentials in CCMpredPy from a binary MessagePack file by using the flag --init-from-raw
, which might be of interest in order to restart optimization given a predefined model or in combination with the flag --do-not-optimize
to simply compute contact matrix files from the raw coupling potentials.
Furthermore, to work with the raw coupling files in your Python code you can use raw
module in the ccmpred package as follows:
# load the raw module
>>> import ccmpred.raw
#read raw coupling potentials stored in binary MessagePack format
>>> raw = ccmpred.raw.parse_msgpack("data/1atzA.braw.gz")
#the length of the protein (number columns in alignment)
>>> raw.ncol
75
#the dimensions of the single potentials
>>> raw.x_single.shape
(75, 20)
#the dimensions of the pairwise potentials
>>> raw.x_pair.shape
(75, 75, 21, 21)
#meta information about the CCMpredPy run that generated the model
>>> raw.meta
{'version': '1.0.0', 'method': 'ccmpred-py', 'workflow': [{'timestamp': '2018-06-04 14:39:35.902889',
'msafile': {'neff': 1149.4357714155785, 'nrow': 3068, 'ncol': 75, 'file': 'data/1atzA.fas', 'max_gap_pos': 100, 'max_gap_seq': 100}, 'pseudocounts': {'pseudocount_type': 'uniform_pseudocounts', 'pseudocount_n_single': 1, 'pseudocount_n_pair': 1, 'remove_gaps': False, 'pseudocount_ratio_single': 0.0008692358364079104, 'pseudocount_ratio_pair': 0.0008692358364079104}, 'weighting': {'type': 'simple', 'ignore_gaps': False, 'cutoff': 0.8}, 'results': {'opt_code': 2, 'opt_message': 'Reached maximum number of iterations', 'num_iterations': 50, 'runtime': 0.252776, 'fx_final': -1, 'out_binary_raw_file': 'data/1atzA.braw.gz'}, 'initialisation': {'single_potential_init': None, 'pair_potential_init': 'zero'}, 'regularization': {'regularization_type': None, 'regularization_scaling': None, 'single_prior': None, 'lambda_single': 10, 'lambda_pair': 0.2, 'lambda_pair_factor': 0.2}, 'optimization': {'method': 'gradientDescent', 'convergence': {'maxit': 50, 'early_stopping': False, 'epsilon': 1e-05, 'convergence_prev': 5}, 'decay': {'alpha0': 0.001, 'decay': False, 'decay_start': 0.0001, 'decay_rate': 10.0, 'decay_type': 'step'}, 'fix_v': False}, 'obj_function': {'name': 'ContrastiveDivergence', 'gibbs_steps': 1, 'nr_seq_sample': 500}}]}