Learn Accurate Models with Persistent Contrastive Divergence - soedinglab/CCMgen GitHub Wiki

Inferring Couplings with CCMpredPy and Persistent Contrastive Divergence

We can learn the coupling parameters of the Markov Random Field model using a more precise inference method that needs slightly longer run time than pseudo-likelihood maximization. However, it is recommended to use evolutionary couplings inferred with persistent contrastive divergence with CCMgen to generate new, similar protein sequences:

ccmpred data/1atzA.fas --ofn-cd --persistent \
	-b data/1atzA.braw.gz

The output generated by CCMgenPy will look as follows:

  ┏━╸┏━╸┏┳┓┏━┓┏━┓┏━╸╺┳┓┏━┓╻ ╻  version 1.0.0
  ┃  ┃  ┃┃┃┣━┛┣┳┛┣╸  ┃┃┣━┛┗┳┛  Vorberg, Seemayer and Soeding (2018)
  ┗━╸┗━╸╹ ╹╹  ╹┗╸┗━╸╺┻┛╹   ╹   https://github.com/soedinglab/ccmgen

Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Number of effective sequences after simple reweighting (id-threshold=0.8, ignore_gaps=False): 1149.44.
Calculating AA Frequencies with 0.00087 percent pseudocounts (uniform_pseudocounts 1 1)

Will optimize 2482125 float64 variables wrt persistent contrastive divergence: 
nr of sampled sequences=500 (0.163xN and 0.435xNeff and 6.667xL) Gibbs steps=1  
and L₂ regularization (λsingle=10 λpairfactor=0.2 λpair=14.8)
Optimizer: Gradient descent optimization (alpha0=0.001)
	convergence criteria: maxit=2000 early_stopping=True epsilon=1e-05 prev=5
	decay: decay_type=sig decay_rate=5e-06 decay_start=0.1 

[ removed the per-iteration statistics for brevity ]

Finished with code 0 -- Stopping condition (xnorm diff < 1e-05) successfull.

Writing msgpack-formatted potentials to data/1atzA.braw.gz

We now have one additional file in the data/ directory: 1atzA.braw.gz, the raw coupling potentials stored in binary MessagePack format for use with CCMgen.

Visualize Contact Map from Raw Couplings

A contact map can also be generated directly from the binary file of raw couplings with the --braw-file flag like this:

ccm_plot cmap --braw-file data/1atzA.braw.gz \
    --alignment-file data/1atzA.fas \
    --pdb-file data/1atzA.pdb \
    --plot-file data/1atzA.pcd.apc.html \
    --seq-sep 4 --contact-threshold 8 \
    --apc

This command will generate the following contact map:

1atzA.pcd.apc.png

Inferring MRF Model with Reduced Number of Constraints

If one wishes to generate diverse alignments with CCMgen, it is helpful to reduce the number of constraints in the MRF model that would otherwise trap Gibbs sampler in local optima leading to alignments of low diversity. For that purpose, it is possible to specify a reference structure with a PDB file and a distance threshold, so that residue pairs separated by C_beta distances larger than the threshold will obtain zero-couplings.

ccmpred data/1atzA.fas --ofn-cd --persistent \
    -b data/1atzA.constrained.braw.gz \
    --pdb-file data/1atzA.pdb --contact-threshold 12

Working with Raw Coupling Potential Files

It is possible to initialise potentials in CCMpredPy from a binary MessagePack file by using the flag --init-from-raw, which might be of interest in order to restart optimization given a predefined model or in combination with the flag --do-not-optimize to simply compute contact matrix files from the raw coupling potentials.

Furthermore, to work with the raw coupling files in your Python code you can use raw module in the ccmpred package as follows:

# load the raw module 
>>> import ccmpred.raw

#read raw coupling potentials stored in binary MessagePack format
>>> raw = ccmpred.raw.parse_msgpack("data/1atzA.braw.gz")

#the length of the protein (number columns in alignment)
>>> raw.ncol
75

#the dimensions of the single potentials
>>> raw.x_single.shape
(75, 20)

#the dimensions of the pairwise potentials
>>> raw.x_pair.shape
(75, 75, 21, 21)

#meta information about the CCMpredPy run that generated the model
>>> raw.meta
{'version': '1.0.0', 'method': 'ccmpred-py', 'workflow': [{'timestamp': '2018-06-04 14:39:35.902889', 
'msafile': {'neff': 1149.4357714155785, 'nrow': 3068, 'ncol': 75, 'file': 'data/1atzA.fas', 'max_gap_pos': 100, 'max_gap_seq': 100}, 'pseudocounts': {'pseudocount_type': 'uniform_pseudocounts', 'pseudocount_n_single': 1, 'pseudocount_n_pair': 1, 'remove_gaps': False, 'pseudocount_ratio_single': 0.0008692358364079104, 'pseudocount_ratio_pair': 0.0008692358364079104}, 'weighting': {'type': 'simple', 'ignore_gaps': False, 'cutoff': 0.8}, 'results': {'opt_code': 2, 'opt_message': 'Reached maximum number of iterations', 'num_iterations': 50, 'runtime': 0.252776, 'fx_final': -1, 'out_binary_raw_file': 'data/1atzA.braw.gz'}, 'initialisation': {'single_potential_init': None, 'pair_potential_init': 'zero'}, 'regularization': {'regularization_type': None, 'regularization_scaling': None, 'single_prior': None, 'lambda_single': 10, 'lambda_pair': 0.2, 'lambda_pair_factor': 0.2}, 'optimization': {'method': 'gradientDescent', 'convergence': {'maxit': 50, 'early_stopping': False, 'epsilon': 1e-05, 'convergence_prev': 5}, 'decay': {'alpha0': 0.001, 'decay': False, 'decay_start': 0.0001, 'decay_rate': 10.0, 'decay_type': 'step'}, 'fix_v': False}, 'obj_function': {'name': 'ContrastiveDivergence', 'gibbs_steps': 1, 'nr_seq_sample': 500}}]}