Protein Folding (Chignolin) with AMBER 16 - westpa/westpa GitHub Wiki

by Barmak Mostofian and Upendra Adhikari

 

Overview
In this tutorial, we use WESTPA to simulate the folding and unfolding of the small peptide chignolin and to calculate the corresponding rate constants. The problem of protein folding has received considerable attention in molecular biology and the molecular dynamics (MD) simulation of this process is very challenging due to the relatively long time scales involved. This challenge can be tackled with the weighted ensemble (WE) technique.

The decapeptide chignolin (sequence: GYDPETGTWG) forms a beta-hairpin structure. Thus, it represents a very common protein structural motif and it is small enough to repeatedly fold and unfold during ordinary "brute force" MD simulations, which provide a reference data set for comparison with WESTPA results. We run weighted-ensemble non-equilibrium steady-state simulations of chignolin folding and unfolding separately and we are interested in comparing the folding/unfolding dynamics of chignolin estimated via WESTPA to data from brute force (BF) simulation methods, demonstrating the correctness and potential usefulness of the weighted ensemble approach. (Note that WE simulations are not needed for chignolin, since this peptide exhibits sufficiently fast dynamics for good sampling by regular MD. WE tends to offer more of a performance advantage over regular MD for slower, more complex systems.) A key goal of this tutorial is to learn how to use the WESTPA flux analysis tools for the calculation of the rate constant of an event.

This tutorial assumes that the user has some basic knowledge of the AMBER simulation suite and has studied the WESTPA introductory tutorial. In the following, we discuss the chignolin simulation files and the parameters set to perform BF and WE simulations of chignolin folding and unfolding. Furthermore, we describe how to run the simulations and how to derive and interpret the rate constants of the simulated process. We would like to note up front that significantly more computing time is required for converged folding simulations and hence we suggest the user should start with the unfolding simulations.

 

Required Files
All required input files can be retrieved from here. These are three directories, 'WE_folding', 'WE_unfolding', and 'bruteforce', which contain the following necessary files for the corresponding simulations (these files can be AMBER-specific, WESTPA-specific, or VMD or Python analysis scripts).

 

File Description
chignolin.prmtop AMBER parameters and topology of chignolin
chignolin.pdb PDB-format coordinates of the folded chignolin crystal structure; required for all RMSD calculations
chignolin0.pdb PDB-format coordinates of the unfolded chignolin structure used as the starting structure of WE folding simulations; required for the initial state RMSD calculation of the WE folding simulation
chignolin0.rst7 input coordinates of the chignolin starting structure; required for running MD simulations with AMBER; this file is identical in the BF and WE unfolding simulations ('chignolin.pdb') but different in the WE folding simulation ('chignolin0.pdb')
md_bf.inp AMBER simulation config file used in the BF simulations; all parameters are explained as comments in the file
md0.inp AMBER simulation config file used only in the first iteration of the WE simulations (velocities are generated); all parameters are explained as comments in the file
md.inp AMBER simulation config file used in the WE simulations (velocities are taken from parent restart files); all parameters are explained as comments in the file
run_bf.sh BF simulation run script
run.sh WE simulation run script; NOTE: for continuing a WE simulation, the line './init.sh' must be commented out
get_rmsd_bf.tcl VMD analysis script for the chignolin C-alpha RMSD calculation in the BF simulation
get_rmsd_init.tcl VMD analysis script for the WE initial state chignolin C-alpha RMSD calculation
get_rmsd0.tcl VMD analysis script for the WE first iteration chignolin C-alpha RMSD calculation
get_msd.tcl VMD analysis script for the WE chignolin C-alpha RMSD calculation
get_mfpt.py Python analysis script for the mean first-passage times of chignolin folding and unfolding events in the BF simulation
get_mean_rate.py Python analysis script converting the probability flux reaching the target state in the WE simulations to window-averaged rate constants
west.cfg WE simulation config file (the maximum run time parameter, 'max_run_wallclock', needs to be adjusted to the user’s system)
system.py WE simulation driver file
env.sh WE simulation environment setting file
init.sh WE simulation initiation file
get_pcoord.sh WE simulation initial state progress coordinate
runseg.sh WE simulation segment run and progress coordinate file

 

System Details
A beta-hairpin chignolin structure, originally retrieved from the Protein Data Bank (ID: 1UAO), serves as the starting structure for the BF and for the WE unfolding simulations (the starting structure for the WE folding simulations can be selected from the BF simulation). The parameters were derived from the recently optimized AMBER force field 'ff14SBonlysc'. The simulations are performed with the Generalized Born implicit solvent model at T = 275 K. All simulations are performed with the AMBER program ‘sander‘ and the analysis is done with the VMD software.1

 

Brute-Force Simulations
We first run a “brute force” (BF) simulation, meaning an ordinary MD run, of chignolin. In general, running a BF simulation prior to using WESTPA is important for understanding time requirements (how much wall clock per nanosecond) but also to ensure the starting structure is reasonable in the force field and solvent model being used. Moreover, it allows us to choose an appropriate progress coordinate for the WESTPA simulations. We analyze the C-alpha root mean-square deviation (RMSD) of chignolin during the BF simulation, which we want to use as the progress coordinate in WESTPA simulations. For folding studies, in particular, the BF simulation provides information about the degree of fluctuations in the native state, and hence guidance for defining the target folded state in WESTPA simulations. In the case of chignolin, which both folds and unfolds on timescales accessible to ordinary MD, the BF simulation also provides information on defining the unfolded state.

 

We perform a 4 μs long BF simulation of chignolin and write out coordinates every 20 ps. The user can change these parameters in the MD config file 'md_bf.inp'. The BF simulation can be submitted with the following command:

./run_bf.sh
(This submission script may have to be adjusted to the user’s computing platform.)

The BF simulation required ~1.8 microsec/day on a single GPU card in a previously performed sample simulation (note that, for better performance, the AMBER program ‘pmemd.cuda’ has been used for the sample simulation; sander would have required ~70 ns/day on 2x28 CPUs). All AMBER output files from this simulation can be found in the subdirectory 'sample_simulation'.

 

The chignolin C-alpha RMSD can be computed in the following way:

vmd    -dispdev text    -e get_rmsd_bf.tcl
(This script assumes the BF simulation trajectory as well as the chignolin parameter/topology and crystal structure pdb files are all in the current directory.)
The output RMSD data file, 'rmsd.dat', is a single-column file of one RMSD value per line that corresponds to the time evolution of the chignolin C-alpha RMSD over the course of the simulation.

 

Figure 1 shows the C-alpha RMSD over simulation time for the brute-force simulation starting from the folded beta-hairpin. This informs us that several unfolding and re-folding events take place within 4000 ns and that the peptide can assume RMSD values >4 Å when unfolding and <0.5 Å upon re-folding. As indicated in Figure 1, we choose these threshold values to define folded and unfolded structures, respectively. This choice is based on visual inspection of the RMSD plot and simulation structures, which show a fully formed β-sheet and native hydrogen bonds at RMSD < 0.5 Å and have a disturbed β-sheet with broken native hydrogen bonds at RMSD > 4 Å (this pair of RMSD values will also be used later to define target states in WESTPA simulations). Note that the (un)folding rate constants will be sensitive to the state definitions, and defining states is a challenging process beyond the scope of this tutorial. Our choices are indeed subjective and perhaps not optimal for some purposes. Our state definitions are designed to avoid potential recrossing artifacts in rate calculations: once a trajectory reaches a state it should tend to remain there, rather than immediately returning to the previous state.



Figure 1
Figure1: C-alpha RMSD vs simulation time for the brute force simulation of chignolin.

 

The rate of (un-)folding can be derived from this BF simulation simply by tracking the (un-)folding events during the given simulation time. More specifically, we can quantify the rate constant for either process as the inverse of its mean first-passage time (MFPT), where, for instance, the FPT for unfolding is the time required to reach the unfolded threshold (RMSD > 4 Å) after first folding, i.e. reaching RMSD < 0.5 Å. The opposite holds for the derivation of folding rate constants.
The user can run the following script to obtain the MFPTs for both folding and unfolding (the folding and unfolding MFPTs will be written to the standard output along with uncertainty values, which are derived from a Bayesian bootstrapping procedure,2 and can be used as error bars):

python    get_mfpt.py    rmsd.dat    20e-12    0.5    4.0
(The command-line arguments are the RMSD data file, the time step of the RMSD values in seconds, and the folded and unfolded threshold values.)

Based on this, the rate constant of unfolding is estimated to be 0.13 x 108 s-1 (confidence interval: 0.09 x 108 s-1 – 0.18 x 108 s-1) and that of folding is estimated to be 0.71 x 107 s-1 (confidence interval: 0.44 x 107 s-1 – 1.24 x 107 s-1).

 

Selection of WESTPA Parameters
To compare the folding and unfolding rate constants to those obtained from BF simulations, we conduct two separate non-equilibrium steady-state WESTPA simulations. The target state of the folding simulation will be used as the initial state of the unfolding simulation and vice versa. By design, this approach leads to a non-equilibrium flux of probabilities from the corresponding starting to target states, which can be obtained faster than through equilibrium simulations. Moreover, the separation of the chignolin folding and unfolding into two different simulations allows us to set the parameters for either process (e.g. progress coordinate bin widths) in a more process-specific way.

Selection of a progress coordinate for WE simulation is not automated and requires user specification. Currently, WESTPA users typically use intuition along with a trial-and-error process to determine a set of bins sufficient for observing the process of interest. Note that better bins can lead to more precise estimation of observables based on a fixed computational budget.
As mentioned above, an adequate progress coordinate for our illustration with the chignolin peptide is the RMSD of C-alpha atoms with respect to those of the folded structure. Although the RMSD with respect to a single reference structure is generally not an ideal coordinate for describing states, it proves sufficient for our example. (Note that the RMSD does not specifically account for structural details that may be important in defining a given state.) Folded and unfolded states are defined based on maximum and minimum RMSD values, respectively, as noted above in the discussion of BF simulation.

Bin sizes and other WE parameters also tend to be chosen in a heuristic trial-and-error fashion. Here, we found a bin width of 0.2 Å to work well. However, the very first bin for the unfolding simulations is larger than the regular bin width with RMSD = [0 Å, 0.5 Å] because any structure with RMSD < 0.5 Å is considered to be in the folded initial state. Analogously, for the folding simulations, the very last bin is larger than the regular bin width of 0.2 Å.

Four simulation trajectories are started in every bin at every iteration. Every iteration segment simulation is τ = 20 ps long because this choice led to transition events with the bin set chosen, in a trial-and-error process. In order to reach a steady value of the reactive flux (probability arriving to target) and hence the rate constant, the unfolding simulations span 1,000 iterations but the folding simulations take roughly 10,000 iterations.

 

WESTPA Simulations and Analysis
The user can run the WESTPA unfolding simulation in the corresponding directory by executing the following script:

./run.sh
(The line './init.sh' initializes the WESTPA simulation and it must be commented out for continuing a WESTPA simulation; otherwise the simulation starts again from scratch replacing previously obtained results.)

This WE unfolding simulation required ~45 hours for 1000 iterations on 32 CPUs in a previously performed sample simulaton. All WESTPA output files from this simulation can be found in the subdirectory 'sample_simulation'.

 

The mean flux (probability per unit time reaching the target state) of such a steady-state implementation is the rate of interest here. According to the Hill relation,3 it is exactly the inverse MFPT and, as mentioned above, it corresponds to the rate constant of the underlying process The user can run the WESTPA analysis tool 'w_fluxanl' to extract this value in the following way:

$WEST_ROOT/bin/w_fluxanl

The output is the file 'fluxanl.h5', which contains the instantaneous probability flux into the target state at any iteration. We are interested in average values at any iteration and the following python script calculates, for any iteration, the mean rate constant based on the corresponding flux arriving in the target state over a preceding window of molecular simulation times (e.g., over 1 ns):

python    get_mean_flux.py    20e-12    1e-9
(The command-line arguments are the WESTPA segment simulation length, τ, and the time width for window-averaging. Both arguments are in seconds as the time unit.)

 

The weights used in WESTPA enable estimation of the exact rate constant, which should match that obtained from averaging FPTs obtained from BF simulation. One subtle difference is that the WESTPA calculation estimates the MFPT based on the chosen initial structure(s) which may not correspond precisely to the ensemble of starting structures implicit in extracting first-passage events from a long MD run.

 

Figure 2 shows the evolution of the average chignolin unfolding rate constant as a function of molecular simulation time during three independent WE simulations. All individual runs settle down after a few ns to an unfolding rate constant of 107 to 108 s-1 roughly comparable to that derived from BF simulations. The above-mentioned difference in rate calculation between the BF and the WE simulations may be responsible for the slightly different results obtained: note that a 3-fold difference in rate constants amounts to only ~0.59 kcal/mol difference in the effective free energy barrier to unfolding (at the simulation temperature of 275 K).



Figure 2
Figure 2: Estimating the unfolding rate constant of chignolin. The 1 ns window-averaged unfolding rate constant is shown in a semi-logarithmic plot for three independent WESTPA simulations (black, red, and green profiles) with the same folded starting structure (see lower left). The corresponding rate constant from the BF simulation is indicated by the horizontal blue line and its confidence interval by the shaded region. The molecular time corresponds to the simulation time of a continuous trace of simulation segments and is different from the total aggregate time of all simulation segments, ~1300 ns on average for each simulation.

 

WESTPA folding simulations and average flux analysis can be performed similar to the unfolding simulations. However, significantly more simulation time must be used in order to reach a converged steady-state of fluxes. One previously performed sample simulation required ~7 days for 10000 iterations (corresponding to 200 ns of molecular time) on 32 CPUs and its output files can be found in the corresponding subdirectory 'sample_simulation'.

Figure 3 shows the evolution of the average chignolin folding rate constant as a function of molecular simulation time during three independent WE simulations. Compared with unfolding simulations, the traces exhibit significantly larger fluctuations, even after the apparent transient period of the first ~100 ns. This indicates that the chosen bins are less ideally suited for the folding process, which arguably is more complicated and challenging to sample compared to unfolding. During the beta-hairpin folding distinct hydrogen bonds must be formed between the neighboring anti-parallel strands, and possibly in a specific order, so that the peptide assumes an RMSD value of <0.5 Å. Unfolding, in contrast, does not have to be achieved in a systematic way, probably by simultaneous hydrogen bond breaking, in order to reach RMSD > 4 Å. This difference is captured by the faster convergence of unfolding rate constants than those of the folding process. Once again, the agreement with brute-force simulation is rough, but note also that the average WESTPA behavior will be dominated by the largest flux (rate estimate).



Figure 3
Figure 3: Estimating the folding rate constant of chignolin. The 20 ns window-averaged folding rate constant is shown in a semi-logarithmic plot for three independent WESTPA simulations (black, red, and green profiles) with the same unfolded starting structure (see lower left). Note the significantly longer molecular and aggregate simulation times for each simulation to obtain converged rates of folding compared to unfolding (see Figure 2). The corresponding rate constant from the brute-force simulation is indicated by the horizontal blue line and its confidence interval by the shaded region.

 

The WE simulations contain multiple trajectories that may cover different regions of conformational space at any given time. In the non-equilibrium chignolin (un-)folding simulations, those continuous trajectories that reach the target state may be of interest because they passed through the entire process. WESTPA allows identification and reconstruction of all such trajectories with the following set of commands.

python    get_target_trajs.py    1    10000
(The command-line arguments indicate the first and last iteration number to be considered.)

The output file 'target_trajs.dat' has two columns: one with the iteration number and one with the specific simulation (or segment) number that reached the target state at that iteration. Thus, the number of rows indicates the total number of simulations that have reached the target state. For instance, when this analysis is applied to the previously performed simulation ('WE_folding/sample_simulation'), the very first line of the output file indicates that the segment with the ID number 12 at iteration number 260 has reached the folding target state. This means that, in this instance, the unfolded protein has been fully folded (i.e., C-alpha RMSD < 0.5 Å) after 260 x 20 ps = 5.2 ns. Its full path can be constructed by tracing it back with the WESTPA analysis tool 'w_trace'.

$WEST_ROOT/bin/w_trace    260:12    -o target_traj_260_12.h5
(The command line arguments indicate the specific iteration and segment number to be traced back and the output h5 file. This operation takes longer the more iterations are specified to be traced back.)

The output text file 'traj_260_12_trace.txt' provides the following details for this specific unfolding simulation process where each line corresponds to a certain iteration (column 1) starting with iteration number 0, i.e. the provided initial state: the specific segment ID at that iteration (column 2), the weight this simulation segment carried (column 3), its simulation wallclock and CPU times (columns 4 and 5), and the progress coordinate at the end of that simulation segment (column 6). Of course, the progress coordinate of the initial state is that of the starting structure (RMSD = 4.2 Å) and that of the final iteration number (260) falls within the target state bin (RMSD = 0.47 Å ∈ [0 Å, 0.5 Å]).

The information provided by 'w_trace' can be used in a variety of ways to explore the underlying pathway of a continuous folding trajectory. In particular, for a visual inspection, one can concatenate the given trajectory segments, which are usually saved in the WESTPA subdirectory 'traj_segs' (NOT provided for the previously performed sample simulation), with the following script.

python    merge_trace_trajs.py    260:12
(This command makes use of the VMD plugin binary 'catdcd'.)

The output dcd trajectory file can be visualized in a molecular viewing program such as vmd. One realizes that initial folding events occur relatively fast within the first iterations but the final folding events require much longer to take place.

 

References
(1) Humphrey, W., Dalke, A. and Schulten, K., "VMD - Visual Molecular Dynamics", J. Molec. Graphics, 1996, vol. 14, pp. 33-38.
(2) Mostofian, B.; Zuckerman, D. M. Error Analysis for Small-Sample, High-Variance Data: Cautions for Bootstrapping and Bayesian Bootstrapping. arXiv:1806.01998 2018, 1–15.
(3) Hill, T. L. State Probabilities and Fluxes in Terms of the Rate Constants of the Diagram. In Free Energy Transduction and Biochemical Cycle Kinetics; Springer New York: New York, NY, 1989; pp 39–88.

⚠️ **GitHub.com Fallback** ⚠️