Data management, network, and disk usage - westpa/westpa GitHub Wiki
A single weighted ensemble simulation may generate multiple terabytes of data, presenting difficulties for storage and retrieval of data. Moreover, short trajectory segments in weighted ensemble simulations commonly result in large numbers of small files, which are managed more slowly on some file systems than a smaller number of large files with the same overall disk size. To alleviate these potential issues, we recommend the following:
- Delete unnecessary files as each trajectory segment is simulated. Unnecessary files may include input files, log files from analysis tools, and raw text output files from analysis tools. Often, useful data in log files (e.g., temperature from a molecular dynamics simulation) can be extracted and saved as auxiliary data in the WESTPA data file (`west.h5`), which stores data more efficiently than raw text.
- Tar and optionally compress data from each weighted ensemble iteration. This strikes a balance between excessive file count and excessive file size, either of which is typically sub-optimal for long-term storage, especially on tape systems that may not guarantee the integrity of large files.
- For molecular dynamics simulations, consider saving atomic coordinates for solute atoms to an HDF5 file, for each weighted ensemble iteration.
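The per-iteration archiving recommended above can be sketched in a few lines of Python. This is a minimal illustration, not part of WESTPA itself: the function name `archive_iteration` is hypothetical, and the layout it assumes (one zero-padded directory per iteration under a root such as `traj_segs/`) is an assumption that should be adapted to your own directory structure. It uses only the Python standard library.

```python
import shutil
import tarfile
from pathlib import Path


def archive_iteration(traj_root: Path, n_iter: int, compress: bool = True) -> Path:
    """Pack one iteration directory into a single tar archive, then remove it.

    Assumes (hypothetically) that iteration data live in
    traj_root/<n_iter as 6 digits>/, e.g. traj_segs/000042/.
    """
    iter_dir = traj_root / f"{n_iter:06d}"
    suffix, mode = (".tar.gz", "w:gz") if compress else (".tar", "w")
    archive = traj_root / f"{n_iter:06d}{suffix}"

    # Write every file of the iteration into one archive.
    with tarfile.open(archive, mode) as tar:
        tar.add(iter_dir, arcname=iter_dir.name)

    # Check that each regular file is present in the archive before deleting.
    expected = {
        p.relative_to(traj_root).as_posix()
        for p in iter_dir.rglob("*")
        if p.is_file()
    }
    with tarfile.open(archive, "r") as tar:
        stored = {m.name for m in tar.getmembers() if m.isfile()}
    if expected != stored:
        raise RuntimeError(f"archive {archive} is incomplete; keeping {iter_dir}")

    # Only now remove the many small files.
    shutil.rmtree(iter_dir)
    return archive
```

A call such as `archive_iteration(Path("traj_segs"), 42)` would leave a single `000042.tar.gz` in place of the iteration directory. Only archive iterations that the running simulation no longer needs to read, since later iterations may require restart files from their parent segments.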
Large weighted ensemble simulations running over multiple compute nodes may consume substantial network bandwidth as files, progress coordinate data, and auxiliary data are passed back and forth between compute nodes. Such weighted ensemble simulations may also heavily utilize storage devices, possibly resulting in "I/O-bound" simulations. We recommend the following to alleviate these potential pitfalls:
- Simulate each trajectory segment in scratch space local to that segment's compute node. After the segment completes, copy any necessary data to globally-accessible storage using an efficient protocol such as `rsync`.
- Copy repeatedly accessed files, such as reference structures and analysis scripts, to local scratch space.
- To avoid repeatedly searching the file tree for repeatedly used programs, set environment variables to the absolute paths of those programs before starting a WESTPA client (e.g., via the ZMQ work manager), and ensure that each child process inherits these environment variables.
- Consider the impact of short values of τ (the replication/combination interval) on efficiency of resource usage. Short values of τ may result in excessive calls to the file system to read restart files and simulation executables.
- Benchmark your weighted ensemble simulation against a standard simulation to determine the overhead due to I/O and network usage. Overhead will vary with system architecture and simulation requirements, but it is not uncommon for overhead to be < 10%.
- For highly I/O intensive simulations, consider performing the simulation in volatile memory via `tmpfs`, and carefully assess whether there is benefit to running over multiple compute nodes.
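Two of the recommendations above, resolving program paths once and running each segment in node-local scratch space, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than WESTPA code: the helper names `export_program_paths` and `run_segment_in_scratch` and all of their parameters are hypothetical, and `shutil.copy2` is used as a portable stand-in for `rsync`. The `scratch_root` argument could point at node-local scratch or at a `tmpfs` mount, depending on your system.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def export_program_paths(programs, env=None):
    """Resolve each program once and store its absolute path in an environment.

    `programs` maps a (hypothetical) variable name to a program name,
    e.g. {"MD_ENGINE": "my_md_engine"}. Returns a new environment dict.
    """
    env = dict(os.environ if env is None else env)
    for var, name in programs.items():
        path = shutil.which(name, path=env.get("PATH"))
        if path is None:
            raise FileNotFoundError(f"{name} not found on PATH")
        env[var] = os.path.abspath(path)
    return env


def run_segment_in_scratch(command, inputs, keep, dest, scratch_root=None, env=None):
    """Run one segment in a temporary scratch directory and copy back results.

    inputs: files copied into scratch before the run.
    keep:   file names (relative to scratch) copied to `dest` afterwards.
    Everything else is discarded with the scratch directory.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        work = Path(workdir)
        for src in inputs:
            shutil.copy2(src, work / Path(src).name)
        # Child processes inherit the resolved paths through `env`.
        subprocess.run(command, cwd=work, env=env, check=True)
        for name in keep:
            shutil.copy2(work / name, dest / name)
    return dest
```

In this arrangement the environment would be built once, before the workers start, and then passed to every segment so that no child process has to search the file tree again. Only the files listed in `keep` ever reach shared storage, which also covers the earlier advice to delete unnecessary files as each segment completes.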