Dataset and curation

Dataset module

The dataset module provides functions and classes to load datasets from curated HDF5 files and expose them as torch Dataset or Lightning DataModule instances for training NNPs. It implements data storage, caching, and retrieval, as well as the pipeline from the stored hdf5 files to the PyTorch dataset class used for training. The general workflow for interacting with public datasets is as follows:

  1. obtaining the dataset
  2. processing the dataset and storing it in an hdf5 file with standard naming and units
  3. uploading to Zenodo and updating the retrieval link in the dataset implementation

Specific dataset classes such as QM9Dataset or SPICEDataset download an hdf5 file with defined key names and values in a particular format from Zenodo and load the data into memory. The values in the dataset must be specified in the [openMM unit system](http://docs.openmm.org/6.2.0/userguide/theory.html#units).

The public API for creating a TorchDataset is implemented in the specific data classes (e.g., QM9Dataset) and the DatasetFactory. The resulting TorchDataset can then be loaded with a PyTorch DataLoader, as sketched below.
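For example, a minimal sketch of that path could look like the following; the import paths and constructor arguments are assumptions based on the module outline below, not the verified API:

from torch.utils.data import DataLoader

from modelforge.datasets.qm9 import QM9Dataset          # assumed import path
from modelforge.datasets.dataset import DatasetFactory  # assumed import path

data = QM9Dataset()  # hypothetical constructor; obtains the curated hdf5 file
torch_dataset = DatasetFactory.create_dataset(data)  # wrap the HDF5Dataset as a TorchDataset

# a dataset-specific collate_fn may be needed to batch molecules with different atom counts
dataloader = DataLoader(torch_dataset, batch_size=64, shuffle=True)

for batch in dataloader:  # each sample is a Dict[str, torch.Tensor]
    ...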

modelforge
  datasets/ # defines the interaction with public datasets
    dataset.py
      TorchDataset(torch.utils.data.Dataset) # A custom dataset class to wrap numpy datasets for PyTorch.
        * __init__(self, dataset: np.ndarray, property_name: PropertyNames)
        * __len__()
        * __getitem__(self, idx: int) -> Dict[str, torch.Tensor]
      HDF5Dataset() # Base class for data stored in HDF5 format.
        * _{to|from}_file_cache() # write/read high-performance numpy cache file (can change a lot)
        * _from_hdf5() # read our HDF5 format (reproducible and archival) (also supports gzipped files)
        * _perform_transformations(label_transform: Optional[Dict[str, Callable]], transforms: Dict[str, Callable]) # transform any entry of the dataset using a custom function
      DatasetFactory # Factory class for creating Dataset instances.
        * create_dataset(data: HDF5Dataset, label_transform: Optional[Dict[str, Callable]], transform: Optional[Dict[str, Callable]]) -> TorchDataset # Creates a TorchDataset instance given an HDF5Dataset.
      TorchDataModule(pl.LightningDataModule) # A custom data module class to handle data loading and preparation for PyTorch Lightning training.
        * __init__(self, data: HDF5Dataset, SplittingStrategy: SplittingStrategy, batch_size)
        * prepare_data()
        * setup()
        * {train|val|test}_dataloader() -> DataLoader
    {qm9|spice|phalkethoh|ani1x|ani2x|tmqm}.py
      QM9Dataset(HDF5Dataset) # Data class for handling QM9 data.
        * properties_of_interest() -> List[str] # [getter|setter], entries in dataset that are retrieved
        * available_properties() -> List[str] # list of available properties in the dataset
        * _download() # Download the hdf5 file containing the data from source.
    transformation.py # transformation functions applied to entries in dataset
      * default_transformations
    utils.py
      RandomSplittingStrategy(SplittingStrategy)
        * split(dataset: TorchDataset) -> Tuple[Subset, Subset, Subset] # Splits the provided dataset into training, validation, and testing subsets
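
Similarly, the Lightning path (TorchDataModule together with a SplittingStrategy) might be used roughly as follows. As above, the import paths and exact call signatures are assumptions based on this outline, not the verified API:

from modelforge.datasets.dataset import TorchDataModule        # assumed import path
from modelforge.datasets.qm9 import QM9Dataset                 # assumed import path
from modelforge.datasets.utils import RandomSplittingStrategy  # assumed import path

data = QM9Dataset()  # hypothetical constructor, as in the sketch above
data_module = TorchDataModule(data, RandomSplittingStrategy(), batch_size=64)
data_module.prepare_data()   # download and cache the underlying data
data_module.setup()          # build the train/val/test splits
train_loader = data_module.train_dataloader()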

Curate module (modelforge.curate)

This section provides information related to the modelforge hdf5 schema 2.

The curate module provides functionality to retrieve source datasets and generate HDF5 datafiles with a consistent format and units, which are then loaded by the dataset module. This module puts an emphasis on validation at the time of constructing/curating the dataset to avoid inconsistencies or errors in the final hdf5 file.

The purpose of including this module in the package is to encapsulate all routines used to generate the input datafile, including any and all manipulation of the underlying data (e.g., unit conversion, summing of quantities, calculation of reference energy, etc.), to ensure transparency and reproducibility.

For efficient data writing/reading, configurations of the same system are grouped together into a single entry. The HDF5 files generated by the curate module have:

  • the units associated with each entry, defined in the "u" attribute of each quantity. This uses openff-units compatible names.
  • the property type of the data represented by the array, stored as the attribute "property_type": this defines whether a quantity represents, e.g., length, energy, force, etc. This is used for unit conversion/validation when using the datafile.
  • the format of the data inside the array, stored as the attribute "format". This can take on the value "atomic_numbers", "per_system", "per_atom", or "meta_data". The "format" attribute allows the dataloader to be more general and thus work with different datasets where the names of the underlying quantities may vary, as the dataloader uses this information to know how to parse arrays (see the shape sketch after this list).
    • "atomic_numbers" -- the atomic numbers of the system with shape [n_atoms, 1]. e.g., methane: [6], [1], [1], [1], [1](/choderalab/modelforge/wiki/6],-[1],-[1],-[1],-[1). Note, since modelforge does not allow the atomic_number array to change within a system, this array applies to all configurations
    • "per_system" -- a property that applies to the entire system being considering which will have shape [n_configurations, *], where axis=0 allows us to index into the value for a given configuration. E.g., energy is an example of a per_system property that would have shape [n_configurations, 1].
    • "per_atom" -- a property that applies to individual atoms, with shape [n_configurations, n_atoms, *]. e.g., positions are a per_atom property with shape [n_configurations, n_atoms, 3].
    • "meta_data" -- additional data about the system (e.g., smiles); this type of data is not used by the data loader at this time.

A full description of the curate module can be found in the documentation: https://modelforge.readthedocs.io/en/latest/curate.html

Data format:

Note: the following is outdated and will be updated:

As an example, let us load the first record of the QM9 dataset:

from modelforge.curation.qm9_curation import QM9Curation

qm9_dataset = QM9Curation(
    hdf5_file_name="qm9_dataset.hdf5",
    output_file_dir="datasets/hdf5_files",
    local_cache_dir="datasets/qm9_dataset_raw",
)

qm9_dataset.process(max_records=1)  # process only the first record

In all the curated datasets, a list named data is generated. Each entry in the list corresponds to a specific molecule, where the molecule information is stored as a dictionary.

For example, we can access all of the properties stored in the dataset as follows:

for data_point in qm9_dataset.data:
    for key, val in data_point.items():
        print(f"{key} : {val} : {qm9_dataset._record_entries_series[key]}")

Note this also accesses the _record_entries_series dictionary in the dataset, which stores the descriptor discussed above.

Let us examine a small selection of the stored data to discuss the specific format and common elements between all datasets.

In all datasets, each entry in the data list will contain several keys:

  • name -- unique identifying string of the molecule, typically taken from the original dataset
  • n_configs -- number of configurations/conformers for the molecule
  • atomic_numbers -- array of atomic numbers (in order) of the molecule.
  • geometry -- array of atomic positions of the conformers

name and n_configs are both considered to be of format single_rec, as these values apply to all data in the molecule and are not conformer dependent:

name : dsgdb9nsd_000001 : <class 'str'> : single_rec
n_configs : 1 : <class 'int'> : single_rec

atomic_numbers is marked as single_atom: this array applies to all conformers (the order of the atomic indices cannot change), but it is also a per-atom property, which is why we consider it single_atom rather than single_rec. Note, as can be seen below, the shape of atomic_numbers is (n_atoms, 1), where in this case n_atoms=5. We define this as (n_atoms, 1) instead of (n_atoms) for consistency with other per-atom properties:

atomic_numbers : 
[
 [6]
 [1]
 [1]
 [1]
 [1]
] 
<class 'numpy.ndarray'>
single_atom
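
For illustration, a flat array of atomic numbers can be put into this (n_atoms, 1) column form with numpy:

import numpy as np

atomic_numbers = np.array([6, 1, 1, 1, 1]).reshape(-1, 1)  # shape (5, 1)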

The geometry is of format series_atom, as there is a unique set of coordinates for each conformer. This is of shape (n_configs, n_atoms, 3), which, since n_configs=1, is (1, 5, 3) here. Note that this is a numpy.ndarray with units attached (using openff-units, which is based on pint).

geometry : 
[[
  [-0.0012698135899999999 0.10858041577999998 0.00080009958]
  [0.00021504159999999998 -0.0006031317599999999 0.00019761204]
  [0.10117308433 0.14637511618 2.7657479999999996e-05]
  [-0.05408150689999999 0.14475266137999998 -0.08766437151999999]
  [-0.05238136344999999 0.14379326442999998 0.09063972942]
]] nanometer : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_atom
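
For illustration, a unit-bearing array of this kind can be constructed with openff-units (which builds on pint); the zero values below are placeholders:

import numpy as np
from openff.units import unit

# positions of shape (n_configs, n_atoms, 3) with units attached; the result is a pint Quantity
geometry = np.zeros((1, 5, 3)) * unit.nanometer

print(geometry.magnitude.shape)  # (1, 5, 3)
print(geometry.units)            # nanometer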

Note, for data of format series_atom, the final dimension is variable. For example, the charges in this dataset are series_atom, but only a single charge is associated with each atom, rather than a vector of shape 3. Hence, we have an entry of shape (n_configs, n_atoms, 1).

charges : 
[[
 [-0.535689]  
 [0.133921]  
 [0.133922]  
 [0.133923]  
 [0.133923]
]] elementary_charge : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_atom

Datasets will also contain information about the energy, although its name will depend on the dataset itself. For example, in QM9 we have internal_energy_at_0K, which is of format series_mol, meaning there is a single unique value for each conformer, hence shape (n_configs, 1) in this case:

internal_energy_at_0K : 
[
 [-106277.4161215308]
] kilojoule_per_mole : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_mol

Again, the last dimension of series_mol entries is variable (and will be inferred during data loading); it can represent not just a single float value per molecule, but also a vector. For example, the harmonic vibrational frequencies are of shape (n_configs, 9) in this case:

harmonic_vibrational_frequencies : 
[
 [1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034  3151.6788 3151.7078]
] / centimeter : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_mol

This data array, along with the "format" information, is written to an HDF5 file in roughly the same general structure. HDF5 files can be accessed in a very similar fashion to dictionaries using h5py. The key differences in the data structure are as follows: the name field is used to create a top-level key in the HDF5 file, with the properties stored one level below. Units are no longer attached to values/arrays, but instead stored in the attributes (attrs) associated with each property; the format (e.g., series_mol) is also stored as an attribute. A sketch of the hierarchy is as follows:

1- name
2-- property
3--- attrs: units as "u", format
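
As a rough sketch of this layout (not the actual curation code; the file name and values are placeholders), writing a single property with h5py could look like:

import h5py
import numpy as np

with h5py.File("example.hdf5", "w") as h5:
    record = h5.create_group("dsgdb9nsd_000001")  # level 1: the record name

    # level 2: the property, stored as a dataset under the record group
    geometry = record.create_dataset("geometry", data=np.zeros((1, 5, 3)))

    # level 3: attributes holding the units (as a string) and the format descriptor
    geometry.attrs["u"] = "nanometer"
    geometry.attrs["format"] = "series_atom"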

The following script demonstrates how to access the data (although in general, users will not need to directly access files, as these will be automatically loaded in the dataset classes).

import h5py

filename = "datasets/hdf5_files/qm9_dataset.hdf5"

with h5py.File(filename, "r") as h5:
    for molecule_name in h5.keys():
        print("molecule_name:", molecule_name)

        for property_name in h5[molecule_name].keys():
            prop = h5[molecule_name][property_name]
            print("-Property:", property_name)
            print(prop.attrs["format"])
            if "rec" not in prop.attrs["format"]:
                # only print the shape for array-valued (non single_rec) entries
                print(prop.shape)
            print(prop[()])
            if "u" in prop.attrs:
                print(prop.attrs["u"])

The first few outputs are as follows:

molecule_name: dsgdb9nsd_000001
-Property: atomic_numbers
single_atom
(5, 1)
[[6]
 [1]
 [1]
 [1]
 [1]]
-Property: charges
series_atom
(1, 5, 1)
[[[-0.535689]
  [ 0.133921]
  [ 0.133922]
  [ 0.133923]
  [ 0.133923]]]
elementary_charge

Note, in this format, units are written as strings; openff-units allows these to be easily reattached to the quantity of interest simply by passing the string to Quantity.

from openff.units import Quantity

value_without_units = h5[molecule_name][property][()]
units_string = h5[molecule_name][property].attrs["u"]

value_with_units = value_without_units * Quantity(units_string)
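
Since the result is a pint-backed quantity, unit conversion is then straightforward, e.g., for a length-like quantity such as geometry:

value_in_angstrom = value_with_units.to("angstrom")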

[Images: dataset figures for PhalKetHOH, ANI2x, QM9, and SPICE2]