# Dataset and curation
## Dataset module
The dataset module provides functions and classes to load datasets from curated HDF5 files, converting them to `TorchDataset` (a `torch.utils.data.Dataset`) or `TorchDataModule` (a PyTorch Lightning `LightningDataModule`) instances in order to train NNPs (neural network potentials).
The dataset module implements actions associated with data storage, caching, and retrieval, as well as the pipeline from the stored HDF5 files to the PyTorch dataset class that can be used for training.
The general workflow to interact with public datasets is as follows:

- obtaining the dataset
- processing the dataset and storing it in an HDF5 file with standard naming and units
- uploading the file to Zenodo and updating the retrieval link in the dataset implementation
The specific dataset classes like `QM9Dataset` or `SPICEDataset` download an HDF5 file with defined key names and values in a particular format from Zenodo and load the data into memory. The values in the dataset need to be specified in the [openMM unit system](http://docs.openmm.org/6.2.0/userguide/theory.html#units).
The public API for creating a `TorchDataset` is implemented in the specific data classes (e.g., `QM9Dataset`) and the `DatasetFactory`. The resulting `TorchDataset` can then be loaded into a PyTorch `DataLoader`.
```
modelforge
    datasets/                                # defines the interaction with public datasets
        dataset.py
            TorchDataset(torch.utils.data.Dataset)   # A custom dataset class to wrap numpy datasets for PyTorch.
                * __init__(self, dataset: np.ndarray, property_name: PropertyNames)
                * __len__()
                * __getitem__(self, idx: int) -> Dict[str, torch.Tensor]
            HDF5Dataset()                    # Base class for data stored in HDF5 format.
                * _{to|from}_file_cache()    # write/read high-performance numpy cache file (can change a lot)
                * _from_hdf5()               # read our HDF5 format (reproducible and archival) (also supports gzipped files)
                * _perform_transformations(label_transform: Optional[Dict[str, Callable]], transforms: Dict[str, Callable])
                                             # transform any entry of the dataset using a custom function
            DatasetFactory                   # Factory class for creating Dataset instances.
                * create_dataset(data: HDF5Dataset, label_transform: Optional[Dict[str, Callable]], transform: Optional[Dict[str, Callable]]) -> TorchDataset
                                             # Creates a TorchDataset instance given an HDF5Dataset.
            TorchDataModule(pl.LightningDataModule)  # A custom data module class to handle data loading and preparation for PyTorch Lightning training.
                * __init__(self, data: HDF5Dataset, SplittingStrategy: SplittingStrategy, batch_size)
                * prepare_data()
                * setup()
                * {train|val|test}_dataloader() -> DataLoader
        {qm9|spice|phalkethoh|ani1x|ani2x|tmqm}.py
            QM9Dataset(HDF5Dataset)          # Data class for handling QM9 data.
                * properties_of_interest() -> List[str]   # [getter|setter], entries in dataset that are retrieved
                * available_properties() -> List[str]     # list of available properties in the dataset
                * _download()                # Download the hdf5 file containing the data from source.
        transformation.py                    # transformation functions applied to entries in dataset
            * default_transformations
        utils.py
            RandomSplittingStrategy(SplittingStrategy)
                * split(dataset: TorchDataset) -> Tuple[Subset, Subset, Subset]   # Splits the provided dataset into training, validation, and testing subsets
```
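These pieces compose into a short pipeline: a dataset class fetches and caches the HDF5 file, the factory wraps it as a `TorchDataset`, and that feeds a `DataLoader`. The following is a minimal sketch of this flow; the module paths and constructor defaults are assumptions inferred from the outline above, not a verified API.

```python
# Minimal sketch of the dataset pipeline; module paths and constructor
# arguments are assumptions inferred from the outline above.
from torch.utils.data import DataLoader

from modelforge.dataset.dataset import DatasetFactory
from modelforge.dataset.qm9 import QM9Dataset

data = QM9Dataset()                           # downloads/caches the curated HDF5 file from Zenodo
factory = DatasetFactory()
torch_dataset = factory.create_dataset(data)  # HDF5Dataset -> TorchDataset

# TorchDataset subclasses torch.utils.data.Dataset, so it plugs directly into a
# DataLoader; each item is a Dict[str, torch.Tensor] (see __getitem__ above).
loader = DataLoader(torch_dataset, batch_size=64, shuffle=True)
```

For Lightning training, `TorchDataModule` wraps the same pieces together with a `SplittingStrategy` and exposes the usual `{train|val|test}_dataloader()` methods.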
## Curate module (modelforge.curate)
This section provides information related to the modelforge HDF5 schema 2.
The curate module provides functionality to retrieve source datasets and generate HDF5 datafiles with a consistent format and units, to be loaded by the dataset module. This module places an emphasis on validation at the time of constructing/curating the dataset to avoid inconsistencies or errors in the final HDF5 file.
The purpose of including this module in the package is to encapsulate all routines used to generate the input datafile, including any and all manipulation of the underlying data (e.g., unit conversion, summing of quantities, calculation of reference energy, etc.), to ensure transparency and reproducibility.
For efficient data writing/reading, configurations of the same system are grouped together into a single entry. The HDF5 files generated by the curate module have:
- the units associated with each entry, defined in the "u" attribute of each quantity. This uses openff-units compatible names.
- the property type of the data represented by the array, as the attribute "property_type": this defines whether a quantity represents, e.g., length, energy, force, etc. This is used for unit conversion/validation when using the datafile.
- the format of the data inside the array, as an attribute "format". This can take on the value "atomic_numbers", "per_system", "per_atom", or "meta_data". The "format" attribute allows the dataloader to be more general and thus work with different datasets where the names of the underlying quantities may vary, as the dataloader uses this information to know how to parse arrays.
  - "atomic_numbers" -- the atomic numbers of the system, with shape [n_atoms, 1]; e.g., methane: [[6], [1], [1], [1], [1]]. Note, since modelforge does not allow the atomic_numbers array to change within a system, this array applies to all configurations.
  - "per_system" -- a property that applies to the entire system being considered, with shape [n_configurations, *], where axis=0 indexes into the value for a given configuration. E.g., energy is a per_system property that would have shape [n_configurations, 1].
  - "per_atom" -- a property that applies to individual atoms, with shape [n_configurations, n_atoms, *]. E.g., positions are a per_atom property with shape [n_configurations, n_atoms, 3].
  - "meta_data" -- additional data about the system (e.g., SMILES); this type of data is not used by the data loader at this time.
A full description of the curate module can be found in the documentation: https://modelforge.readthedocs.io/en/latest/curate.html
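To make the schema concrete, the attributes described above can be inspected directly with h5py. In this sketch the file and record names are hypothetical placeholders; only the attribute names ("u", "property_type", "format") come from the schema description above.

```python
import h5py

# "qm9_dataset_v2.hdf5" and "my_system" are hypothetical names, used only for illustration.
with h5py.File("qm9_dataset_v2.hdf5", "r") as h5:
    record = h5["my_system"]  # one entry: all configurations of one system
    for name, dset in record.items():
        fmt = dset.attrs["format"]               # "atomic_numbers", "per_system", "per_atom", or "meta_data"
        unit = dset.attrs.get("u")               # openff-units compatible unit name, if the quantity has units
        ptype = dset.attrs.get("property_type")  # e.g., length, energy, force
        print(f"{name}: format={fmt}, u={unit}, property_type={ptype}")
```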
### Data format

Note: the following is outdated and will be updated.
As an example, let us load the first record for the QM9 dataset:

```python
from modelforge.curation.qm9_curation import QM9Curation

qm9_dataset = QM9Curation(
    hdf5_file_name="qm9_dataset.hdf5",
    output_file_dir="datasets/hdf5_files",
    local_cache_dir="datasets/qm9_dataset_raw",
)
qm9_dataset.process(max_records=1)
```
In all the curated datasets, a list named `data` is generated. Each entry in the list corresponds to a specific molecule, where the molecule information is stored as a dictionary.
For example, we can access all of the properties stored in the dataset as follows:

```python
for data_point in qm9_dataset.data:
    for key, val in data_point.items():
        print(f"{key} : {val} : {qm9_dataset._record_entries_series[key]}")
```
Note this also accesses the `_record_entries_series` dictionary in the dataset, which stores the descriptor discussed above.
Let us examine a small selection of the stored data to discuss the specific format and common elements between all datasets.
In all datasets, each entry in the `data` list will contain several keys:

- `name` -- unique identifying string of the molecule, typically taken from the original dataset
- `n_configs` -- number of configurations/conformers for the molecule
- `atomic_numbers` -- array of atomic numbers (in order) of the molecule
- `geometry` -- array of atomic positions of the conformers

`name` and `n_configs` are both considered to be of format `single_rec` (see above), as these values apply to all data in the molecule and are not conformer dependent.
```
name : dsgdb9nsd_000001 : <class 'str'> : single_rec
n_configs : 1 : <class 'int'> : single_rec
```
`atomic_numbers` is marked as `single_atom`, as this array applies to all conformers (the order of the atomic indices cannot change) but is also a per-atom property, hence why we consider it `single_atom` as opposed to `single_rec`. Note, as can be seen below, the shape of `atomic_numbers` is (n_atoms, 1), where in this case n_atoms=5. We defined this as (n_atoms, 1) instead of (n_atoms,) for consistency with other per-atom properties:
```
atomic_numbers :
[
 [6]
 [1]
 [1]
 [1]
 [1]
]
<class 'numpy.ndarray'>
single_atom
```
The `geometry` is of format `series_atom`, as we will have a unique set of coordinates for each conformer. This is of shape (n_configs, n_atoms, 3), which, since n_configs=1, is (1, 5, 3) here. Note that this is a numpy.ndarray with units attached (using openff-units, based on pint).
```
geometry :
[[
 [-0.0012698135899999999 0.10858041577999998 0.00080009958]
 [0.00021504159999999998 -0.0006031317599999999 0.00019761204]
 [0.10117308433 0.14637511618 2.7657479999999996e-05]
 [-0.05408150689999999 0.14475266137999998 -0.08766437151999999]
 [-0.05238136344999999 0.14379326442999998 0.09063972942]
]] nanometer :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_atom
```
Note, for data of format `series_atom`, the final dimension is variable. For example, the charges in this dataset are `series_atom`, but only a single charge is associated with each atom, rather than a vector of shape 3. Hence, we have an entry of shape (n_configs, n_atoms, 1).
```
charges :
[[
 [-0.535689]
 [0.133921]
 [0.133922]
 [0.133923]
 [0.133923]
]] elementary_charge :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_atom
```
Datasets will also contain information about the energy, although the name of this will depend on the dataset itself. For example, in QM9 we have `internal_energy_at_0K`, which is of format `series_mol`, meaning there will be a single unique value for each conformer, hence of shape (n_configs, 1) in this case:
```
internal_energy_at_0K :
[
 [-106277.4161215308]
] kilojoule_per_mole :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_mol
```
Again, the last dimension of the shape of `series_mol` entries is variable (and will be inferred during data load), and can represent not just a single float value per molecule, but also a vector. For example, the harmonic vibrational frequencies are of shape (n_configs, 9) in this case:
```
harmonic_vibrational_frequencies :
[
 [1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034 3151.6788 3151.7078]
] / centimeter :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_mol
```
This data array, along with the "format" information, is written to an HDF5 file in roughly the same general structure. HDF5 files can be accessed in a fashion very similar to dictionaries using h5py. The key differences in the data structure are as follows: the `name` field is used to create a top-level key in the HDF5 data structure, with properties stored one level below it. Units are no longer attached to values/arrays, but are instead stored in the attributes (`attrs`) associated with each property; the format (e.g., `series_mol`) is also stored as an attribute. A sketch of the hierarchy is as follows:
```
1- name
2-- property
3--- attrs: units as "u", format
```
The following script demonstrates how to access the data (although in general, users will not need to directly access files, as these will be automatically loaded in the dataset classes).
```python
import h5py

filename = "datasets/hdf5_files/qm9_dataset.hdf5"

with h5py.File(filename, "r") as h5:
    for molecule_name in h5.keys():
        print("molecule_name:", molecule_name)
        for property in h5[molecule_name].keys():
            print("-Property:", property)
            print(h5[molecule_name][property].attrs["format"])
            if "rec" not in h5[molecule_name][property].attrs["format"]:
                print(h5[molecule_name][property].shape)
                print(h5[molecule_name][property][()])
            if "u" in h5[molecule_name][property].attrs:
                print(h5[molecule_name][property].attrs["u"])
```
The first few outputs are as follows:
```
molecule_name: dsgdb9nsd_000001
-Property: atomic_numbers
single_atom
(5, 1)
[[6]
 [1]
 [1]
 [1]
 [1]]
-Property: charges
series_atom
(1, 5, 1)
[[[-0.535689]
  [ 0.133921]
  [ 0.133922]
  [ 0.133923]
  [ 0.133923]]]
elementary_charge
```
Note, in this format, units are written as strings; openff-units allows these to be easily reattached to the quantity of interest, simply by passing the string to `Quantity`.
```python
from openff.units import Quantity

value_without_units = h5[molecule_name][property][()]
units_string = h5[molecule_name][property].attrs["u"]
value_with_units = value_without_units * Quantity(units_string)
```
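Because the result is a pint-backed `Quantity`, standard pint unit conversion then applies. A small illustrative follow-up, assuming the property read above carries length units (e.g., `geometry`):

```python
# Convert the reattached quantity to a compatible unit (pint / openff-units API);
# "angstrom" is only valid here because geometry carries length units.
value_in_angstrom = value_with_units.to("angstrom")
print(value_in_angstrom.magnitude, value_in_angstrom.units)
```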