ML GeoDatasets - jejjohnson/ml4eo GitHub Wiki
Concepts
- NonGeoDatasets
- GeoDatasets
- Transforms
- DataLoaders
- DataModules
Datasets
- Numpy Arrays
- GeoTiffs
GeoDatasets
These are geoscience-oriented datasets. In most cases, that means the data values are stored together with their meta-data, in particular the coordinates and geospatial positioning. Typical tools for working with them are xarray, rasterio, and geopandas.
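As a minimal sketch of what "values plus meta-data" can look like in practice, the snippet below builds a small xarray `DataArray` by hand. The variable name, coordinate values, and the `units` attribute are made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x4 grid of values with latitude/longitude coordinates.
values = np.arange(12, dtype=np.float32).reshape(3, 4)
da = xr.DataArray(
    values,
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0, 30.0], "lon": [0.0, 5.0, 10.0, 15.0]},
    name="variable",
    attrs={"units": "K"},  # illustrative attribute, not from the source
)

# Coordinate-aware selection: pick the cell at lat=20, lon=5.
cell = da.sel(lat=20.0, lon=5.0)

# Dropping to a raw array keeps the values but loses coordinates and attributes.
raw = da.values
```

The last line is the step that turns a geo-dataset into a non-geo one: `raw` is a plain numpy array with no notion of where each value sits on the globe.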
GeoTiff
import rioxarray  # registers the .rio accessor on xarray objects
# save dataset as a GeoTIFF
xds["name_of_variable"].rio.to_raster("path/to/file/file.tif")
NetCDF
# save dataset
xds["variable"].to_netcdf("path/to/file/file.nc")
- xrpatcher
- TorchGeo - Quickie Tutorial
- Raster-Vision
Non-GeoDatasets
These cover the data types used in most standard ML scenarios: images for discrete data and numpy arrays for continuous data. Such datasets hold the values but do not keep the meta-data within the same object.
Numpy Array
import numpy as np
# convert to array
np_data: np.ndarray = xds["variable"].values.astype(np.float32)
# save the array to a .npy file
np.save("path/to/file/file.npy", np_data)
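Since a `.npy` file stores only the values, one option is to keep the coordinates alongside them as separate named arrays in a single `.npz` archive. The following is a sketch under that assumption; the array contents and file names are invented for the example.

```python
import os
import tempfile

import numpy as np

# Illustrative values and 1D coordinate arrays (made up for this sketch).
np_data = np.random.default_rng(0).normal(size=(3, 4)).astype(np.float32)
lat = np.array([10.0, 20.0, 30.0])
lon = np.array([0.0, 5.0, 10.0, 15.0])

with tempfile.TemporaryDirectory() as tmpdir:
    # Values only: the .npy file carries shape and dtype, nothing else.
    npy_path = os.path.join(tmpdir, "file.npy")
    np.save(npy_path, np_data)
    loaded = np.load(npy_path)

    # Values plus coordinates: separate named arrays in one .npz archive.
    npz_path = os.path.join(tmpdir, "file.npz")
    np.savez(npz_path, data=np_data, lat=lat, lon=lon)
    with np.load(npz_path) as archive:
        loaded_data = archive["data"]
        loaded_lat = archive["lat"]
        loaded_lon = archive["lon"]
```

The link between `data` and its coordinates is now a naming convention rather than part of the data structure, which is the trade-off of leaving the geo-dataset formats.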
Generic Image
import numpy as np
from imageio import imsave
# convert to image
image: np.ndarray = xds["variable"].values.astype(np.uint8)
# save as image
imsave(save_path, image)
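One caveat with a bare `astype(np.uint8)` cast: it does not rescale, so continuous data outside the range [0, 255] will be truncated or wrap around. Below is a minimal sketch of a min-max rescale applied before the cast. The helper name `to_uint8` is hypothetical and not part of any library, and min-max scaling is only one of several reasonable choices.

```python
import numpy as np


def to_uint8(x: np.ndarray) -> np.ndarray:
    """Min-max rescale an array to [0, 255] and cast to uint8 (hypothetical helper)."""
    x = x.astype(np.float64)
    lo, hi = np.nanmin(x), np.nanmax(x)
    if hi == lo:
        # Constant input: avoid division by zero and return all zeros.
        return np.zeros(x.shape, dtype=np.uint8)
    scaled = (x - lo) / (hi - lo) * 255.0
    # NaNs (e.g. masked pixels) are mapped to 0 before the cast.
    return np.nan_to_num(np.round(scaled), nan=0.0).astype(np.uint8)


# Small illustrative field with values in [-1, 1].
field = np.array([[-1.0, 0.0], [0.5, 1.0]], dtype=np.float32)
image = to_uint8(field)
```

The resulting `image` spans the full 0 to 255 range and can be passed to an image writer such as `imsave`.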
- JAX - Minimal Dataset and Dataloader in Repo
- jaxonloader | hydrax | jax-dataloader
- Dataset and DataLoader for Coordinate-Based Data - Merlin DataLoader
- Framework Agnostic - mlx-data
- PyTorch - Torch.utils.data
- TensorFlow - TensorFlow-Datasets
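The libraries listed above share a common pattern: a dataset object that returns one item per index, and a loader that groups items into mini-batches. The sketch below shows that pattern with numpy only. `ArrayDataset` and `iterate_batches` are hypothetical names for illustration and do not reproduce the API of any of the listed libraries.

```python
from typing import Iterator, Tuple

import numpy as np


class ArrayDataset:
    """Map-style dataset over paired arrays (hypothetical, framework-free)."""

    def __init__(self, x: np.ndarray, y: np.ndarray):
        assert len(x) == len(y), "inputs and targets must have the same length"
        self.x, self.y = x, y

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, idx) -> Tuple[np.ndarray, np.ndarray]:
        return self.x[idx], self.y[idx]


def iterate_batches(
    dataset: ArrayDataset, batch_size: int, shuffle: bool = False, seed: int = 0
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (x, y) mini-batches; the last batch may be smaller."""
    indices = np.arange(len(dataset))
    if shuffle:
        indices = np.random.default_rng(seed).permutation(indices)
    for start in range(0, len(indices), batch_size):
        # Index with an array of positions to gather one batch at a time.
        yield dataset[indices[start:start + batch_size]]


x = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.float32)
dataset = ArrayDataset(x, y)
batches = list(iterate_batches(dataset, batch_size=4))
```

With 10 items and a batch size of 4, this produces two full batches and a final batch of 2. Framework loaders add features such as parallel workers and prefetching on top of this basic loop.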
Transforms
These are transformations applied on-the-fly as each item of a dataset is loaded. Many libraries provide them:
- Framework Agnostic (Numpy) - Albumentations
- PyTorch - TorchVision is probably the best-known library for transformations.
- Keras - Keras Preprocessing Layers | KerasCV
- TensorFlow
- JAX - PIX
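To make "on-the-fly" concrete, here is a minimal numpy sketch in which transforms are chained into one callable and applied only when an item is requested. All names (`compose`, `standardize`, `flip_lr`, `TransformedDataset`) are hypothetical and are not taken from any of the libraries listed above.

```python
from typing import Callable, Sequence

import numpy as np

Transform = Callable[[np.ndarray], np.ndarray]


def compose(transforms: Sequence[Transform]) -> Transform:
    """Chain transforms left to right into a single callable."""
    def apply(x: np.ndarray) -> np.ndarray:
        for t in transforms:
            x = t(x)
        return x
    return apply


def standardize(x: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance scaling per item (epsilon avoids division by zero)."""
    return (x - x.mean()) / (x.std() + 1e-8)


def flip_lr(x: np.ndarray) -> np.ndarray:
    """Flip along the last axis."""
    return x[..., ::-1]


class TransformedDataset:
    """Applies `transform` lazily, each time an item is requested."""

    def __init__(self, data: np.ndarray, transform: Transform):
        self.data, self.transform = data, transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> np.ndarray:
        return self.transform(self.data[idx])


data = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
dataset = TransformedDataset(data, compose([standardize, flip_lr]))
item = dataset[0]
```

The stored `data` array is never modified; each access recomputes the transformed item, which is what allows random augmentations to differ between epochs.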