Dask and Xarray - BKJackson/BKJackson_Wiki GitHub Wiki
xarray and Dask
(xarray used to be called xray)
Nearly all existing xarray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with Dask arrays. When you load data as a Dask array in an xarray data structure, almost all xarray operations will keep it as a Dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Converting a Dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xarray will automatically load coordinate labels into memory.
Make a Dask cluster and load a dataset

```python
import distributed
import xarray as xr

cluster = distributed.LocalCluster()
client = distributed.Client(cluster)  # connect a client so computations run on the cluster

# demonstrate a Dask-backed dataset
dasky = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={"time": 10},  # 10 time steps per block; passing chunks tells xarray to return a Dask-backed dataset
)
dasky.air

# demonstrate a lazy mean (no computation happens yet)
dasky.air.mean("lat")

# "compute" the mean
dasky.air.mean("lat").compute()
```
Data selection with xarray

```python
ds.sel(time='2014-12-11').max(['latitude', 'longitude'])
```
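To make the selection above concrete, here is a minimal sketch using a small hypothetical in-memory dataset (the variable name and coordinate values are made up); the same `.sel(...).max(...)` call works unchanged on a Dask-backed dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for `ds`: one month of daily data on a tiny grid
times = pd.date_range("2014-12-01", periods=31)
ds = xr.Dataset(
    {"temperature": (("time", "latitude", "longitude"),
                     np.random.rand(31, 4, 5))},
    coords={
        "time": times,
        "latitude": [60.0, 50.0, 40.0, 30.0],
        "longitude": [0.0, 10.0, 20.0, 30.0, 40.0],
    },
)

# Label-based selection of one day, then reduce over the spatial dimensions
daily_max = ds.sel(time="2014-12-11").max(["latitude", "longitude"])
```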
Serialize Dask data to and from disk

```python
ds.to_netcdf('my-dataset_.nc')
saved_ds = xr.open_dataset('my-dataset_.nc')
assert saved_ds.equals(ds)
```
Pass Dask data to a Pandas dataframe

```python
df = ds.to_dataframe()
```
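As a small runnable sketch of this conversion (the dataset here is hypothetical): `to_dataframe` loads the data and flattens the dimensions into a pandas MultiIndex, with one column per data variable.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical two-dimensional dataset
ds = xr.Dataset(
    {"temperature": (("time", "site"), np.arange(6.0).reshape(3, 2))},
    coords={"time": pd.date_range("2015-01-01", periods=3), "site": ["a", "b"]},
)

# The resulting frame is indexed by (time, site)
df = ds.to_dataframe()
```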
Reading and writing data with xarray and Dask

This assumes NetCDF data.

```python
# Open one file with chunking
ds = xr.open_dataset('example-data.nc', chunks={'time': 10})

# Open multiple files, automatically concatenating and merging them into a
# single dataset backed by Dask arrays
xr.open_mfdataset('my/files/*.nc')
```
Print a Dask array

```python
In [3]: ds.temperature
Out[3]:
<xarray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
dask.array<shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01 2015-01-02 ... 2015-12-31
  * longitude  (longitude) int64 0 1 2 3 4 5 6 7 ... 353 354 355 356 357 358 359
  * latitude   (latitude) float64 89.5 88.5 87.5 86.5 ... -87.5 -88.5 -89.5
```
Write a Dask array that's too big to fit into memory

```python
ds.to_netcdf('manipulated-example-data.nc')
```
Using Dask delayed objects to view a progress bar

```python
In [5]: from dask.diagnostics import ProgressBar
# or use distributed.progress when running on the distributed scheduler

In [6]: delayed_obj = ds.to_netcdf('manipulated-example-data.nc', compute=False)

In [7]: with ProgressBar():
   ...:     results = delayed_obj.compute()
   ...:
[                                        ] | 0% Completed | 0.0s
[#                                       ] | 3% Completed | 0.2s
[################                        ] | 40% Completed | 0.3s
[############################            ] | 72% Completed | 0.4s
[####################################### ] | 98% Completed | 1.3s
[########################################] | 100% Completed | 1.4s
```
Convert a dataset to a Dask DataFrame

```python
df = ds.to_dask_dataframe()
```

Dask DataFrames do not support multi-indexes, so the coordinate variables from the dataset are included as columns in the Dask DataFrame.
Convert an xarray data structure from lazy Dask arrays into in-memory NumPy arrays

Any of these methods works:

```python
ds.load()                   # compute and load every variable in place
ds.temperature.values       # compute one variable and return a NumPy array
np.asarray(ds.temperature)  # likewise, via the NumPy conversion protocol
```
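A minimal sketch contrasting the three methods on a small hypothetical chunked array (assumes Dask is installed): `.values` and `np.asarray` return plain NumPy arrays, while `.load()` computes in place and keeps the xarray wrapper.

```python
import numpy as np
import xarray as xr

# Hypothetical Dask-backed DataArray
da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("x", "y")).chunk({"x": 1})

arr1 = da.values       # computes and returns a NumPy array
arr2 = np.asarray(da)  # likewise, via the NumPy conversion protocol
loaded = da.load()     # computes and stores the result back in the DataArray
```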
Load Dask data into memory, keeping the arrays as Dask arrays

This is particularly useful with a distributed cluster, because the data will be loaded into distributed memory across your machines and will be much faster to use than reading repeatedly from disk. Be warned that on a single machine this operation will try to load all of your data into memory, so make sure your dataset is not larger than available memory.

```python
# Load all of the data
ds = ds.persist()
```
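A minimal single-machine sketch (hypothetical data, small enough to fit in memory): after `persist`, the values have been computed but the variable is still backed by a Dask array.

```python
import numpy as np
import xarray as xr
import dask.array

# Hypothetical chunked dataset
ds = xr.Dataset({"t": (("x",), np.arange(8.0))}).chunk({"x": 4})

ds = ds.persist()  # compute the chunks, but keep them as a Dask array
```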
Convert xarray data to Dask arrays with chunking

```python
rechunked = ds.chunk({'latitude': 100, 'longitude': 100})
```

View the size of existing chunks on an array

```python
rechunked.chunks
```
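A small sketch of what `.chunks` reports, using a hypothetical array (assumes Dask is installed): one tuple of block sizes per dimension.

```python
import numpy as np
import xarray as xr

# Hypothetical 6 x 10 array split into 3 x 5 blocks
da = xr.DataArray(np.zeros((6, 10)), dims=("latitude", "longitude")).chunk(
    {"latitude": 3, "longitude": 5}
)

da.chunks  # one tuple of block sizes per dimension: ((3, 3), (5, 5))
```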
Apply NumPy ufuncs (like np.sin) to xarray objects that store lazy Dask arrays

```python
import xarray.ufuncs as xu
xu.sin(rechunked)
```

Note: in recent xarray versions the xarray.ufuncs module has been deprecated and removed; NumPy ufuncs can be applied directly, e.g. np.sin(rechunked).

Access the Dask array directly

```python
ds.temperature.data
```
Automatic parallelization with xarray and Dask
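One building block of this automatic parallelization is `xr.apply_ufunc` with `dask="parallelized"`, which applies a NumPy function chunk by chunk without loading the whole array. A minimal sketch on a hypothetical chunked array (not taken from the page above):

```python
import numpy as np
import xarray as xr

# Hypothetical Dask-backed DataArray
da = xr.DataArray(
    np.linspace(0, np.pi, 12).reshape(3, 4), dims=("x", "y")
).chunk({"x": 1})

# Apply np.sin lazily, one chunk at a time
result = xr.apply_ufunc(
    np.sin,
    da,
    dask="parallelized",
    output_dtypes=[da.dtype],
)
computed = result.compute()
```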
Bottleneck

Use with bottleneck, a collection of fast NumPy array functions written in C. Only arrays with data type (dtype) int32, int64, float32, or float64 are accelerated.

```python
import numpy as np
import bottleneck as bn

a = np.array([1, 2, np.nan, 4, 5])

# Find the mean, ignoring NaNs
bn.nanmean(a)

# Find the moving-window mean
bn.move_mean(a, window=2, min_count=1)

# Benchmark bottleneck against NumPy
bn.bench_detailed("move_median", fraction_nan=0.3)
```
xray + dask: out-of-core, labeled arrays in Python - Dask is a Python library that extends NumPy to out-of-core datasets by blocking arrays into small chunks and executing on those chunks in parallel. It allows xray to easily process large data and simultaneously make use of all available CPU resources.
Parallel computing with xarray and Dask
Xarray related projects

Xarray geoscience projects

- xshape - Tools for working with shapefiles, topographies, and polygons in xarray
- xarray-simlab - Provides a framework for easily building custom computational models from a set of modular components (i.e., Python classes), called processes.
- xsimlab example for Landscape Evolution Modeling - Also uses Holoviews
- xESMF - Universal Regridder for Geospatial Data
- sklearn-xarray - An open-source Python package that combines the n-dimensional labeled arrays of xarray with the machine learning and model selection tools of scikit-learn. The package contains wrappers that allow the user to apply scikit-learn estimators to xarray types without losing their labels.
- Elm (Ensemble Learning Models) - A set of tools for creating multiple unsupervised and supervised machine learning models and training them in parallel on datasets too large to fit into the RAM of a single machine, with a focus on applications in climate science, GIS, and satellite imagery. Elm draws from Dask, Sklearn, and xarray.
- MetPy - A collection of tools in Python for reading, visualizing, and performing calculations with weather data.
- EarthML - Machine learning and visualization in Python for Earth science
- GrabCut - The GrabCut algorithm provides a way to annotate an image using polygons or lines to demarcate the foreground and background.