Dask and Xarray - BKJackson/BKJackson_Wiki GitHub Wiki
xarray and Dask
(xarray used to be called xray)
Nearly all existing xarray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with Dask arrays. When you load data as a Dask array in an xarray data structure, almost all xarray operations will keep it as a Dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Converting a Dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xarray will automatically load coordinate labels into memory.
Make a Dask cluster and load a dataset

```python
import distributed
import xarray as xr

cluster = distributed.LocalCluster()
client = distributed.Client(cluster)  # connect a client so computations run on the cluster

# demonstrate a Dask-backed dataset
dasky = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={"time": 10},  # 10 time steps per block; passing chunks tells xarray to return a Dask-backed dataset
)
dasky.air

# demonstrate a lazy mean (no computation happens yet)
dasky.air.mean("lat")

# "compute" the mean
dasky.air.mean("lat").compute()
```
Data selection with xarray

```python
ds.sel(time='2014-12-11').max(['latitude', 'longitude'])
```
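To make the selection above concrete, here is a minimal sketch using a small hypothetical in-memory dataset (the variable name and coordinate values are made up); the same `.sel(...).max(...)` call works unchanged on a Dask-backed dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for `ds`: one month of daily data on a tiny grid
times = pd.date_range("2014-12-01", periods=31)
ds = xr.Dataset(
    {"temperature": (("time", "latitude", "longitude"),
                     np.random.rand(31, 4, 5))},
    coords={
        "time": times,
        "latitude": [60.0, 50.0, 40.0, 30.0],
        "longitude": [0.0, 10.0, 20.0, 30.0, 40.0],
    },
)

# Label-based selection of one day, then reduce over the spatial dimensions
daily_max = ds.sel(time="2014-12-11").max(["latitude", "longitude"])
```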
Serialize Dask data to and from disk

```python
ds.to_netcdf('my-dataset_.nc')
saved_ds = xr.open_dataset('my-dataset_.nc')
assert saved_ds.equals(ds)
```
Pass Dask data to a Pandas dataframe

```python
df = ds.to_dataframe()
```
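As a small runnable sketch of this conversion (the dataset here is hypothetical): `to_dataframe` loads the data and flattens the dimensions into a pandas MultiIndex, with one column per data variable.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical two-dimensional dataset
ds = xr.Dataset(
    {"temperature": (("time", "site"), np.arange(6.0).reshape(3, 2))},
    coords={"time": pd.date_range("2015-01-01", periods=3), "site": ["a", "b"]},
)

# The resulting frame is indexed by (time, site)
df = ds.to_dataframe()
```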
Reading and writing data with xarray and Dask

This assumes NetCDF data.

```python
# Open one file with chunking
ds = xr.open_dataset('example-data.nc', chunks={'time': 10})

# Open multiple files, automatically concatenating and merging them into a
# single dataset backed by Dask arrays
xr.open_mfdataset('my/files/*.nc')
```
Print a Dask array

```python
In [3]: ds.temperature
Out[3]:
<xarray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
dask.array<shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01 2015-01-02 ... 2015-12-31
  * longitude  (longitude) int64 0 1 2 3 4 5 6 7 ... 353 354 355 356 357 358 359
  * latitude   (latitude) float64 89.5 88.5 87.5 86.5 ... -87.5 -88.5 -89.5
```
Write a Dask array that's too big to fit into memory

```python
ds.to_netcdf('manipulated-example-data.nc')
```
Using Dask delayed objects to view a progress bar

```python
In [5]: from dask.diagnostics import ProgressBar
# or use distributed.progress when running on the distributed scheduler

In [6]: delayed_obj = ds.to_netcdf('manipulated-example-data.nc', compute=False)

In [7]: with ProgressBar():
   ...:     results = delayed_obj.compute()
   ...:
[                                        ] | 0% Completed | 0.0s
[#                                       ] | 3% Completed | 0.2s
[################                        ] | 40% Completed | 0.3s
[############################            ] | 72% Completed | 0.4s
[####################################### ] | 98% Completed | 1.3s
[########################################] | 100% Completed | 1.4s
```
Convert a dataset to a Dask DataFrame

```python
df = ds.to_dask_dataframe()
```

Dask DataFrames do not support multi-indexes, so the coordinate variables from the dataset are included as columns in the Dask DataFrame.
Convert an xarray data structure from lazy Dask arrays into in-memory NumPy arrays

Any of these methods works:

```python
ds.load()                   # compute and load every variable in place
ds.temperature.values       # compute one variable and return a NumPy array
np.asarray(ds.temperature)  # likewise, via the NumPy conversion protocol
```
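A minimal sketch contrasting the three methods on a small hypothetical chunked array (assumes Dask is installed): `.values` and `np.asarray` return plain NumPy arrays, while `.load()` computes in place and keeps the xarray wrapper.

```python
import numpy as np
import xarray as xr

# Hypothetical Dask-backed DataArray
da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("x", "y")).chunk({"x": 1})

arr1 = da.values       # computes and returns a NumPy array
arr2 = np.asarray(da)  # likewise, via the NumPy conversion protocol
loaded = da.load()     # computes and stores the result back in the DataArray
```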
Load Dask data into memory, keeping the arrays as Dask arrays

This is particularly useful with a distributed cluster, because the data will be loaded into distributed memory across your machines and will be much faster to use than reading repeatedly from disk. Be warned that on a single machine this operation will try to load all of your data into memory, so make sure your dataset is not larger than available memory.

```python
# Load all of the data
ds = ds.persist()
```
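A minimal single-machine sketch (hypothetical data, small enough to fit in memory): after `persist`, the values have been computed but the variable is still backed by a Dask array.

```python
import numpy as np
import xarray as xr
import dask.array

# Hypothetical chunked dataset
ds = xr.Dataset({"t": (("x",), np.arange(8.0))}).chunk({"x": 4})

ds = ds.persist()  # compute the chunks, but keep them as a Dask array
```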
Convert xarray data to Dask arrays with chunking

```python
rechunked = ds.chunk({'latitude': 100, 'longitude': 100})
```

View the size of existing chunks on an array

```python
rechunked.chunks
```
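A small sketch of what `.chunks` reports, using a hypothetical array (assumes Dask is installed): one tuple of block sizes per dimension.

```python
import numpy as np
import xarray as xr

# Hypothetical 6 x 10 array split into 3 x 5 blocks
da = xr.DataArray(np.zeros((6, 10)), dims=("latitude", "longitude")).chunk(
    {"latitude": 3, "longitude": 5}
)

da.chunks  # one tuple of block sizes per dimension: ((3, 3), (5, 5))
```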
Apply NumPy ufuncs (like np.sin) to xarray objects that store lazy Dask arrays

```python
import xarray.ufuncs as xu
xu.sin(rechunked)
```

Note: in recent xarray versions the xarray.ufuncs module has been deprecated and removed; NumPy ufuncs can be applied directly, e.g. np.sin(rechunked).

Access the Dask array directly

```python
ds.temperature.data
```
Automatic parallelization with xarray and Dask
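One building block of this automatic parallelization is `xr.apply_ufunc` with `dask="parallelized"`, which applies a NumPy function chunk by chunk without loading the whole array. A minimal sketch on a hypothetical chunked array (not taken from the page above):

```python
import numpy as np
import xarray as xr

# Hypothetical Dask-backed DataArray
da = xr.DataArray(
    np.linspace(0, np.pi, 12).reshape(3, 4), dims=("x", "y")
).chunk({"x": 1})

# Apply np.sin lazily, one chunk at a time
result = xr.apply_ufunc(
    np.sin,
    da,
    dask="parallelized",
    output_dtypes=[da.dtype],
)
computed = result.compute()
```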
Bottleneck

Use with bottleneck, a collection of fast NumPy array functions written in C. Only arrays with data type (dtype) int32, int64, float32, or float64 are accelerated.

```python
import numpy as np
import bottleneck as bn

a = np.array([1, 2, np.nan, 4, 5])

# Find the mean, ignoring NaNs
bn.nanmean(a)

# Find the moving-window mean
bn.move_mean(a, window=2, min_count=1)

# Benchmark bottleneck against NumPy
bn.bench_detailed("move_median", fraction_nan=0.3)
```
xray + dask: out-of-core, labeled arrays in Python - Dask is a Python library that extends NumPy to out-of-core datasets by blocking arrays into small chunks and executing on those chunks in parallel. It allows xray to easily process large data and simultaneously make use of all available CPU resources.
Parallel computing with xarray and Dask
Xarray related projects

Xarray geoscience projects

- xshape - Tools for working with shapefiles, topographies, and polygons in xarray
- xarray-simlab - Provides a framework for easily building custom computational models from a set of modular components (i.e., Python classes), called processes.
- xsimlab example for Landscape Evolution Modeling - Also uses Holoviews
- xESMF - Universal Regridder for Geospatial Data
- sklearn-xarray - An open-source Python package that combines the n-dimensional labeled arrays of xarray with the machine learning and model selection tools of scikit-learn. The package contains wrappers that allow the user to apply scikit-learn estimators to xarray types without losing their labels.
- Elm (Ensemble Learning Models) - A set of tools for creating multiple unsupervised and supervised machine learning models and training them in parallel on datasets too large to fit into the RAM of a single machine, with a focus on applications in climate science, GIS, and satellite imagery. Elm draws from Dask, Sklearn, and xarray.
- MetPy - A collection of tools in Python for reading, visualizing, and performing calculations with weather data.
- EarthML - Machine learning and visualization in Python for Earth science
- GrabCut - The GrabCut algorithm provides a way to annotate an image using polygons or lines to demarcate the foreground and background.