Xarray
Creating a DataArray
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(4, 3)
locs = ['IA', 'IL', 'IN']
times = pd.date_range('2000-01-01', periods=4)
foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])
Specifying coordinates and dimensions as tuples:
xr.DataArray(data, coords=[('time', times), ('space', locs)])
If only data is passed to DataArray(), coordinates and dimensions will be filled with default values
foo_default = xr.DataArray(data)
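A minimal sketch of what those defaults look like: xarray invents dimension names (`dim_0`, `dim_1`, ...) and leaves the coordinates empty.

```python
import numpy as np
import xarray as xr

data = np.random.rand(4, 3)
foo_default = xr.DataArray(data)

print(foo_default.dims)         # default dimension names: ('dim_0', 'dim_1')
print(len(foo_default.coords))  # 0 -- no coordinates assigned
```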
Good code example of loading tiff map grid files into an xarray dataset
Return a numpy ndarray from a DataArray
t = ds["sst"].data
type(t)
Out: numpy.ndarray
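A self-contained sketch of the same lookup; the `ds` here is a small hypothetical dataset standing in for a real SST file.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset standing in for a real sea-surface-temperature file
ds = xr.Dataset({"sst": (("time", "lat"), np.random.rand(3, 2))})

t = ds["sst"].data
print(type(t))  # <class 'numpy.ndarray'>
```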
Indexing and selecting xarray data
Extract an xarray data array from a dataset by label
This keeps the attribute information.
da = ds["sst"]
Numpy-style indexing of a data array
This preserves labels and metadata.
da[:, 20, 150]
Label-based indexing with .sel
By a single date
x.sel(time='2000-01-01')
or, equivalently, with label-based .loc indexing along the first dimension:
x.loc['2000-01-01']
By year
da.sel(lat=50.0, lon=200.0, time="2020")
By date range
ds.sel(time=slice("2019-05", "2020-07"))
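A runnable sketch of partial-string date slicing on synthetic monthly data (the dataset and variable name `sst` are hypothetical); note the slice endpoints are inclusive.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 24 months of fake data, Jan 2019 through Dec 2020
times = pd.date_range("2019-01-01", periods=24, freq="MS")
ds = xr.Dataset({"sst": ("time", np.random.rand(24))}, coords={"time": times})

# Partial-string slice: May 2019 through July 2020, inclusive
subset = ds.sel(time=slice("2019-05", "2020-07"))
print(subset.sizes["time"])  # 15 months
```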
Group by season or day of the week
# seasonal groups
ds.groupby("time.season")
Out:
DatasetGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
# day of the week groups
ds.groupby("time.dayofweek")
Out:
DatasetGroupBy, grouped over 'dayofweek'
7 groups with labels 0, 1, 2, 3, 4, 5, 6.
# The seasons are out of order (they are sorted alphabetically). This is a common annoyance; the fix is to use .reindex
seasonal_mean = ds.groupby("time.season").mean()
seasonal_mean = seasonal_mean.reindex(season=["DJF", "MAM", "JJA", "SON"])
seasonal_mean
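A fully runnable sketch of this workflow on a year of synthetic daily data (the `sst` variable is hypothetical):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=365, freq="D")
ds = xr.Dataset({"sst": ("time", np.random.rand(365))}, coords={"time": times})

seasonal_mean = ds.groupby("time.season").mean()
print(seasonal_mean.season.values)  # alphabetical: DJF, JJA, MAM, SON

# Put the seasons back in calendar order
seasonal_mean = seasonal_mean.reindex(season=["DJF", "MAM", "JJA", "SON"])
print(seasonal_mean.season.values)
```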
Make a facetgrid plot by season in one line of code
seasonal_mean.sst.plot(col="season", robust=True, cmap="turbo")
Nearest neighbor lookup
da.sel(lat=52.25, lon=200.8998, method="nearest")
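To show what `method="nearest"` does, here is a sketch on a tiny hypothetical grid: the requested point (52.25, 200.8998) snaps to the nearest coordinates (52.0, 201.0).

```python
import numpy as np
import xarray as xr

# Hypothetical 3x4 grid standing in for a real dataset
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    coords={"lat": [50.0, 52.0, 54.0], "lon": [200.0, 200.5, 201.0, 201.5]},
    dims=["lat", "lon"],
)

# 52.25 snaps to lat 52.0; 200.8998 snaps to lon 201.0
point = da.sel(lat=52.25, lon=200.8998, method="nearest")
print(float(point))  # 6.0 -- the value stored at (lat=52.0, lon=201.0)
```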
Positional indexing with isel
This plots the time series at integer position 60 along lat and position 40 along lon (positional indices, not coordinate values).
da.isel(lat=60, lon=40).plot()
Replace nans with where()
ds.sst.where(ds.sst.notnull(), -99)
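A minimal runnable sketch of the same fill (the dataset and the -99 sentinel are hypothetical); `fillna(-99)` would be an equivalent shortcut here.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"sst": ("x", [1.0, np.nan, 3.0])})

# Keep values where the condition is True; fill -99 elsewhere
filled = ds.sst.where(ds.sst.notnull(), -99)
print(filled.values)  # [  1. -99.   3.]
```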
Working with xarray attributes
Datasets can have attributes and DataArrays within Datasets can have their own attributes, such as units and long_name.
View all attributes for a dataset
ds.attrs
Return a specific attribute for a dataset
ds.attrs["citation"]
Return attributes for a DataArray in dataset ds
ds.sst.attrs
Set an arbitrary attribute on a dataarray
ds.sst.attrs["my_custom_attribute"] = "Foo Bar"
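Putting the attribute operations above together in one runnable sketch (the dataset, `citation`, and `units` values are all hypothetical):

```python
import numpy as np
import xarray as xr

# Dataset-level attrs live on the Dataset; variable-level attrs on each DataArray
ds = xr.Dataset({"sst": ("x", np.random.rand(3))},
                attrs={"citation": "Hypothetical dataset, 2024"})

ds.sst.attrs["units"] = "degC"
ds.sst.attrs["my_custom_attribute"] = "Foo Bar"

print(ds.attrs["citation"])
print(ds.sst.attrs)
```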
Doing math with xarray
- groupby: Bin data into groups and reduce
- resample: Groupby specialized for time axes. Either downsample or upsample your data.
- rolling: Operate on rolling windows of your data, e.g. running mean
- coarsen: Downsample your data
- weighted: Weight your data before applying reductions
Compute anomalies by subtracting the time mean:
x - x.mean(dim='time')
Rolling mean operation over time with a window size of 7
ds.sst.rolling(time=7).mean()
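A runnable sketch of the rolling mean on synthetic daily data; by default the first six entries are NaN because their 7-element windows are incomplete (`min_periods` would change that).

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-01-01", periods=10)
ds = xr.Dataset({"sst": ("time", np.arange(10.0))}, coords={"time": times})

# 7-day running mean
smoothed = ds.sst.rolling(time=7).mean()
print(smoothed.values)  # first 6 values are NaN; index 6 is mean(0..6) = 3.0
```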
Xarray Datasets
xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
Xarray metadata
Xarray draws a firm line between:
- Labels that the library understands: dims and coords
- Labels for users and user code: attrs
This also means that attrs are not automatically propagated through most operations unless explicitly requested (e.g. with keep_attrs=True).
Xarray and Pandas
Convert from xarray to Pandas
Need to specify dtype conversion:
arr = xr.DataArray([1, 2, 3])
pd.Series({'x': arr[0], 'mean': arr.mean(), 'std': arr.std()}, dtype=float)
# Returns:
x 1.000000
mean 2.000000
std 0.816497
dtype: float64
# Alternatively, use the item method or float constructor
pd.Series({'x': arr[0].item(), 'mean': float(arr.mean())})
# Returns:
x 1.0
mean 2.0
dtype: float64
Convert xarray to series and back while keeping labels
series = arr.to_series()
series.to_xarray()
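A self-contained sketch of the round trip on a labeled 2-D array: `to_series()` flattens it into a pandas Series with a MultiIndex, and `to_xarray()` rebuilds the labeled array.

```python
import numpy as np
import xarray as xr

arr2d = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": ["a", "b"], "y": [10, 20, 30]},
    dims=["x", "y"],
)

series = arr2d.to_series()      # pandas Series with a MultiIndex (x, y)
restored = series.to_xarray()   # back to a labeled DataArray
print(series.index.names)       # ['x', 'y']
```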
When to use Pandas instead of xarray
From xarray docs:
That said, you should only bother with xarray if some aspect of data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, pandas is usually the right choice: it has better performance for common operations such as groupby and you’ll find far more usage examples online.
XClim
xclim is a library of functions to compute climate indices. It is built using xarray and can benefit from the parallelization handling provided by dask. Its objective is to make it as simple as possible for users to compute indices from large climate datasets and for scientists to write new indices with very little boilerplate.
xclim is built on very powerful multiprocessing and distributed computation libraries, notably xarray and dask.
xclim and xarray Workflow Examples
Xarray usage with image data and netCDF
Storing JPEG compressed images with xarray and netCDF - Github - In order to interact with the netCDF library we use the open source package xarray. NetCDF allows us to store n-dim arrays with labeled coordinates, but for this example we won't use any labels. We simply store the image as a numpy array. In order to reduce memory we use xarray's zlib compression.
The one cool trick: since xarray does not yet support JPEG compression, we had to come up with an alternative. Instead of storing the image as an n-dim array, we apply JPEG compression as if we were to store a jpeg file. But instead of storing the resulting byte string on disk, we put it in a numpy array and create an xarray dataset with it. This dataset can be stored in a netCDF file with as much metadata and as many labels as you need.
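The byte-string trick can be sketched with only the standard library by substituting zlib compression for JPEG encoding (actual JPEG encoding would need an imaging library such as Pillow); the variable names and the `shape`/`dtype` attrs are illustrative, not from the original project.

```python
import zlib
import numpy as np
import xarray as xr

# Fake "image": raw RGB pixel bytes
img = np.random.randint(0, 255, (32, 32, 3)).astype(np.uint8)

# Compress the raw bytes (stand-in for JPEG encoding) and wrap them
# in a 1-D uint8 array so xarray can store them like any other array
compressed = np.frombuffer(zlib.compress(img.tobytes()), dtype=np.uint8)
ds = xr.Dataset({"image_bytes": ("byte", compressed)},
                attrs={"shape": img.shape, "dtype": str(img.dtype)})

# Round trip: decompress and reshape using the stored metadata
restored = np.frombuffer(zlib.decompress(ds["image_bytes"].data.tobytes()),
                         dtype=np.uint8).reshape(ds.attrs["shape"])
print(np.array_equal(restored, img))  # True
```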
Articles
Thoughts on the state of Xarray within the broader scientific Python ecosystem