Xarray - BKJackson/BKJackson_Wiki GitHub Wiki

Creating a DataArray

import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(4, 3)
locs = ['IA', 'IL', 'IN']
times = pd.date_range('2000-01-01', periods=4)
foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])

Specifying coordinates and dimensions as tuples:

xr.DataArray(data, coords=[('time', times), ('space', locs)])

If only data is passed to DataArray(), coordinates and dimensions will be filled with default values

foo_default = xr.DataArray(data)
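
A minimal sketch of what those defaults look like: with no coords or dims supplied, xarray names the dimensions `dim_0`, `dim_1`, ... and attaches no coordinates.

```python
import numpy as np
import xarray as xr

data = np.random.rand(4, 3)
foo_default = xr.DataArray(data)

# Dimensions get placeholder names and there are no coordinates
print(foo_default.dims)          # ('dim_0', 'dim_1')
print(len(foo_default.coords))   # 0
```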

Good code example of loading TIFF map grid files into an xarray dataset

Return a numpy ndarray from a DataArray

t = ds["sst"].data    
type(t)  

Out: numpy.ndarray  

Indexing and selecting xarray data

Extract an xarray data array from a dataset by label

This keeps the attribute information.

da = ds["sst"]  

Numpy-style indexing of a data array

This preserves labels and metadata.

da[:, 20, 150]
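
To make that concrete, here is a sketch with a smaller toy array (the coordinate values are made up): positional indexing works like numpy, but the result still carries its coordinate labels.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    coords={"time": [0, 1], "lat": [10, 20, 30], "lon": [100, 110, 120, 130]},
    dims=["time", "lat", "lon"],
)

# Slice positionally; the result is labeled with the matching coordinates
sub = da[:, 1, 2]
print(sub.dims)           # ('time',)
print(float(sub["lat"]))  # 20.0 -- the label survives the positional index
```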

Label-based indexing with .sel

By a single date

x.sel(time='2000-01-01')  

Or, to aggregate over the whole time dimension instead of selecting a single date:

x.mean(dim='time')  

By year

da.sel(lat=50.0, lon=200.0, time="2020")  

By date range

ds.sel(time=slice("2019-05", "2020-07"))  
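
A runnable sketch of the slice above, using a toy monthly dataset in place of `ds` (the variable name `sst` is just a stand-in). Note that partial-string slices are inclusive on both ends.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 24 monthly time steps: Jan 2019 through Dec 2020
times = pd.date_range("2019-01-01", periods=24, freq="MS")
ds = xr.Dataset({"sst": ("time", np.random.rand(24))}, coords={"time": times})

# Selects May 2019 through July 2020, inclusive
subset = ds.sel(time=slice("2019-05", "2020-07"))
print(subset.sizes["time"])  # 15 months
```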

Group by season or day of the week

# seasonal groups
ds.groupby("time.season")

Out:
DatasetGroupBy, grouped over 'season' 
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.  

# day of the week groups
ds.groupby("time.dayofweek")

Out: 
DatasetGroupBy, grouped over 'dayofweek' 
7 groups with labels 0, 1, 2, 3, 4, 5, 6.  

# The seasons are out of order (they are alphabetically sorted). This is a common annoyance. The solution is to use .reindex
seasonal_mean = seasonal_mean.reindex(season=["DJF", "MAM", "JJA", "SON"])
seasonal_mean
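
An end-to-end sketch of the groupby-then-reindex pattern, assuming `seasonal_mean` comes from a seasonal groupby mean over a dataset with a daily time coordinate (the data here are synthetic):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=366, freq="D")
ds = xr.Dataset({"sst": ("time", np.random.rand(366))}, coords={"time": times})

# groupby sorts the season labels alphabetically: DJF, JJA, MAM, SON
seasonal_mean = ds.groupby("time.season").mean()
print(list(seasonal_mean["season"].values))  # ['DJF', 'JJA', 'MAM', 'SON']

# reindex restores calendar order
seasonal_mean = seasonal_mean.reindex(season=["DJF", "MAM", "JJA", "SON"])
print(list(seasonal_mean["season"].values))  # ['DJF', 'MAM', 'JJA', 'SON']
```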

Make a facetgrid plot by season in one line of code

seasonal_mean.sst.plot(col="season", robust=True, cmap="turbo")

Nearest neighbor lookup

da.sel(lat=52.25, lon=200.8998, method="nearest")
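
A self-contained sketch with a tiny made-up grid: `method="nearest"` snaps the requested point to the closest grid coordinates rather than requiring an exact label match.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(9.0).reshape(3, 3),
    coords={"lat": [50.0, 52.0, 54.0], "lon": [200.0, 201.0, 202.0]},
    dims=["lat", "lon"],
)

# 52.25 snaps to lat=52.0 and 200.8998 snaps to lon=201.0
point = da.sel(lat=52.25, lon=200.8998, method="nearest")
print(float(point["lat"]), float(point["lon"]))  # 52.0 201.0
print(float(point))                              # 4.0
```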

Positional indexing with isel

This plots a time series at the grid point with integer positions lat=60, lon=40 (isel indexes by position, not by coordinate value).

da.isel(lat=60, lon=40).plot()

Replace nans with where()

ds.sst.where(ds.sst.notnull(), -99)  
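
A small sketch of the pattern on a toy array: `where()` keeps values where the condition is True and fills everything else with the second argument.

```python
import numpy as np
import xarray as xr

sst = xr.DataArray([20.5, np.nan, 21.0], dims="x")

# Keep non-NaN values, replace NaNs with -99
filled = sst.where(sst.notnull(), -99)
print(filled.values)  # [ 20.5 -99.   21. ]
```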

Working with xarray attributes

Datasets can have attributes and DataArrays within Datasets can have their own attributes, such as units and long_name.

View all attributes for a dataset

ds.attrs  

Return a specific attribute for a dataset

ds.attrs["citation"]  

Return attributes for a DataArray in dataset ds

ds.sst.attrs  

Set an arbitrary attribute on a dataarray

ds.sst.attrs["my_custom_attribute"] = "Foo Bar"  

Doing math with xarray

x - x.mean(dim='time')  

Rolling mean operation over time with a window size of 7

ds.sst.rolling(time=7).mean()
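
A sketch on synthetic data (window of 3 to keep it short): by default the window is not centered, so the first `window - 1` values come out as NaN.

```python
import numpy as np
import xarray as xr

sst = xr.DataArray(np.arange(10.0), dims="time")

# Trailing window of 3; the first two positions lack a full window
smoothed = sst.rolling(time=3).mean()
print(smoothed.values)  # [nan nan 1. 2. 3. 4. 5. 6. 7. 8.]
```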

Xarray Datasets

xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
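
A minimal sketch of that dict-like container: two variables sharing aligned `time` and `space` dimensions (the variable names here are illustrative).

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2000-01-01", periods=4)
ds = xr.Dataset(
    {
        "sst": (("time", "space"), np.random.rand(4, 3)),
        "precip": (("time", "space"), np.random.rand(4, 3)),
    },
    coords={"time": times, "space": ["IA", "IL", "IN"]},
)

print(list(ds.data_vars))  # ['sst', 'precip']
print(dict(ds.sizes))      # {'time': 4, 'space': 3}
```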

Xarray metadata

Xarray draws a firm line between

  1. Labels that the library understands: dims and coords
  2. Labels for users and user code: attrs

This also means that attrs are not automatically propagated through most operations unless explicitly flagged.
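
A quick demonstration of that flag, assuming current default behavior (arithmetic drops attrs unless `keep_attrs` is set):

```python
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims="x", attrs={"units": "degC"})

# By default, arithmetic drops the attrs
dropped = (da * 2).attrs
print(dropped)  # {}

# keep_attrs=True asks xarray to carry them through operations
with xr.set_options(keep_attrs=True):
    kept = (da * 2).attrs
print(kept)  # {'units': 'degC'}
```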

Xarray and Pandas

Convert from xarray to Pandas

An explicit dtype conversion is needed; otherwise the Series would hold 0-d DataArray objects rather than plain floats:

arr = xr.DataArray([1, 2, 3])  

pd.Series({'x': arr[0], 'mean': arr.mean(), 'std': arr.std()}, dtype=float)

# Returns: 
x       1.000000
mean    2.000000
std     0.816497
dtype: float64

# Alternatively, use the item method or float constructor   
pd.Series({'x': arr[0].item(), 'mean': float(arr.mean())}) 

# Returns:    
x       1.0
mean    2.0
dtype: float64 

Convert xarray to series and back while keeping labels

series = data.to_series()

series.to_xarray()  
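
A runnable sketch of the round trip: `to_series()` flattens the array into a Series with a MultiIndex built from the coordinates, and `to_xarray()` restores the labeled array.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    coords={"time": pd.date_range("2000-01-01", periods=2),
            "space": ["IA", "IL", "IN"]},
    dims=["time", "space"],
    name="foo",
)

# Flatten to a MultiIndex Series, then rebuild the DataArray
series = data.to_series()
roundtrip = series.to_xarray()
print(roundtrip.equals(data))  # True -- dims, coords, and values survive
```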

When to use Pandas instead of xarray

From xarray docs:
That said, you should only bother with xarray if some aspect of data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, pandas is usually the right choice: it has better performance for common operations such as groupby and you’ll find far more usage examples online.

XClim

xclim is a library of functions to compute climate indices. It is built using xarray and can benefit from the parallelization handling provided by dask. Its objective is to make it as simple as possible for users to compute indices from large climate datasets and for scientists to write new indices with very little boilerplate.

xclim is built on very powerful multiprocessing and distributed computation libraries, notably xarray and dask.

xclim and xarray Workflow Examples

Xarray usage with image data and netCDF

Storing JPEG compressed images with xarray and netCDF - Github - In order to interact with the netCDF library we use the open source package xarray. NetCDF allows us to store n-dim arrays with labeled coordinates, but for this example we won't use any labels; we simply store the image as a numpy array. To reduce memory we use xarray's zlib compression.
The one cool trick: since xarray does not yet support JPEG compression, we had to come up with an alternative. Instead of storing the image as an n-dim array, we apply JPEG compression as if we were going to store a jpeg file. But instead of writing the resulting byte string to disk, we put it in a numpy array and create an xarray dataset with it. This dataset can be stored in a netCDF file with as much metadata and as many labels as you need.

Articles

Thoughts on the state of Xarray within the broader scientific Python ecosystem