You can run this notebook in a live session Binder or view it on Github.

724babf4a2eb493b9a99be9a470adef5

Data structures for multi-dimensional data

In this lesson, we discuss cover the basics of Xarray data structures. Our learning goals are as follows. By the end of the lesson, we will be able to:

  • Understand the basic data structures in Xarray

  • Build and inspect DataArray and Dataset objects.

  • Read and write various data formats (e.g. NetCDF, Zarr) using Xarray.


Introduction

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

Here is an example of how we might structure a dataset for a weather forecast:

a3150d8fa8224c96839b446ed1db5fe0

You’ll notice multiple data variables (temperature, precipitation), coordinate variables (latitude, longitude), and dimensions (x, y, t). We’ll cover how these fit into Xarray’s data structures below.

Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface. For example:

  • Apply operations over dimensions by name: x.sum('time').

  • Select values by label (or logical location) instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').

  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.

  • Easily use the split-apply-combine paradigm with groupby: x.groupby('time.dayofyear').mean().

  • Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').

  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions of size 1 to align arrays (e.g., using np.newaxis).

The immediate payoff of using xarray is that you’ll write less code. The long-term payoff is that you’ll understand what you were thinking when you come back to look at it weeks or months later.

Data structures

Xarray provides two data structures: the DataArray and Dataset. The DataArray class attaches dimension names, coordinates and attributes to multi-dimensional arrays while Dataset combines multiple arrays.

Both classes are most commonly created by reading data, but to understand them let’s first look at creating them programmatically.

DataArray

The DataArray class is used to attach a name, dimension names, labels, and attributes to an array.

As an example, let’s create a DataArray named a with three dimensions (named x, y, z) from a numpy array:

[1]:
import numpy as np
import xarray as xr

rng = np.random.default_rng(seed=0)  # we'll use this later
[2]:
da = xr.DataArray(
    np.ones((3, 4, 2)),
    dims=("x", "y", "z"),
    name="a",
    coords={"z": [-1, 1], "u": ("x", [0.1, 1.2, 2.3])},
    attrs={"attr": "value"},
)

In this case, we used a 3x4 numpy array with all values being equal to 1, but it can be anything that either behaves like a numpy array or can be coerced to a numpy array using `numpy.array <https://numpy.org/doc/stable/reference/generated/numpy.array.html>`__.

We also passed a sequence (a tuple here, but could also be a list) containing the dimension names x, y, and z to dims. In case we have only a single dimension we can also pass just the dimension name. For example:

xr.DataArray([1, 1], dims="x")

The dimension names (and the array’s name) can be anything that fits into a python set (i.e. calling hash() on it doesn’t raise an error), but to be useful they should be strings.

coords is a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings). We will look at its format later.

We can also attach arbitrary metadata (attributes) to the DataArray by passing a dict-like to the attrs parameter.

String representations

Now that we have the DataArray we can look at its string representation.

xarray has two representation types: "html" (which is only available in notebooks) and "text". To choose between them, use the display_style option.

Let’s first look at the text representation:

[3]:
with xr.set_options(display_style="text"):
    display(da)
<xarray.DataArray 'a' (x: 3, y: 4, z: 2)>
array([[[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]])
Coordinates:
  * z        (z) int64 -1 1
    u        (x) float64 0.1 1.2 2.3
Dimensions without coordinates: x, y
Attributes:
    attr:     value

It consists of:

  • the name of the DataArray ('a'). If we didn’t provide a name, this will be omitted

  • the dimensions of the array (x: 3, y: 4, z: 2): this tells us that the first dimension is named x and has a size of 3, the second dimension is named y and has a size of 4, and the third dimension is named z and has a size of 2

  • a preview of the data

  • a (unordered) list of coordinates or dimensions with coordinates with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. Also, if it is a dimension coordinate, it will be marked with a *.

  • a alphabetically sorted list of dimensions without coordinates

  • a (unordered) list of attributes

The "html" representation looks similar:

[4]:
with xr.set_options(display_style="html"):
    display(da)
<xarray.DataArray 'a' (x: 3, y: 4, z: 2)>
array([[[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]])
Coordinates:
  * z        (z) int64 -1 1
    u        (x) float64 0.1 1.2 2.3
Dimensions without coordinates: x, y
Attributes:
    attr:     value

except the data preview was collapsed to a single line (we can expand it by clicking on the symbol on the left) and the dimensions are marked by a bold font instead of a * prefix.

Going forward, we’ll use the HTML representation except when explaining the text representation.

Once we have created the DataArray, we can look at its data:

[5]:
da.data
[5]:
array([[[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]])
[6]:
da.dims
[6]:
('x', 'y', 'z')
[7]:
da.coords
[7]:
Coordinates:
  * z        (z) int64 -1 1
    u        (x) float64 0.1 1.2 2.3
[8]:
da.attrs
[8]:
{'attr': 'value'}

coordinates

As mentioned above, coords is a dict-like mapping names to values. These values can be either

  • another DataArray object

  • a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)

  • a numpy array (or anything that can be coerced to one using numpy.array).

Let’s look at an example:

[9]:
da = xr.DataArray(
    np.ones((3, 4)),
    dims=("x", "y"),
    coords={
        "x": ["a", "b", "c"],
        "y": np.arange(4),
        "u": ("x", np.arange(3), {"attr1": 0}),
    },
)
da
[9]:
<xarray.DataArray (x: 3, y: 4)>
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
Coordinates:
  * x        (x) <U1 'a' 'b' 'c'
  * y        (y) int64 0 1 2 3
    u        (x) int64 0 1 2

We can see that we assigned labels to the x and y dimensions and also created a coordinate named u along x with its own metadata (click on the sheet icon to look at them).

The difference between the dimension labels (dimension coordinates) and normal coordinates is that for now it only is possible to use indexing operations (sel, reindex, etc) with dimension coordinates. Also, while coordinates can have arbitrary dimensions, dimension coordinates have to be one-dimensional.

Exercises

create a DataArray named “height” from random data

  1. with dimensions named “latitude” and “longitude”

[10]:
height = rng.random((180, 360)) * 400
xr.DataArray(
    # your code here
)
[10]:
<xarray.DataArray ()>
array(nan)
  1. with dimension coordinates:

  • “latitude”: -90 to 90 with step size 1

  • “longitude”: -180 to 180 with step size 1

[11]:
xr.DataArray(
    # your code here
)
[11]:
<xarray.DataArray ()>
array(nan)
  1. with metadata for both data and coordinates:

  • height: “type”: “ellipsoid”

  • latitude: “type”: “geodetic”

  • longitude: “prime_meridian”: “greenwich”

[12]:
xr.DataArray(
    # your code here
)
[12]:
<xarray.DataArray ()>
array(nan)

Dataset

Dataset objects collect multiple data variables, each with possibly different dimensions.

The constructor of Dataset takes three parameters:

  • data_vars: dict-like mapping names to values. It has the format described in coordinates except we need to use either DataArray objects or the tuple syntax since we have to provide dimensions

  • coords: same as for DataArray

  • attrs: same as for Dataset

For example, let’s create a Dataset with two variables:

[13]:
ds = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.ones((3, 4))),
        "b": ("t", np.full((8,), 3), {"attr": "value"}),
    },
    coords={"x": [-1, 0, 1],},
    attrs={"attr": "value"},
)

String representations

Let’s again first look at the text representation:

[14]:
with xr.set_options(display_style="text"):
    display(ds)
<xarray.Dataset>
Dimensions:  (t: 8, x: 3, y: 4)
Coordinates:
  * x        (x) int64 -1 0 1
Dimensions without coordinates: t, y
Data variables:
    a        (x, y) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    b        (t) int64 3 3 3 3 3 3 3 3
Attributes:
    attr:     value

It consists of

  • a summary of all dimensions in the dataset and their lengths

  • a unordered list of coordinates (same format as the DataArray)

  • a unordered list of dimensions without coordinates

  • a unordered list of data variables: each item has the same format as the coordinates with the exception of the dimension mark (*)

Again, the HTML representation is similar:

[15]:
with xr.set_options(display_style="html"):
    display(ds)
<xarray.Dataset>
Dimensions:  (t: 8, x: 3, y: 4)
Coordinates:
  * x        (x) int64 -1 0 1
Dimensions without coordinates: t, y
Data variables:
    a        (x, y) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    b        (t) int64 3 3 3 3 3 3 3 3
Attributes:
    attr:     value

coordinates

As with DataArray, a Dataset really becomes useful once we assign coordinates. Here we also introduce using Pandas datetime objects as coordinate values:

[16]:
import pandas as pd

xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.ones((3, 4))),
        "b": (("t", "x"), np.full((8, 3), 3)),
    },
    coords={
        "x": ["a", "b", "c"],
        "y": np.arange(4),
        "t": pd.date_range("2020-07-05", periods=8, freq="D"),
    },
)
[16]:
<xarray.Dataset>
Dimensions:  (t: 8, x: 3, y: 4)
Coordinates:
  * x        (x) <U1 'a' 'b' 'c'
  * y        (y) int64 0 1 2 3
  * t        (t) datetime64[ns] 2020-07-05 2020-07-06 ... 2020-07-11 2020-07-12
Data variables:
    a        (x, y) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    b        (t, x) int64 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

If we have variables with different values along the same dimension, we can’t use the shortcut syntax anymore. Instead, we need to use DataArray objects:

[17]:
x_a = np.arange(1, 4)
x_b = np.arange(-1, 3)

a = xr.DataArray(np.linspace(0, 1, 3), dims="x", coords={"x": x_a})
b = xr.DataArray(np.zeros(4), dims="x", coords={"x": x_b})

xr.Dataset(data_vars={"a": a, "b": b})
[17]:
<xarray.Dataset>
Dimensions:  (x: 5)
Coordinates:
  * x        (x) int64 -1 0 1 2 3
Data variables:
    a        (x) float64 nan nan 0.0 0.5 1.0
    b        (x) float64 0.0 0.0 0.0 0.0 nan

which combines the coordinates and fills in floating-point nan values for missing data (converting the data type to float in the process). For example, b doesn’t have a value for x == 3 so nan was used.

Exercises

  1. create a Dataset with two variables along latitude and longitude: height and gravity_anomaly

[18]:
height = rng.random((180, 360)) * 400
gravity_anomaly = rng.random((180, 360)) * 400 - 200

xr.Dataset(
    # your code here
)
[18]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
  1. add coordinates to latitude and longitude:

  • latitude: from -90 to 90 with step size 1

  • longitude: from -180 to 180 with step size 1

[19]:
xr.Dataset(
    # your code here
)
[19]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
  1. add metadata to coordinates and variables:

  • latitude: “type”: “geodetic”

  • longitude: “prime_meridian”: “greenwich”

  • height: “ellipsoid”: “wgs84”

  • gravity_anomaly: “ellipsoid”: “grs80”

[20]:
xr.Dataset(
    # your code here
)
[20]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

Roundtripping and I/O

DataArray and Dataset objects are frequently created by converting from other libraries such as pandas or by reading from data storage formats such as NetCDF or zarr.

To convert from / to pandas, we can use the to_xarray methods on pandas objects or the to_pandas methods on xarray objects:

[21]:
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series
[21]:
a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64
[22]:
arr = series.to_xarray()
arr
[22]:
<xarray.DataArray (index: 10)>
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Coordinates:
  * index    (index) object 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
[23]:
arr.to_pandas()
[23]:
index
a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64

We can also control what pandas object is used by calling to_series / to_dataframe:

[24]:
ds = xr.Dataset(
    data_vars={"a": ("x", np.arange(5)), "b": (("x", "y"), np.ones((5, 4)))}
)

to_series: This will always convert DataArray objects to pandas.Series, using a MultiIndex for higher dimensions

[25]:
ds.a.to_series()
[25]:
x
0    0
1    1
2    2
3    3
4    4
Name: a, dtype: int64
[26]:
ds.b.to_series()
[26]:
x  y
0  0    1.0
   1    1.0
   2    1.0
   3    1.0
1  0    1.0
   1    1.0
   2    1.0
   3    1.0
2  0    1.0
   1    1.0
   2    1.0
   3    1.0
3  0    1.0
   1    1.0
   2    1.0
   3    1.0
4  0    1.0
   1    1.0
   2    1.0
   3    1.0
Name: b, dtype: float64

to_dataframe: This will always convert DataArray or Dataset objects to a pandas.DataFrame. Note that DataArray objects have to be named for this.

[27]:
ds.a.to_dataframe()
[27]:
a
x
0 0
1 1
2 2
3 3
4 4

Since columns in a DataFrame need to have the same index, they are broadcasted.

[28]:
ds.to_dataframe()
[28]:
a b
x y
0 0 0 1.0
1 0 1.0
2 0 1.0
3 0 1.0
1 0 1 1.0
1 1 1.0
2 1 1.0
3 1 1.0
2 0 2 1.0
1 2 1.0
2 2 1.0
3 2 1.0
3 0 3 1.0
1 3 1.0
2 3 1.0
3 3 1.0
4 0 4 1.0
1 4 1.0
2 4 1.0
3 4 1.0

I/O

One of Xarray’s most widely used features is its ability to read from and write to a variety of data formats. For example, Xarray can read the following formats:

  • NetCDF / GRIB (via open_dataset / open_mfdataset, to_netcdf / save_mfdataset)

  • Zarr (via open_zarr, to_zarr)

  • GeoTIFF / GDAL rasters (via open_rasterio)

NetCDF

The recommended way to store xarray data structures is NetCDF, which is a binary file format for self-described datasets that originated in the geosciences. Xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects.

Xarray reads and writes to NetCDF files using the open_dataset / open_dataarray functions and the to_netcdf method.

Let’s first create some datasets and write them to disk using to_netcdf, which takes the path we want to write to:

[29]:
ds1 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(4, 2)),
        "b": (("z", "x"), np.random.randn(6, 4)),
    },
    coords={"x": np.arange(4), "y": np.arange(-2, 0), "z": np.arange(-3, 3),},
)
ds2 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(7, 3)),
        "b": (("z", "x"), np.random.randn(2, 7)),
    },
    coords={"x": np.arange(6, 13), "y": np.arange(3), "z": np.arange(3, 5),},
)

# write datasets
ds1.to_netcdf("ds1.nc")
ds2.to_netcdf("ds2.nc")

# write dataarray
ds1.a.to_netcdf("da1.nc")

Reading those files is just as simple:

[30]:
xr.open_dataset("ds1.nc")
[30]:
<xarray.Dataset>
Dimensions:  (x: 4, y: 2, z: 6)
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 -2 -1
  * z        (z) int64 -3 -2 -1 0 1 2
Data variables:
    a        (x, y) float64 0.7848 0.6376 1.038 -1.265 ... -0.03825 0.2744 2.287
    b        (z, x) float64 -0.1233 -0.3147 1.052 ... -1.949 -1.866 0.7951
[31]:
xr.open_dataarray("da1.nc")
[31]:
<xarray.DataArray 'a' (x: 4, y: 2)>
array([[ 0.784845,  0.637612],
       [ 1.038295, -1.264836],
       [-0.305506, -0.038254],
       [ 0.274398,  2.286821]])
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 -2 -1

Zarr

Zarr is a Python package and data format providing an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows xarray to leverage these capabilities.

Zarr files can be written with:

[32]:
ds1.to_zarr("ds1.zarr", mode="w")
[32]:
<xarray.backends.zarr.ZarrStore at 0x7fdbbacb2450>
or to any `MutableMapping` interface:
[33]:
mystore = {}

ds1.to_zarr(store=mystore)
[33]:
<xarray.backends.zarr.ZarrStore at 0x7fdbbacc47c0>

We can then read the created file with:

[34]:
xr.open_zarr("ds1.zarr", chunks=None)
[34]:
<xarray.Dataset>
Dimensions:  (x: 4, y: 2, z: 6)
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 -2 -1
  * z        (z) int64 -3 -2 -1 0 1 2
Data variables:
    a        (x, y) float64 ...
    b        (z, x) float64 ...

setting the chunks parameter to None avoids dask (more on that in a later session)