You can run this notebook in a live session Binder or view it on Github.

66b019dbfc2242f38ed7c5eaeef6c542

Xarray in 45 minutes

In this lesson, we discuss cover the basics of Xarray data structures. By the end of the lesson, we will be able to:

  • Understand the basic data structures in Xarray

  • Inspect DataArray and Dataset objects.

  • Read and write netCDF files using Xarray.

  • Understand that there are many packages that build on top of xarray

A practical example

[1]:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

%matplotlib inline
[2]:
# load tutorial dataset
ds = xr.tutorial.load_dataset("air_temperature")

What’s in a dataset? many DataArrays

[3]:
# dataset repr
ds
[3]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Datasets are dict-like containers of DataArrays i.e. they are a mapping of variable name to DataArray.

[4]:
# pull out "air" dataarray with dictionary syntax
ds["air"]
[4]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
...
        [293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
         294.29   ],
        [296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
         294.38998],
        [297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
         295.19   ]],

       [[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
         241.79   ],
        [249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
         241.68999],
        [262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
         246.29   ],
        ...,
        [293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
         294.69   ],
        [296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
         295.19   ],
        [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
         295.69   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

You can save some typing by using the “attribute” or “dot” notation. This won’t work for variable names that clash with a built-in method name (like mean for example).

[5]:
# pull out dataarray using dot notation
ds.air  ## same as ds["air"]
[5]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
...
        [293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
         294.29   ],
        [296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
         294.38998],
        [297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
         295.19   ]],

       [[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
         241.79   ],
        [249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
         241.68999],
        [262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
         246.29   ],
        ...,
        [293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
         294.69   ],
        [296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
         295.19   ],
        [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
         295.69   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

What’s in a DataArray? data + (a lot of) metadata

Named dimensions .dims

[6]:
ds.air.dims
[6]:
('time', 'lat', 'lon')

Coordinate variables or “tick labels” (.coords)

.coords is a simple data container for coordinate variables.

[7]:
ds.air.coords
[7]:
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

Coordinates objects support similar indexing notation

[8]:
# extracting coordinate variables
ds.air.lon
[8]:
<xarray.DataArray 'lon' (lon: 53)>
array([200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5,
       225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5,
       250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5,
       275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5,
       300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5,
       325. , 327.5, 330. ], dtype=float32)
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Attributes:
    standard_name:  longitude
    long_name:      Longitude
    units:          degrees_east
    axis:           X
[9]:
# extracting coorindate variables from .coords
ds.coords["lon"]
[9]:
<xarray.DataArray 'lon' (lon: 53)>
array([200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5,
       225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5,
       250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5,
       275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5,
       300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5,
       325. , 327.5, 330. ], dtype=float32)
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Attributes:
    standard_name:  longitude
    long_name:      Longitude
    units:          degrees_east
    axis:           X

It is useful to think of the values in these coordinate variables as axis “labels” such as “tick labels” in a figure. These are coordinate locations on a grid at which you have data.

Arbitrary attributes (.attrs)

.attrs is a dictionary that can contain arbitrary python objects. Your only limitation is that some attributes may not be writeable to a netCDF file

[10]:
ds.air.attrs
[10]:
{'long_name': '4xDaily Air temperature at sigma level 995',
 'units': 'degK',
 'precision': 2,
 'GRIB_id': 11,
 'GRIB_name': 'TMP',
 'var_desc': 'Air temperature',
 'dataset': 'NMC Reanalysis',
 'level_desc': 'Surface',
 'statistic': 'Individual Obs',
 'parent_stat': 'Other',
 'actual_range': array([185.16, 322.1 ], dtype=float32)}
[11]:
# assign your own attribute
ds.air.attrs["who_is_awesome"] = "xarray"
ds.air.attrs
[11]:
{'long_name': '4xDaily Air temperature at sigma level 995',
 'units': 'degK',
 'precision': 2,
 'GRIB_id': 11,
 'GRIB_name': 'TMP',
 'var_desc': 'Air temperature',
 'dataset': 'NMC Reanalysis',
 'level_desc': 'Surface',
 'statistic': 'Individual Obs',
 'parent_stat': 'Other',
 'actual_range': array([185.16, 322.1 ], dtype=float32),
 'who_is_awesome': 'xarray'}

Underlying data (.data)

Xarray structures wrap underlying simpler data structures. In this case, the underlying data is a numpy array which you may be familiar with.

This part of xarray is quite extensible allowing for GPU arrays, sparse arrays, arrays with units etc. See the demo at the end.

[12]:
ds.air.data
[12]:
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
        ...,
        [296.4    , 295.9    , 296.19998, ..., 295.4    , 295.1    ,
         294.79   ],
        [296.19998, 296.69998, 296.79   , ..., 295.6    , 295.5    ,
         295.1    ],
        [296.29   , 297.19998, 297.4    , ..., 296.4    , 296.4    ,
         296.6    ]],

       [[242.29999, 242.2    , 242.29999, ..., 234.29999, 236.09999,
         238.7    ],
        [244.59999, 244.39   , 244.     , ..., 230.29999, 232.     ,
         235.7    ],
        [256.19998, 255.5    , 254.2    , ..., 231.2    , 233.2    ,
         238.2    ],
        ...,
        [295.6    , 295.4    , 295.4    , ..., 296.29   , 295.29   ,
         295.     ],
        [296.19998, 296.5    , 296.29   , ..., 296.4    , 296.     ,
         295.6    ],
        [296.4    , 296.29   , 296.4    , ..., 297.     , 297.     ,
         296.79   ]],

       ...,

       [[243.48999, 242.98999, 242.09   , ..., 244.18999, 244.48999,
         244.89   ],
        [249.09   , 248.98999, 248.59   , ..., 240.59   , 241.29   ,
         242.68999],
        [262.69   , 262.19   , 261.69   , ..., 239.39   , 241.68999,
         245.18999],
        ...,
        [294.79   , 295.29   , 297.49   , ..., 295.49   , 295.38998,
         294.69   ],
        [296.79   , 297.88998, 298.29   , ..., 295.49   , 295.49   ,
         294.79   ],
        [298.19   , 299.19   , 298.79   , ..., 296.09   , 295.79   ,
         295.79   ]],

       [[245.79   , 244.79   , 243.48999, ..., 243.29   , 243.98999,
         244.79   ],
        [249.89   , 249.29   , 248.48999, ..., 241.29   , 242.48999,
         244.29   ],
        [262.38998, 261.79   , 261.29   , ..., 240.48999, 243.09   ,
         246.89   ],
        ...,
        [293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
         294.29   ],
        [296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
         294.38998],
        [297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
         295.19   ]],

       [[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
         241.79   ],
        [249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
         241.68999],
        [262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
         246.29   ],
        ...,
        [293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
         294.69   ],
        [296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
         295.19   ],
        [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
         295.69   ]]], dtype=float32)
[13]:
# what is the type of the underlying data
type(ds.air.data)
[13]:
numpy.ndarray

A numpy array!

aad537922bd74e24aac5a449917ae15d

Review

Xarray provides two main data structures

  • DataArrays that wrap underlying data containers (e.g. numpy arrays) and contain associated metadata

  • Datasets that are dict-like containers of DataArrays

For more see


Why xarray? Use metadata for fun and profit papers!

Analysis without xarray: X(

[14]:
# plot the first timestep
lat = ds.air.lat.data  # numpy array
lon = ds.air.lon.data  # numpy array
temp = ds.air.data  # numpy array
plt.figure()
plt.pcolormesh(lon, lat, temp[0, :, :])
<ipython-input-1-0f3446d5e912>:6: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.
  plt.pcolormesh(lon, lat, temp[0, :, :])
[14]:
<matplotlib.collections.QuadMesh at 0x7ff8da2091c0>
../_images/oceanhackweek-2020_xarray-oceanhackweek20_27_2.png
[15]:
temp.mean(axis=1)  ## what did I just do? I can't tell by looking at this line.
[15]:
array([[279.39798, 279.6664 , 279.66122, ..., 279.9508 , 280.31522,
        280.6624 ],
       [279.05722, 279.538  , 279.7296 , ..., 279.77563, 280.27002,
        280.79764],
       [279.0104 , 279.2808 , 279.5508 , ..., 279.682  , 280.19763,
        280.81403],
       ...,
       [279.63   , 279.934  , 280.534  , ..., 279.802  , 280.346  ,
        280.77798],
       [279.398  , 279.66602, 280.31796, ..., 279.766  , 280.34198,
        280.834  ],
       [279.27   , 279.354  , 279.88202, ..., 279.42596, 279.96997,
        280.48196]], dtype=float32)

Analysis with xarray =)

How readable is this code?

[16]:
ds.air.isel(time=1).plot(x="lon")
[16]:
<matplotlib.collections.QuadMesh at 0x7ff8da0b91f0>
../_images/oceanhackweek-2020_xarray-oceanhackweek20_30_1.png

Use dimension names instead of axis numbers

[17]:
ds.air.mean("time")
[17]:
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[260.37564, 260.1826 , 259.88593, ..., 250.81511, 251.93733,
        253.43741],
       [262.7337 , 262.7936 , 262.7489 , ..., 249.75496, 251.5852 ,
        254.35849],
       [264.7681 , 264.3271 , 264.0614 , ..., 250.60707, 253.58247,
        257.71475],
       ...,
       [297.64932, 296.95294, 296.62912, ..., 296.81033, 296.28793,
        295.81622],
       [298.1287 , 297.93646, 297.47006, ..., 296.8591 , 296.77686,
        296.44348],
       [298.36594, 298.38593, 298.11386, ..., 297.33777, 297.28104,
        297.30502]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0

Extracting data or “indexing” : .sel, .isel

Xarray supports

  • label-based indexing using .sel

  • position-based indexing using .isel

For more see https://xarray.pydata.org/en/stable/indexing.html

Label-based indexing

Xarray inherits its label-based indexing rules from pandas; this means great support for dates and times!

[18]:
# pull out data for all of 2013-May
ds.sel(time="2013-05")
[18]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 124)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-05-01 ... 2013-05-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 259.2 259.3 259.1 ... 298.2 297.6 297.5
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
[19]:
# demonstrate slicing
ds.sel(time=slice("2013-05", "2013-07"))
[19]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 368)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-05-01 ... 2013-07-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 259.2 259.3 259.1 ... 299.4 299.5 299.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
[20]:
# demonstrate "nearest" indexing
ds.sel(lon=240.2, method="nearest")
[20]:
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
    lon      float32 240.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat) float32 239.6 237.2 240.1 249.0 ... 294.8 296.9 298.4
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
[21]:
# "nearest indexing at multiple points"
ds.sel(lon=[240.125, 234], lat=[40.3, 50.3], method="nearest")
[21]:
<xarray.Dataset>
Dimensions:  (lat: 2, lon: 2, time: 2920)
Coordinates:
  * lat      (lat) float32 40.0 50.0
  * lon      (lon) float32 240.0 235.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 268.1 283.0 265.5 ... 285.2 256.8 268.6
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Position-based indexing

This is similar to your usual numpy array[0, 2, 3] but with the power of named dimensions!

[22]:
# pull out time index 0 and lat index 0
ds.air.isel(time=0, lat=0)  #  much better than ds.air[0, 0, :]
[22]:
<xarray.DataArray 'air' (lon: 53)>
array([241.2    , 242.5    , 243.5    , 244.     , 244.09999, 243.89   ,
       243.59999, 243.09999, 242.5    , 241.89   , 241.2    , 240.29999,
       239.5    , 238.79999, 238.5    , 238.7    , 239.59999, 241.     ,
       242.89   , 244.79999, 246.5    , 247.79999, 248.59999, 249.     ,
       249.09999, 249.09999, 249.     , 248.89   , 248.7    , 248.59999,
       248.39   , 248.29999, 248.29999, 248.59999, 249.     , 249.5    ,
       249.59999, 249.09999, 247.79999, 245.39   , 242.2    , 238.5    ,
       234.7    , 231.29999, 228.79999, 227.29999, 227.     , 227.5    ,
       228.79999, 230.59999, 232.79999, 235.5    , 238.59999],
      dtype=float32)
Coordinates:
    lat      float32 75.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:       4xDaily Air temperature at sigma level 995
    units:           degK
    precision:       2
    GRIB_id:         11
    GRIB_name:       TMP
    var_desc:        Air temperature
    dataset:         NMC Reanalysis
    level_desc:      Surface
    statistic:       Individual Obs
    parent_stat:     Other
    actual_range:    [185.16 322.1 ]
    who_is_awesome:  xarray
[23]:
# demonstrate slicing
ds.air.isel(lat=slice(10))
[23]:
<xarray.DataArray 'air' (time: 2920, lat: 10, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [274.79   , 275.19998, 275.6    , ..., 277.19998, 277.     ,
         277.     ],
        [275.9    , 276.9    , 276.9    , ..., 280.9    , 280.5    ,
         279.69998],
        [276.69998, 277.4    , 277.69998, ..., 283.29   , 284.1    ,
         283.9    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
...
        [275.59   , 276.29   , 277.49   , ..., 275.19   , 275.79   ,
         276.59   ],
        [276.88998, 277.88998, 278.69   , ..., 273.59   , 274.29   ,
         275.29   ],
        [276.79   , 277.29   , 278.29   , ..., 274.19   , 275.38998,
         276.88998]],

       [[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
         241.79   ],
        [249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
         241.68999],
        [262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
         246.29   ],
        ...,
        [274.29   , 274.49   , 275.59   , ..., 274.69   , 274.99   ,
         275.38998],
        [276.79   , 277.49   , 277.99   , ..., 273.19   , 273.59   ,
         274.19   ],
        [276.88998, 277.29   , 277.59   , ..., 273.79   , 274.99   ,
         276.19   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:       4xDaily Air temperature at sigma level 995
    units:           degK
    precision:       2
    GRIB_id:         11
    GRIB_name:       TMP
    var_desc:        Air temperature
    dataset:         NMC Reanalysis
    level_desc:      Surface
    statistic:       Individual Obs
    parent_stat:     Other
    actual_range:    [185.16 322.1 ]
    who_is_awesome:  xarray

Concepts for computation

Broadcasting: expanding data

Let’s try to calculate grid cell area associated with the air temperature data. We may want this to make a proper area-weighted domain-average for example

A very approximate formula is

\begin{equation} Δlat \times Δlon \times \cos(\text{latitude}) \end{equation}

assuming that \(Δlon\) = 111km and \(Δlat\) = 111km

[24]:
dlon = np.cos(ds.air.lat * np.pi / 180) * 111e3
dlon
[24]:
<xarray.DataArray 'lat' (lat: 25)>
array([ 28728.904,  33378.348,  37964.227,  42477.863,  46910.63 ,
        51254.094,  55499.996,  59640.254,  63666.984,  67572.516,
        71349.42 ,  74990.51 ,  78488.85 ,  81837.78 ,  85030.93 ,
        88062.22 ,  90925.875,  93616.445,  96128.82 ,  98458.2  ,
       100600.164, 102550.625, 104305.88 , 105862.58 , 107217.766],
      dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
[25]:
dlat = 111e3 * xr.ones_like(ds.air.lon)
dlat
[25]:
<xarray.DataArray 'lon' (lon: 53)>
array([111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000., 111000., 111000., 111000.,
       111000., 111000., 111000., 111000.], dtype=float32)
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
[26]:
cell_area = dlon * dlat
cell_area
[26]:
<xarray.DataArray (lat: 25, lon: 53)>
array([[3.1889083e+09, 3.1889083e+09, 3.1889083e+09, ..., 3.1889083e+09,
        3.1889083e+09, 3.1889083e+09],
       [3.7049966e+09, 3.7049966e+09, 3.7049966e+09, ..., 3.7049966e+09,
        3.7049966e+09, 3.7049966e+09],
       [4.2140291e+09, 4.2140291e+09, 4.2140291e+09, ..., 4.2140291e+09,
        4.2140291e+09, 4.2140291e+09],
       ...,
       [1.1577953e+10, 1.1577953e+10, 1.1577953e+10, ..., 1.1577953e+10,
        1.1577953e+10, 1.1577953e+10],
       [1.1750746e+10, 1.1750746e+10, 1.1750746e+10, ..., 1.1750746e+10,
        1.1750746e+10, 1.1750746e+10],
       [1.1901172e+10, 1.1901172e+10, 1.1901172e+10, ..., 1.1901172e+10,
        1.1901172e+10, 1.1901172e+10]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0

The result has two dimensions because xarray realizes that dimensions lon and lat are different so it automatically “broadcasts” to get a 2D result. See the last row in this image from Jake VanderPlas Python Data Science Handbook

9dfe08b044624ab89700e39205ee8875

Because xarray knows about dimension names we avoid having to create unnecessary size-1 dimensions using np.newaxis or .reshape. For more, see https://xarray.pydata.org/en/stable/computation.html#broadcasting-by-dimension-name


Alignment: putting data on the same grid

When doing arithmetic operations xarray automatically “aligns” i.e. puts the data on the same grid. In this case cell_area and ds.air are at the same lat, lon points so things are multiplied as you would expect

[27]:
(cell_area * ds.air.isel(time=1))
[27]:
<xarray.DataArray (lat: 25, lon: 53)>
array([[7.72034658e+11, 7.73948047e+11, 7.75223575e+11, ...,
        7.39826729e+11, 7.44928969e+11, 7.51944532e+11],
       [9.02537150e+11, 9.04389657e+11, 9.04760132e+11, ...,
        8.55854219e+11, 8.61411738e+11, 8.73267659e+11],
       [1.06699214e+12, 1.06568581e+12, 1.06235671e+12, ...,
        9.72597887e+11, 9.83512252e+11, 1.00504594e+12],
       ...,
       [3.43170535e+12, 3.42591642e+12, 3.42938957e+12, ...,
        3.42012723e+12, 3.41665409e+12, 3.41306507e+12],
       [3.48057082e+12, 3.48644626e+12, 3.48750401e+12, ...,
        3.47352072e+12, 3.47234553e+12, 3.46764529e+12],
       [3.52619830e+12, 3.53702799e+12, 3.53940852e+12, ...,
        3.52750718e+12, 3.52750718e+12, 3.52988771e+12]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01T06:00:00

Now lets make cell_area unaligned i.e. change the coordinate labels

[28]:
# make a copy of cell_area
# then add 1e-5 to lat
cell_area_bad = cell_area.copy(deep=True)
cell_area_bad["lat"] = cell_area.lat + 1e-5
cell_area_bad
[28]:
<xarray.DataArray (lat: 25, lon: 53)>
array([[3.1889083e+09, 3.1889083e+09, 3.1889083e+09, ..., 3.1889083e+09,
        3.1889083e+09, 3.1889083e+09],
       [3.7049966e+09, 3.7049966e+09, 3.7049966e+09, ..., 3.7049966e+09,
        3.7049966e+09, 3.7049966e+09],
       [4.2140291e+09, 4.2140291e+09, 4.2140291e+09, ..., 4.2140291e+09,
        4.2140291e+09, 4.2140291e+09],
       ...,
       [1.1577953e+10, 1.1577953e+10, 1.1577953e+10, ..., 1.1577953e+10,
        1.1577953e+10, 1.1577953e+10],
       [1.1750746e+10, 1.1750746e+10, 1.1750746e+10, ..., 1.1750746e+10,
        1.1750746e+10, 1.1750746e+10],
       [1.1901172e+10, 1.1901172e+10, 1.1901172e+10, ..., 1.1901172e+10,
        1.1901172e+10, 1.1901172e+10]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
[29]:
cell_area_bad * ds.air.isel(time=1)
[29]:
<xarray.DataArray (lat: 0, lon: 53)>
array([], shape=(0, 53), dtype=float32)
Coordinates:
  * lat      (lat) float64
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01T06:00:00

Tip: If you notice extra NaNs or missing points after xarray computation, it means that your xarray coordinates were not aligned exactly.

For more, see https://xarray.pydata.org/en/stable/computation.html#automatic-alignment


High level computation: groupby, resample, rolling, coarsen, weighted

Xarray has some very useful high level objects that let you do common computations:

  1. groupby : Bin data in to groups and reduce

  2. resample : Groupby specialized for time axes. Either downsample or upsample your data.

  3. rolling : Operate on rolling windows of your data e.g. running mean

  4. coarsen : Downsample your data

  5. weighted : Weight your data before reducing

groupby

[30]:
# seasonal groups
ds.groupby("time.season")
[30]:
DatasetGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
[31]:
# make a seasonal mean
seasonal_mean = ds.groupby("time.season").mean()
seasonal_mean
[31]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, season: 4)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * season   (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
    air      (season, lat, lon) float32 247.0 247.0 246.7 ... 299.4 299.4 299.5

The seasons are out of order (they are alphabetically sorted). This is a common annoyance. The solution is to use .reindex

[32]:
seasonal_mean = seasonal_mean.reindex(season=["DJF", "MAM", "JJA", "SON"])
seasonal_mean
[32]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, season: 4)
Coordinates:
  * season   (season) object 'DJF' 'MAM' 'JJA' 'SON'
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
    air      (season, lat, lon) float32 247.0 247.0 246.7 ... 299.4 299.4 299.5

resample

[33]:
# resample to monthly frequency
ds.resample(time="M").mean()
[33]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 24)
Coordinates:
  * time     (time) datetime64[ns] 2013-01-31 2013-02-28 ... 2014-12-31
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
    air      (time, lat, lon) float32 244.5 244.7 244.7 ... 297.7 297.7 297.7

weighted

[34]:
# weight by cell_area and take mean over (time, lon)
ds.weighted(cell_area).mean(["lon", "time"]).air.plot()
[34]:
[<matplotlib.lines.Line2D at 0x7ff8da018c10>]
../_images/oceanhackweek-2020_xarray-oceanhackweek20_63_1.png

Visualization: .plot

For more see https://xarray.pydata.org/en/stable/plotting.html and https://xarray.pydata.org/en/stable/examples/visualization_gallery.html

We have seen very simple plots earlier. Xarray has some support for visualizing 3D and 4D datasets by presenting multiple facets (or panels or subplots) showing variations across rows and/or columns.

[35]:
# facet the seasonal_mean
seasonal_mean.air.plot(col="season")
[35]:
<xarray.plot.facetgrid.FacetGrid at 0x7ff8d9f7a730>
../_images/oceanhackweek-2020_xarray-oceanhackweek20_65_1.png
[36]:
# contours
seasonal_mean.air.plot.contour(col="season", levels=20, add_colorbar=True)
[36]:
<xarray.plot.facetgrid.FacetGrid at 0x7ff8d1de5370>
../_images/oceanhackweek-2020_xarray-oceanhackweek20_66_1.png
[37]:
# line plots too? wut
seasonal_mean.air.mean("lon").plot.line(hue="season", y="lat")
[37]:
[<matplotlib.lines.Line2D at 0x7ff8d1ccb490>,
 <matplotlib.lines.Line2D at 0x7ff8d1ccb580>,
 <matplotlib.lines.Line2D at 0x7ff8d1ccb640>,
 <matplotlib.lines.Line2D at 0x7ff8d1ccb700>]
../_images/oceanhackweek-2020_xarray-oceanhackweek20_67_1.png

Reading and writing to disk

Xarray supports many disk formats. Below is a small example using netCDF. For more see https://xarray.pydata.org/en/stable/io.html

[38]:
# write ds to netCDF
ds.to_netcdf("my-example-dataset.nc")
<ipython-input-1-ab674c4a7a43>:2: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
  ds.to_netcdf("my-example-dataset.nc")
[39]:
# read from disk
fromdisk = xr.open_dataset("my-example-dataset.nc")
fromdisk
[39]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
[40]:
# check that the two are identical
ds.identical(fromdisk)
[40]:
True

Tip: A common use case to read datasets that are a collection of many netCDF files. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for how to handle that


More information

  1. A description of common terms used in the xarray documentation: https://xarray.pydata.org/en/stable/terminology.html

  2. For information on how to create a DataArray from an existing numpy array: https://xarray.pydata.org/en/stable/data-structures.html#creating-a-dataarray

  3. Answers to common questions on “how to do X” are here: https://xarray.pydata.org/en/stable/howdoi.html

  4. Our more extensive Scipy 2020 tutorial material: https://xarray-contrib.github.io/xarray-tutorial/

  5. Ryan Abernathey has a book on data analysis with a chapter on Xarray: https://earth-env-data-science.github.io/lectures/xarray/xarray_intro.html


The scientific python / pangeo ecosystem: demo

Xarray ties in to the larger scientific python ecosystem and in turn many packages build on top of xarray. A long list of such packages is here: https://xarray.pydata.org/en/stable/related-projects.html.

Now we will demonstrate some cool features.

Pandas: tabular data structures

You can easily convert between xarray and pandas structures: https://pandas.pydata.org/

This allows you to conveniently use the extensive pandas ecosystem of packages (like seaborn) for your work.

See https://xarray.pydata.org/en/stable/pandas.html

[41]:
# convert to pandas dataframe
df = ds.isel(time=slice(10)).to_dataframe()
df
[41]:
air
lat lon time
75.0 200.0 2013-01-01 00:00:00 241.199997
2013-01-01 06:00:00 242.099991
2013-01-01 12:00:00 242.299988
2013-01-01 18:00:00 241.889999
2013-01-02 00:00:00 243.199997
... ... ... ...
15.0 330.0 2013-01-02 06:00:00 296.899994
2013-01-02 12:00:00 297.100006
2013-01-02 18:00:00 297.600006
2013-01-03 00:00:00 297.000000
2013-01-03 06:00:00 297.100006

13250 rows × 1 columns

[42]:
# convert dataframe to xarray
df.to_xarray()
[42]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 10)
Coordinates:
  * lat      (lat) float64 15.0 17.5 20.0 22.5 25.0 ... 65.0 67.5 70.0 72.5 75.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
    air      (lat, lon, time) float32 296.3 296.3 296.4 ... 244.6 244.8 242.6

xarray can wrap other array types, not just numpy

4cbd4a590b0644478a08a8f663e16bdb

dask : parallel arrays https://xarray.pydata.org/en/stable/dask.html & https://docs.dask.org/en/latest/array.html

8d7c2c80b3f3481f96eb7e0b7af73fea

pydata/sparse : sparse arrays http://sparse.pydata.org

dfe9f02abe394e4f9dfd243c6dffeb11

cupy : GPU arrays http://cupy.chainer.org

df81e7b238334d3d8ea300c8bf2fe32c

pint : unit-aware computations https://pint.readthedocs.org & https://github.com/xarray-contrib/pint-xarray

Xarray + dask

Dask cuts up NumPy arrays into blocks and parallelizes your analysis code across these blocks

a175f7e092184423b1a1af3fd25b7731

[43]:
# make dask cluster; this is for demo purposes
import dask
import distributed

cluster = distributed.LocalCluster()
[44]:
client = distributed.Client(cluster)
client
[44]:

Client

Cluster

  • Workers: 2
  • Cores: 2
  • Memory: 8.36 GB
[45]:
# demonstrate dask dataset
dasky = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={"time": 10},  # 10 time steps in each block
)

dasky.air
[45]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
dask.array<open_dataset-c0813d18c428d35f1eb142400f38ce20air, shape=(2920, 25, 53), dtype=float32, chunksize=(10, 25, 53), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

All computations with dask-backed xarray objects are lazy, allowing you to build up a complicated chain of analysis steps quickly

[46]:
# demonstrate lazy mean
dasky.air.mean("lat")
[46]:
<xarray.DataArray 'air' (time: 2920, lon: 53)>
dask.array<mean_agg-aggregate, shape=(2920, 53), dtype=float32, chunksize=(10, 53), chunktype=numpy.ndarray>
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

To get concrete values, call .compute or .load

[47]:
# "compute" the mean
dasky.air.mean("lat").compute()
[47]:
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[279.39798, 279.6664 , 279.66122, ..., 279.9508 , 280.31522,
        280.6624 ],
       [279.05722, 279.538  , 279.7296 , ..., 279.77563, 280.27002,
        280.79764],
       [279.0104 , 279.2808 , 279.5508 , ..., 279.682  , 280.19763,
        280.81403],
       ...,
       [279.63   , 279.934  , 280.534  , ..., 279.802  , 280.346  ,
        280.77798],
       [279.398  , 279.66602, 280.31796, ..., 279.766  , 280.34198,
        280.834  ],
       [279.27   , 279.354  , 279.88202, ..., 279.42596, 279.96997,
        280.48196]], dtype=float32)
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

holoviews: javascript interactive plots

the hvplot package is a nice easy way to access holoviews functionality. It attaches itself to all xarray objects under the .hvplot namespace. So instead of using .plot use .hvplot

[48]:
import hvplot.xarray

ds.air.hvplot(groupby="time", clim=(270, 300))