Speeding up reading of very large netcdf file in python

I highly recommend that you take a look at the xarray and dask projects. Using these powerful tools will allow you to easily split up the computation in chunks. This brings up two advantages: you can compute on data which does not fit in memory, and you can use all of the cores in your machine for better performance. You can optimize the performance by appropriately choosing the chunk size (see documentation).

You can load your data from netCDF by doing something as simple as

import xarray as xr
ds = xr.open_dataset(path_file)

If you want to chunk your data in years along the time dimension, then you specify the chunks parameter (assuming that the year coordinate is named 'year'):

ds = xr.open_dataset(path_file, chunks={'year': 10})

Since the other coordinates do not appear in the chunks dict, then a single chunk will be used for them. (See more details in the documentation here.). This will be useful for your first requirement, where you want to multiply each year by a 2D array. You would simply do:

ds['new_var'] = ds['var_name'] * arr_2d

Now, xarray and dask are computing your result lazily. In order to trigger the actual computation, you can simply ask xarray to save your result back to netCDF:

ds.to_netcdf(new_file)

The computation gets triggered through dask, which takes care of splitting the processing out in chunks and thus enables working with data that does not fit in memory. In addition, dask will take care of using all your processor cores for computing chunks.

The xarray and dask projects still don't handle nicely situations where chunks do not "align" well for parallel computation. Since in this case we chunked only in the 'year' dimension, we expect to have no issues.

If you want to add two different netCDF files together, it is as simple as:

ds1 = xr.open_dataset(path_file1, chunks={'year': 10})
ds2 = xr.open_dataset(path_file2, chunks={'year': 10})
(ds1 + ds2).to_netcdf(new_file)

I have provided a fully working example using a dataset available online.

In [1]:

import xarray as xr
import numpy as np

# Load sample data and strip out most of it:
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks = {'time': 4})
ds.attrs = {}
ds = ds[['latitude', 'longitude', 'time', 'tcw']]
ds

Out[1]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...

In [2]:

arr2d = np.ones((73, 144)) * 3.
arr2d.shape

Out[2]:

(73, 144)

In [3]:

myds = ds
myds['new_var'] = ds['tcw'] * arr2d

In [4]:

myds

Out[4]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [5]:

myds.to_netcdf('myds.nc')
xr.open_dataset('myds.nc')

Out[5]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [6]:

(myds + myds).to_netcdf('myds2.nc')
xr.open_dataset('myds2.nc')

Out[6]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
Data variables:
    tcw        (time, latitude, longitude) float64 20.31 20.31 20.31 20.31 ...
    new_var    (time, latitude, longitude) float64 60.92 60.92 60.92 60.92 ...

Check chunking of file. ncdump -s <infile> will give the answer. If chunk size in time dimension is larger than one, You should read the same amount of years at once, otherwise You are reading several years at once from disk and using only one at a time. How slow is slow? Max few seconds per timestep sounds reasonable for an array of this size. Giving more info on what You do with the data later may give us more guidance on where the problem may be.

Speeding up reading of very large netcdf file in python

Tags:

Python

Numpy

Netcdf

Dask

Python Xarray

Related

Recent Posts