How to join data from multiple netCDF files with xarray in Python?

Thank you @AdrianTompkins and @jhamman. After your comments I realize that, because of the different time periods, I really can't get what I want with xarray.

My main purpose in creating such an array is to gather, in a single N-D array, all the data for different events with the same time duration. That way I can easily compute, for example, composite fields over all events for each time step (hour, day, etc.).
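As a minimal, self-contained sketch of what I mean by composites (the name da_t2m and the dimension name cases are just placeholders for the merged result):

import numpy as np
import xarray as xr

# stand-in for the merged array, with dims (cases, time, latitude, longitude)
da_t2m = xr.DataArray(
    np.random.rand(2, 4, 3, 3),
    dims=('cases', 'time', 'latitude', 'longitude'),
)

# composite field: average over all events for each time step
composite = da_t2m.mean(dim='cases')
print(composite.dims)  # ('time', 'latitude', 'longitude')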

I'm trying to do the same thing I do with NCL. Below is NCL code that works as expected (for me) on the same data:

f = addfiles( (/"eraINTERIM_t2m_201812.nc", "eraINTERIM_t2m_201901.nc"/), "r" )
ListSetType( f, "join" )
temp = f[:]->t2m
printVarSummary( temp )

The final result is a 4-dimensional array, with the new dimension automatically named ncl_join.

However, NCL doesn't respect the time axis: it joins the arrays and gives the resulting time axis the coordinates of the first file, so the time axis becomes useless.

However, as @AdrianTompkins noted, the time periods are different and xarray can't join the data like this. So, to create such an array in Python with xarray, I think the only way is to delete the time coordinate from the arrays; the time dimension would then have only integer indexes (see the sketch below).
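A minimal sketch of that idea, assuming every file covers the same number of time steps (the file names and the dimension name cases are only examples):

import xarray as xr

files = ['eraINTERIM_t2m_201812.nc', 'eraINTERIM_t2m_201901.nc']

# open each event, drop the time coordinate so only integer indexes remain,
# then stack the events along a new 'cases' dimension
datasets = [xr.open_dataset(f).drop_vars('time') for f in files]
ds = xr.concat(datasets, dim='cases')

print(ds['t2m'].dims)  # ('cases', 'time', 'latitude', 'longitude')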

The array given by xarray behaves as @AdrianTompkins described in his small example. Since it keeps the time coordinates of all the merged data, I think the xarray solution is the correct one compared with NCL. But computing composites (as in the example above) wouldn't be as easy as it is with NCL.

In a small test, I print two values from the array merged with xarray:

print( da_t2m[ 0, 0, 0, 0 ].values )
print( da_t2m[ 1, 0, 0, 0 ].values )

which results in

252.11412
nan

For the second case there is no data at the first time step, as expected.

UPDATE: all the answers helped me understand this problem better, so I'm adding an update here to also thank @kmuehlbauer for his answer, noting that his code gives the expected array.

Again, thank you all for your help!

Mateus


The result makes sense if the times are different.

To simplify, forget about the lat-lon dimensions for a moment and imagine you have two files that are simply data at two time slices. The first has data at timesteps 1 and 2, and the second at timesteps 3 and 4. You can't create a combined dataset with a time dimension that spans only two time slices; the time dimension variable has to contain the times 1, 2, 3, 4. So if you say you want a new dimension "cases", the data is combined as a 2-D array and would look like this:

times: 1,2,3,4

cases: 1,2

data: 
               time
          1    2    3    4
cases 1:  x1   x2 
      2:            x3   x4

Think of the netCDF file that would be the equivalent: the time dimension has to span the range of values present in both files. The only way you could combine two files and get (cases: 2, time: 124, latitude: 241, longitude: 480) would be if both files had the same time, lat AND lon values, i.e. pointed to exactly the same region in time-lat-lon space.
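A minimal, self-contained sketch of that behaviour with toy data (one variable, no lat-lon, made-up names):

import xarray as xr

# two toy "files": the same variable at non-overlapping times 1,2 and 3,4
ds1 = xr.Dataset({'x': ('time', [1.0, 2.0])}, coords={'time': [1, 2]})
ds2 = xr.Dataset({'x': ('time', [3.0, 4.0])}, coords={'time': [3, 4]})

# concatenating along a new 'cases' dimension aligns on the union of the times,
# so each case is padded with NaN where it has no data
combined = xr.concat([ds1, ds2], dim='cases')
print(combined['x'].values)
# [[ 1.  2. nan nan]
#  [nan nan  3.  4.]]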

PS: Somewhat off-topic for the question, but if you are just starting a new analysis, why not switch to the new-generation, higher-resolution ERA-5 reanalysis instead? It is now available back to 1979 as well (and will eventually be extended further back), and you can download it straight to your desktop with the Python API scripts from here:

https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset
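A minimal sketch of such a download with the cdsapi package (the variable, dates and file name are only an example; copy the exact request from the CDS web form):

import cdsapi

c = cdsapi.Client()  # needs your CDS API key configured in ~/.cdsapirc

c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'variable': '2m_temperature',
        'year': '2018',
        'month': '12',
        'day': ['01', '02'],
        'time': ['00:00', '12:00'],
        'format': 'netcdf',
    },
    'era5_t2m_201812.nc',
)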


Extending on my comment, I would try this:

import xarray as xr

def preproc(ds):
    # keep the original times in 'stime' and turn 'time' into a plain
    # integer-indexed dimension 'ntime'
    ds = ds.assign({'stime': (['time'], ds.time.values)}).drop_vars('time').rename({'time': 'ntime'})
    # we might need to tweak this a bit further, depending on the actual data layout
    return ds

# newer xarray versions also require combine='nested' when concat_dim is given
DS = xr.open_mfdataset('eraINTERIM_t2m_*.nc', concat_dim='cases',
                       combine='nested', preprocess=preproc)

The good thing here is that you keep the original time coordinate in stime while renaming the original dimension (time -> ntime).

If everything works well, you should get the resulting dimensions (cases, ntime, latitude, longitude).

Disclaimer: I do something similar in a loop with a final concat (which works very well), but I did not test the preprocess approach.
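For reference, a minimal sketch of that loop-plus-concat variant (untested against the actual files; it assumes all files have the same number of time steps):

import glob
import xarray as xr

datasets = []
for fname in sorted(glob.glob('eraINTERIM_t2m_*.nc')):
    ds = xr.open_dataset(fname)
    # same idea as preproc(): keep the original times in 'stime', use an integer 'ntime' dimension
    ds = ds.assign({'stime': (['time'], ds.time.values)}).drop_vars('time').rename({'time': 'ntime'})
    datasets.append(ds)

# a single concat at the end along the new 'cases' dimension
DS = xr.concat(datasets, dim='cases')
print(DS['t2m'].dims)  # expected: ('cases', 'ntime', 'latitude', 'longitude')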