Most efficient way to forward-fill NaN values in numpy array

I liked Divakar's answer on pure numpy. Here's a generalized function for n-dimensional arrays:

def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

AFIK pandas can only work with two dimensions, despite having multi-index to make up for it. The only way to accomplish this would be to flatten a DataFrame, unstack desired level, restack, and finally reshape as original. This unstacking/restacking/reshaping, with the pandas sorting involved, is just unnecessary overhead to achieve the same result.

Testing:

def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out

ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffull')
print(np_ffill(ra, 1))
raise SystemExit

Output:

arr
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3. nan  4.  4.  3.]
  [ 3.  2. nan  4. nan nan  3.  4.]
  [ 2.  2.  2. nan  1.  1. nan  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1. nan]
  [ 4.  2. nan  4.  4.  3. nan  4.]
  [ 2.  4.  2.  1.  4.  1.  3. nan]]]

ffull
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3.  4.  4.  4.  3.]
  [ 3.  2.  1.  4.  4.  4.  3.  4.]
  [ 2.  2.  2.  4.  1.  1.  3.  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1.  3.]
  [ 4.  2.  1.  4.  4.  3.  1.  4.]
  [ 2.  4.  2.  1.  4.  1.  3.  4.]]]

Update: As pointed out by financial_physician in the comments, my initially proposed solution can simply be exchanged with ffill on the reversed array and then reversing the result. There is no relevant performance loss. My initial solution seems to be 2% or 3% faster according to %timeit. I updated the code example below but left my initial text as it was.


For those that came here looking for the backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is that you have to do the accumulation on the reversed array using the minimum except for the maximum.

Here is the code:


# ffill along axis 1, as provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

# Simple solution for bfill provided by financial_physician in comment below
def bfill(arr): 
    return ffill(arr[:, ::-1])[:, ::-1]

# My outdated modification of Divakar's answer to do a backward-fill
def bfill_old(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out


# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)

print('\nffill')
print(ffill(arr))

print('\nbfill')
print(bfill(arr))

Output:

Array:
[[ 5. nan nan  7.  2.]
 [ 3. nan  1.  8. nan]
 [ 4.  9.  6. nan nan]]

ffill
[[5. 5. 5. 7. 2.]
 [3. 3. 1. 8. 8.]
 [4. 9. 6. 6. 6.]]

bfill
[[ 5.  7.  7.  7.  2.]
 [ 3.  1.  1.  8. nan]
 [ 4.  9.  6. nan nan]]

Edit: Update according to comment of MS_


Use Numba. This should give a significant speedup:

import numba
@numba.jit
def loops_fill(arr):
    ...

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

In [179]: arr
Out[179]: 
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]: 
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])