Pandas Dataframe: Replacing NaN with row average

As commented the axis argument to fillna is NotImplemented.

df.fillna(df.mean(axis=1), axis=1)

Note: this would be critical here as you don't want to fill in your nth columns with the nth row average.

For now you'll need to iterate through:

m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)

print(df)

   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0

An alternative is to fillna the transpose and then transpose, which may be more efficient...

df.T.fillna(df.mean(axis=1)).T

As an alternative, you could also use an apply with a lambda expression like this:

df.apply(lambda row: row.fillna(row.mean()), axis=1)

yielding also

    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0

For an efficient solution, use `DataFrame.where`:

We could use where on axis=0:

df.where(df.notna(), df.mean(axis=1), axis=0)

or mask on axis=0:

df.mask(df.isna(), df.mean(axis=1), axis=0)

By using axis=0, we can fill in the missing values in each column with the row averages.

These methods perform very similarly (where does slightly better on large DataFrames (300_000, 20)) and is ~35-50% faster than the numpy methods posted here and is 110x faster than the double transpose method.

Some benchmarks:

df = creator()

>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)

Pandas Dataframe: Replacing NaN with row average

For an efficient solution, use `DataFrame.where`:

Tags:

Python

Pandas

Dataframe

Missing Data

Related

Recent Posts

Pandas Dataframe: Replacing NaN with row average

For an efficient solution, use DataFrame.where:

Tags:

Python

Pandas

Dataframe

Missing Data

Related

For an efficient solution, use `DataFrame.where`: