How to transform some columns only with SimpleImputer or equivalent

This is the method I use; you can replace low_cardinality_cols with whichever columns you want to encode. It also works for every text column if you set the unique-value threshold above the largest per-column count, i.e. df.nunique().max().

# Check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns if df[cname].nunique() < 16 and 
                        df[cname].dtype == "object"]

Why these columns? It is commonly recommended to encode only columns whose cardinality is low, around 10.
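To see which columns pass the threshold on your own data, something like the following might help. It is only a sketch: the toy DataFrame and its column names are made up, and the cutoff is lowered to 3 (instead of 16) so the tiny example has something to filter out.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", None],
    "user_id": ["u1", "u2", "u3", "u4"],
    "age": [31, 45, 28, 52],
})
# Force object dtype on the text columns so the dtype check below behaves
# the same even if your pandas version infers a dedicated string dtype.
df = df.astype({"city": object, "user_id": object})

# Number of distinct non-null values in each column.
cardinality = df.nunique()

# Assumed cutoff for this tiny frame; the answer above uses 16.
max_unique = 3
low_cardinality_cols = [
    cname for cname in df.columns
    if df[cname].dtype == "object" and cardinality[cname] < max_unique
]

print(cardinality)
print(low_cardinality_cols)
```

Here "user_id" is a text column too, but every value is distinct, so it is left out of the columns to encode.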

# Replace NaN values first; otherwise you'll get stuck at the encoding step
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')  # feel free to use another strategy
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])

# Apply label encoder 
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])
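One thing to keep in mind: scikit-learn's documentation describes LabelEncoder as meant for target labels rather than input features. A possible alternative for feature columns is OrdinalEncoder, which handles several columns at once and keeps the category mapping for each. The sketch below swaps it in and chains it after the imputer; the toy DataFrame and the name cols_to_encode are assumptions for illustration, not part of the method above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red"],
    "size": ["S", np.nan, "M", "M"],
    "price": [1.0, 2.5, np.nan, 4.0],
}).astype({"color": object, "size": object})

# Only these columns are imputed and encoded; "price" is left untouched.
cols_to_encode = ["color", "size"]

# Fill NaN with the most frequent value, then map categories to integer codes.
encode = make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder())
df[cols_to_encode] = encode.fit_transform(df[cols_to_encode])

# The fitted encoder keeps one sorted category list per column,
# so the codes can be mapped back to the original labels later.
print(encode[-1].categories_)
print(df)
```

Because only the selected columns are passed to fit_transform, the NaN in "price" is still there afterwards.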

Following Dan's advice, here is an example that uses ColumnTransformer and SimpleImputer to fill in missing values in selected columns only:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

column_trans = ColumnTransformer(
[('imp_col1', SimpleImputer(strategy='mean'), [1]),
 ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
remainder='passthrough')

print(column_trans.fit_transform(A)[:, [2,0,1]])
# [[7 2.0 3]
#  [4 3.5 6]
#  [10 5.0 29]]

This approach helps with constructing pipelines which are more suitable for larger applications.
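As a rough illustration of the pipeline idea, the ColumnTransformer can be used as the first step of a Pipeline, so the same imputation is applied at fit and predict time. This is a sketch under assumptions: the column names "a", "b", "c", the target y, and the choice of LinearRegression are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical data; column names and target values are invented.
X = pd.DataFrame({
    "a": [7, 4, 10, 1],
    "b": [2, np.nan, 5, 3],
    "c": [3, 6, np.nan, 9],
})
y = [1.0, 2.0, 3.0, 4.0]

# With a DataFrame, columns can be selected by name instead of position.
preprocess = ColumnTransformer(
    [
        ("imp_b", SimpleImputer(strategy="mean"), ["b"]),
        ("imp_c", SimpleImputer(strategy="constant", fill_value=29), ["c"]),
    ],
    remainder="passthrough",  # column "a" is passed through unchanged
)

# The imputers are fitted on the training data only and reused on predict.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

The regressor here is just a placeholder; any estimator can take its place as the last step.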

There is no need to use SimpleImputer; DataFrame.fillna() can do the job as well.

For the second column, use

column.fillna(column.mean(), inplace=True)

For the third column, use

column.fillna(constant, inplace=True)

Of course, you will need to replace column with the DataFrame column you want to change and constant with your desired constant.

Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be

column = column.fillna(column.mean())
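Put together on a small DataFrame, this could look like the sketch below. The frame and its column names are made up (they mirror the 3x3 example from the question), and 29 stands in for the constant.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names "a", "b", "c" are invented.
df = pd.DataFrame({
    "a": [7, 4, 10],
    "b": [2, np.nan, 5],
    "c": [3, 6, np.nan],
})

# Option 1: pass a dict to fill each column with its own value in one call.
filled = df.fillna({"b": df["b"].mean(), "c": 29})

# Option 2: assign the result back column by column, without inplace=True.
df["b"] = df["b"].fillna(df["b"].mean())
df["c"] = df["c"].fillna(29)

print(filled)
print(df)
```

Both options leave column "a" alone and give the same result.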
