How to give column names after one-hot encoding with sklearn?

This example may help future readers:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})

>>>
      Sex  AgeGroup
0    male         0
1  female        15
2    male        30
3  female        45
4    male        60
5  female        75

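# scikit-learn >= 1.2; on older versions use OneHotEncoder(sparse=False)
encoder = OneHotEncoder(sparse_output=False)

train_X_encoded = pd.DataFrame(encoder.fit_transform(train_X[['Sex']]))

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
train_X_encoded.columns = encoder.get_feature_names_out(['Sex'])

train_X.drop(['Sex'], axis=1, inplace=True)

OH_X_train = pd.concat([train_X, train_X_encoded], axis=1)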

>>>
   AgeGroup  Sex_female  Sex_male
0         0         0.0       1.0
1        15         1.0       0.0
2        30         0.0       1.0
3        45         1.0       0.0
4        60         0.0       1.0
5        75         1.0       0.0

I had the same problem with a custom estimator that extends the BaseEstimator class from sklearn.base.

I added an instance attribute, self.feature_names, in __init__, and as the last step of the transform method I update self.feature_names with the columns of the result.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

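class CustomOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, **kwargs):
        # Filled in by transform() with the output column names
        self.feature_names = []

    def fit(self, X, y=None):
        # Nothing is learned here; the columns are derived at transform time
        return self

    def transform(self, X):
        # The resulting columns depend on the categories present in X
        result = pd.get_dummies(X)
        self.feature_names = result.columns

        return result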

A bit basic, I know, but it does the job I need it to.
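
One thing worth knowing about this approach is that the stored names reflect whatever data was transformed last, since `pd.get_dummies` only creates columns for categories it actually sees. The sketch below repeats the class (without the unused `**kwargs`) and uses two made-up frames, `train` and `test`, to show the effect.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.feature_names = []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        result = pd.get_dummies(X)
        self.feature_names = result.columns
        return result

# Hypothetical data: 'female' never appears in the second frame
train = pd.DataFrame({'Sex': ['male', 'female'], 'AgeGroup': [0, 15]})
test = pd.DataFrame({'Sex': ['male', 'male'], 'AgeGroup': [30, 45]})

enc = CustomOneHotEncoder()

enc.fit_transform(train)
train_names = list(enc.feature_names)  # names after transforming train

enc.transform(test)
test_names = list(enc.feature_names)   # names after transforming test

print(train_names)
print(test_names)
```

If the column set has to stay fixed between training and prediction, scikit-learn's own OneHotEncoder is probably the safer choice, because it remembers the categories seen during fit.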

If you want to pair column names with the feature importances from your sklearn pipeline, you can get the importances from the classifier step and the column names from the one-hot encoding step.

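# model is a fitted search object (it exposes best_estimator_) wrapping the pipeline
a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names

# Importances as the data, column names as the index
df = pd.DataFrame(a, index=b)
df.sort_values(by=[0], ascending=False).head(20)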

You can get the column names using the .get_feature_names() method.

>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()

Detailed example is here.

Update

From scikit-learn 1.0 onward, use get_feature_names_out instead; get_feature_names was deprecated in 1.0 and removed in 1.2.
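
For code that has to run on both older and newer releases, one option is a small wrapper that picks whichever method exists. This is only a sketch; `encoded_feature_names` is a hypothetical helper name, not part of scikit-learn. It also shows that, when the encoder was fitted on data without column names, the generated names fall back to the `x0_` prefix unless input names are passed in.

```python
from sklearn.preprocessing import OneHotEncoder

def encoded_feature_names(encoder, input_features=None):
    """Hypothetical helper: return output column names on old and new versions."""
    if hasattr(encoder, "get_feature_names_out"):
        # scikit-learn >= 1.0
        return list(encoder.get_feature_names_out(input_features))
    # Older releases only provide get_feature_names
    return list(encoder.get_feature_names(input_features))

# Fitted on a plain list, so the encoder knows no column names
ohenc = OneHotEncoder().fit([['male'], ['female'], ['male']])

default_names = encoded_feature_names(ohenc)       # generic x0_ prefix
named = encoded_feature_names(ohenc, ['Sex'])      # prefix supplied by caller

print(default_names)
print(named)
```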

