How to apply a polynomial transformation to a subset of features in scikit-learn

PolynomialFeatures, like many other transformers in sklearn, does not have a parameter that specifies which column(s) of the data to apply it to, so you cannot simply put it in a Pipeline and expect it to work on only some of the features.

A more general approach is to use FeatureUnion and give each feature (or group of features) in your dataframe its own pipeline of transformers.

A simple example could be:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, LabelEncoder


X = pd.DataFrame({'cat_var': ['a', 'b', 'c'], 'num_var': [1, 2, 3]})


class ColumnExtractor(object):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Select the configured columns; a list of names keeps the result 2-D
        X_cols = X[self.columns]
        return X_cols


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('num_var', Pipeline([
            ('extract', ColumnExtractor(columns=['num_var'])),
            ('poly', PolynomialFeatures(degree=2))
        ])),
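        ('cat_var', Pipeline([
            ('extract', ColumnExtractor(columns=['cat_var'])),
            # LabelEncoder is meant for target labels and its fit_transform
            # accepts only y, so it fails as a Pipeline step. OneHotEncoder
            # (scikit-learn 0.20+) encodes string columns directly.
            ('ohe', OneHotEncoder()),
        ]))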
    ])),
    ('estimator', LogisticRegression())
])
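For readers on scikit-learn 0.20 or later, the same "one transformer per group of columns" idea can also be sketched with ColumnTransformer from sklearn.compose, which takes the column names directly and so needs no custom extractor class. This is an alternative sketch, not part of the code above; the step names 'poly' and 'ohe' are arbitrary labels chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

X = pd.DataFrame({'cat_var': ['a', 'b', 'c'], 'num_var': [1, 2, 3]})

# Each entry is (name, transformer, columns); only the listed columns
# are passed to that transformer, and the outputs are concatenated.
preprocess = ColumnTransformer([
    ('poly', PolynomialFeatures(degree=2), ['num_var']),
    ('ohe', OneHotEncoder(), ['cat_var']),
])

features = preprocess.fit_transform(X)

# 3 polynomial columns (1, x, x^2) plus 3 one-hot columns
print(features.shape)
```

The resulting object can be used as the first step of a Pipeline in the same way as the FeatureUnion above.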

Yes, there is: check out sklearn-pandas.

This should work (there is probably a more elegant solution, but I can't test it right now):

from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

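X2.columns = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'animal']

# A list selector such as ['col0'] passes a 2-D, single-column array to the
# transformer, which PolynomialFeatures needs; a bare string passes a 1-D array.
mapper = DataFrameMapper([
    (['col0'], PolynomialFeatures(2)),
    (['col1'], PolynomialFeatures(2)),
    (['col2'], PolynomialFeatures(2)),
    (['col3'], PolynomialFeatures(2)),
    (['col4'], PolynomialFeatures(2)),
    (['col5'], PolynomialFeatures(2)),
    # None keeps the column without transforming it; the name must match
    # the lowercase 'animal' assigned above.
    ('animal', None)])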

X3 = mapper.fit_transform(X2)
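One thing to keep in mind with a per-column mapping like this: each PolynomialFeatures(2) only sees a single column, so it produces [1, x, x^2] for that column and no cross terms between columns. A small sketch with made-up numbers (the arrays a and b are purely illustrative) shows the difference from expanding the columns together:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative single-column arrays
a = np.array([[2.0], [3.0]])
b = np.array([[5.0], [7.0]])

# One transformer per column: each block is [1, x, x^2]
per_column = np.hstack([
    PolynomialFeatures(2).fit_transform(a),
    PolynomialFeatures(2).fit_transform(b),
])

# One transformer over both columns: [1, a, b, a^2, a*b, b^2]
joint = PolynomialFeatures(2).fit_transform(np.hstack([a, b]))

print(per_column[0])  # [ 1.  2.  4.  1.  5. 25.]
print(joint[0])       # [ 1.  2.  5.  4. 10. 25.]
```

If interaction terms are wanted, the columns have to be passed to a single PolynomialFeatures instance as a group.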

In response to the answer from Peng Jun Huang: the approach is terrific, but the implementation has issues. (This should be a comment, but it's a bit long for that. Also, I don't have enough cookies for that.)

I tried to use the code and had some problems. After fooling around a bit, I found the following answer to the original question. The main issue is that the ColumnExtractor needs to inherit from BaseEstimator and TransformerMixin to turn it into an estimator that can be used with other sklearn tools.

My example data has two numerical variables and one categorical variable. I used pd.get_dummies to do the one-hot encoding to keep the pipeline a bit simpler. Also, I left out the last stage of the pipeline (the estimator) because we have no y data to fit; the main point is to show how to select columns, process them separately, and join the results.
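As a quick illustration of the get_dummies step on its own (same example data as in this answer), the call with drop_first=True leaves the numerical columns untouched and replaces the categorical one with indicator columns for every level except the first:

```python
import pandas as pd

X = pd.DataFrame({'cat': ['a', 'b', 'c'], 'n1': [1, 2, 3], 'n2': [5, 7, 9]})

# drop_first=True drops the indicator for the first level ('a'),
# leaving cat_b and cat_c next to the untouched numerical columns.
dummies = pd.get_dummies(X, drop_first=True)

print(list(dummies.columns))  # ['n1', 'n2', 'cat_b', 'cat_c']
```

These generated names are why the pipeline selects 'cat_b' and 'cat_c' rather than 'cat'.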

Enjoy.

import pandas as pd
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

X = pd.DataFrame({'cat': ['a', 'b', 'c'], 'n1': [1, 2, 3], 'n2': [5, 7, 9]})

   cat  n1  n2
0   a   1   5
1   b   2   7
2   c   3   9

# original version had class ColumnExtractor(object)
# estimators need to inherit from these classes to play nicely with others
class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_cols = X[self.columns]
        return X_cols

# Use pandas get_dummies to keep the pipeline a bit simpler by
# avoiding OneHotEncoder and LabelEncoder.
# Build the pipeline from a FeatureUnion that processes the
# numerical and the one-hot encoded columns separately.
# FeatureUnion puts them back together when it's done.
pipe2nvars = Pipeline([
    ('features', FeatureUnion([
        ('num', Pipeline([
            ('extract', ColumnExtractor(columns=['n1', 'n2'])),
            ('poly', PolynomialFeatures()),
        ])),
        ('cat_var', ColumnExtractor(columns=['cat_b', 'cat_c'])),
    ])),
])

# now show it working...
for p in range(1, 4):
    pipe2nvars.set_params(features__num__poly__degree=p)
    res = pipe2nvars.fit_transform(pd.get_dummies(X, drop_first=True))
    print('polynomial degree: {}; shape: {}'.format(p, res.shape))
    print(res)

polynomial degree: 1; shape: (3, 5)
[[1. 1. 5. 0. 0.]
 [1. 2. 7. 1. 0.]
 [1. 3. 9. 0. 1.]]
polynomial degree: 2; shape: (3, 8)
[[ 1.  1.  5.  1.  5. 25.  0.  0.]
 [ 1.  2.  7.  4. 14. 49.  1.  0.]
 [ 1.  3.  9.  9. 27. 81.  0.  1.]]
polynomial degree: 3; shape: (3, 12)
[[  1.   1.   5.   1.   5.  25.   1.   5.  25. 125.   0.   0.]
 [  1.   2.   7.   4.  14.  49.   8.  28.  98. 343.   1.   0.]
 [  1.   3.   9.   9.  27.  81.  27.  81. 243. 729.   0.   1.]]
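The inheritance point can be checked in isolation. The sketch below compares the ColumnExtractor from this answer with a plain-object version (named PlainExtractor here purely for illustration): BaseEstimator supplies get_params and set_params, which utilities such as sklearn.base.clone rely on, and TransformerMixin supplies fit_transform.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class PlainExtractor(object):  # hypothetical name for the original version
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


extractor = ColumnExtractor(columns=['n1', 'n2'])

# get_params comes from BaseEstimator and reads the __init__ arguments
print(extractor.get_params())  # {'columns': ['n1', 'n2']}

# clone builds a fresh, unfitted copy with the same parameters
copied = clone(extractor)

# The plain version has no get_params, so clone rejects it
try:
    clone(PlainExtractor(columns=['n1', 'n2']))
    plain_clone_failed = False
except TypeError:
    plain_clone_failed = True

print(hasattr(extractor, 'fit_transform'))         # True
print(hasattr(PlainExtractor(), 'fit_transform'))  # False
print(plain_clone_failed)                          # True
```

The same get_params and set_params machinery is what makes the nested parameter name features__num__poly__degree usable in the loop above.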
