How can I use a custom feature selection function in scikit-learn's `pipeline`

I just want to post my solution for completeness; maybe it is useful to someone else:

class ColumnExtractor(object):

    def transform(self, X):
        cols = X[:, 2:4]  # columns 3 and 4 (indices 2 and 3) are "extracted"
        return cols

    def fit(self, X, y=None):
        return self

Then, it can be used in the Pipeline like so:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),
    ('classification', GaussianNB())
])
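To make this concrete, here is a self-contained sketch of that pipeline fitted on toy data; the random features and labels are made up purely for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

class ColumnExtractor(object):
    def transform(self, X):
        return X[:, 2:4]  # keep only the 3rd and 4th columns

    def fit(self, X, y=None):
        return self

# toy data: 6 samples, 5 features (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(6, 5)
y = np.array([0, 0, 0, 1, 1, 1])

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),       # scaling happens on all 5 columns
    ('reduce_dim', ColumnExtractor()),  # then only 2 columns reach the classifier
    ('classification', GaussianNB())
])
clf.fit(X, y)
print(clf.predict(X).shape)  # (6,)
```

Note that the scaler runs before the extractor here, so scaling statistics are computed on all columns; swap the step order if you only want the selected columns scaled.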

EDIT: General solution

And for a more general solution, if you want to select and stack multiple columns, you can use the following class:

import numpy as np

class ColumnExtractor(object):

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

It can then be used in the Pipeline just like before:

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1, 3))),  # selects the 2nd and 4th columns
    ('classification', GaussianNB())
])
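As a quick sanity check (the toy array below is made up for illustration), cols=(1, 3) does pick out the 2nd and 4th columns:

```python
import numpy as np

class ColumnExtractor(object):
    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        # slice each requested column as a 2-D array and stack them side by side
        return np.concatenate([X[:, c:c + 1] for c in self.cols], axis=1)

    def fit(self, X, y=None):
        return self

X = np.arange(12).reshape(3, 4)  # rows: [0,1,2,3], [4,5,6,7], [8,9,10,11]
ext = ColumnExtractor(cols=(1, 3))
print(ext.fit(X).transform(X))
# [[ 1  3]
#  [ 5  7]
#  [ 9 11]]
```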

Adding to Sebastian Raschka's and eickenberg's answers, the requirements a transformer object must meet are specified in scikit-learn's documentation.

There are several more requirements than just having `fit` and `transform` if you want the estimator to be usable in parameter estimation, such as implementing `set_params`.
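The easiest way to satisfy those requirements is to inherit from BaseEstimator (which derives get_params/set_params from the __init__ signature) and TransformerMixin (which adds fit_transform). A minimal sketch, reusing the ColumnExtractor idea from above:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, cols=(2, 3)):
        # store parameters unmodified, as the sklearn API expects
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, list(self.cols)]

ext = ColumnExtractor(cols=(0, 2))
print(ext.get_params())   # {'cols': (0, 2)} -- provided by BaseEstimator
ext.set_params(cols=(1, 3))
print(ext.transform(np.arange(8).reshape(2, 4)))
# [[1 3]
#  [5 7]]
```

With get_params/set_params in place, the step's parameters can also be tuned through the pipeline, e.g. via a `dim_red__cols` entry in a GridSearchCV parameter grid.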


If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.
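For illustration, assuming select_3_and_4 is a plain function as in the question (its body here is a guess), the monkey-patching trick looks like this:

```python
import numpy as np

# assumed shape of the question's selection function
def select_3_and_4(X):
    return X[:, 2:4]

# bolt the transformer interface onto the bare function:
# transform delegates to the function itself, fit is a no-op returning it
select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

X = np.arange(10).reshape(2, 5)
print(select_3_and_4.fit(X).transform(X))
# [[2 3]
#  [7 8]]
```

A cleaner alternative to this trick is sklearn.preprocessing.FunctionTransformer, which wraps an arbitrary function as a stateless transformer.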

Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.

Data-driven feature selection tools may be off-topic here, but always useful: check e.g. sklearn.feature_selection.SelectKBest using sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression with e.g. k=2 in your case.
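For instance, a sketch on synthetic data (generated here with make_classification purely for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# toy classification data with 4 features, 2 of them informative
X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# keep the 2 features with the highest ANOVA F-score against y
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)             # (100, 2)
print(selector.get_support())  # boolean mask marking the 2 selected columns
```

Unlike the hand-written ColumnExtractor, SelectKBest decides *which* columns to keep from the training data, so it fits naturally into the same Pipeline slot.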