How to get SVMs to play nicely with missing data in scikit-learn?

You can do data imputation to handle missing values before using SVM.

EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.

(copied from page and modified)

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is the value of your placeholder, strategy is if you'd like mean, median or mode, and axis=0 means it calculates the imputation based on the other feature values for that sample
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)

The most popular answer here is outdated. "Imputer" is now "SimpleImputer". The current way to solve this issue is given here. Imputing the training and testing data worked for me as follows:

from sklearn import svm
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(x_train)

X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)
    
clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)

You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.

How to get SVMs to play nicely with missing data in scikit-learn?

Tags:

Python

Machine Learning

Scikit Learn

Related

Recent Posts