Why feature scaling in SVM?

Just personal thoughts from another perspective.
1. Why does feature scaling have an influence?
There's a saying in machine learning: "garbage in, garbage out." The more faithfully your features reflect the data, the more accurate your algorithm will be. That also applies to how machine learning algorithms treat the relationships between features. Unlike the human brain, when a machine learning algorithm classifies, all the features are expressed and computed in the same coordinate system, which in some sense establishes an a priori assumption among the features (not necessarily a reflection of the data itself). And the nature of most algorithms is to find the most appropriate relative weights for the features in order to fit the data. So when these algorithms are fed unscaled features, the features with large scales have more influence on the weights, and that influence is not a genuine reflection of the data itself.
2. Why does feature scaling usually improve accuracy?
A common practice in unsupervised machine learning for selecting hyper-parameters (or hyper-hyper-parameters; for example, in the hierarchical Dirichlet process or hLDA) is that you should not inject any personal subjective assumptions about the data. The best approach is simply to assume that they all have an equal probability of appearing. I think the same applies here. Feature scaling just makes the assumption that all features have an equal opportunity to influence the weights, which more truthfully reflects the information/knowledge you have about the data. It commonly also results in better accuracy.

BTW, regarding affine transformation invariance and faster convergence, there is an interesting link here on stats.stackexchange.com.


Feature scaling is a general trick applied to optimization problems (not just SVM). A standard algorithm for solving the SVM optimization problem is gradient descent. Andrew Ng has a great explanation in his Coursera videos here.

I will illustrate the core idea here (the figures are borrowed from Andrew's slides). Suppose you have only two parameters and one of them can take a relatively large range of values. Then the contours of the cost function can look like very tall and skinny ovals (see the blue ovals below). Gradient descent (its path is drawn in red) can take a long time, bouncing back and forth, to find the optimal solution.
[Figure: tall, skinny cost-function contours (blue) with the gradient descent path (red) zig-zagging toward the minimum]

Instead, if you scale your features, the contours of the cost function look more like circles; gradient descent can then take a much straighter path and reach the optimum much faster.

[Figure: roughly circular cost-function contours with the gradient descent path going almost straight to the minimum]
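Here is a minimal sketch of that effect (the data, learning rates, and tolerance are all made up for illustration): plain gradient descent on a two-feature least-squares cost, once with raw features of very different scales and once after standardizing them.

```python
import numpy as np

# Hypothetical data: feature 1 lives in [0, 1], feature 2 in [0, 1000],
# so the cost contours are long, thin ovals.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = X @ np.array([3.0, 0.002]) + rng.normal(0, 0.01, 200)

def gd_steps(X, y, lr, tol=1e-8, max_iter=200_000):
    """Plain batch gradient descent on the mean squared error.
    Returns how many steps it takes until the update becomes tiny."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:
            return i
        w = w_new
    return max_iter  # did not converge within the budget

# Unscaled: the step size must be tiny to stay stable, progress along the
# shallow direction of the valley is very slow, and the run typically
# exhausts the iteration budget.
print("unscaled:", gd_steps(X, y, lr=1e-6))

# Standardized: the contours are roughly circular, a much larger step size
# is stable, and the run converges quickly.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled:  ", gd_steps(X_std, y, lr=0.1))
```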


The true reason for scaling features in SVM is that this classifier is not affine-transformation invariant. In other words, if you multiply one feature by 1000, the solution given by the SVM will be completely different. This has almost nothing to do with the underlying optimization techniques (although they are affected by these scale problems, they should still converge to the global optimum).

Consider an example: you have men and women, encoded by their sex and height (two features). Let us assume a very simple case with the following data:

0 -> man
1 -> woman

╔═════╦════════╗
║ sex ║ height ║
╠═════╬════════╣
║  1  ║  150   ║
╠═════╬════════╣
║  1  ║  160   ║
╠═════╬════════╣
║  1  ║  170   ║
╠═════╬════════╣
║  0  ║  180   ║
╠═════╬════════╣
║  0  ║  190   ║
╠═════╬════════╣
║  0  ║  200   ║
╚═════╩════════╝

Now let us do something silly: train it to predict the sex of the person, so we are trying to learn f(x, y) = x (ignoring the second feature).

It is easy to see that for such data the largest-margin classifier will "cut" the plane horizontally somewhere around height 175, so once we get a new sample "1 178" (a woman of 178 cm), the classification says she is a man.
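You can check this numerically with a quick sketch (assuming scikit-learn is available; a very large C is used here to approximate the hard-margin, i.e. largest-margin, classifier):

```python
import numpy as np
from sklearn.svm import SVC

# The table above as a feature matrix: columns are (sex, height in cm);
# the target is the sex column itself (the "silly" setup from the text).
X = np.array([[1, 150], [1, 160], [1, 170],
              [0, 180], [0, 190], [0, 200]], dtype=float)
y = X[:, 0]                       # 0 -> man, 1 -> woman

# A very large C approximates the hard-margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)  # the height weight dominates the decision
print(clf.predict([[1, 178]]))    # a 178 cm woman -> [0.], i.e. "man" (wrong)
```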

However, if we scale everything down to [0, 1], we get something like

╔═════╦════════╗
║ sex ║ height ║
╠═════╬════════╣
║  1  ║  0.0   ║
╠═════╬════════╣
║  1  ║  0.2   ║
╠═════╬════════╣
║  1  ║  0.4   ║
╠═════╬════════╣
║  0  ║  0.6   ║
╠═════╬════════╣
║  0  ║  0.8   ║
╠═════╬════════╣
║  0  ║  1.0   ║
╚═════╩════════╝

and now the largest-margin classifier "cuts" the plane nearly vertically (as expected), so given the new sample "1 178", which scales to roughly "1 0.56", we get that it is a woman (correct!).
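The scaled counterpart of the sketch above, again assuming scikit-learn and the same large-C approximation, uses MinMaxScaler so the transformation matches the second table:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

# Same toy data, but min-max scaled to [0, 1] before training;
# this reproduces the second table (sex stays 0/1, height -> (h - 150) / 50).
X = np.array([[1, 150], [1, 160], [1, 170],
              [0, 180], [0, 190], [0, 200]], dtype=float)
y = X[:, 0]

scaler = MinMaxScaler().fit(X)
clf = SVC(kernel="linear", C=1e6).fit(scaler.transform(X), y)

x_new = scaler.transform([[1, 178]])   # -> roughly [[1.0, 0.56]]
print(x_new)
print(clf.predict(x_new))              # -> [1.], i.e. "woman" (correct)
```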

So in general, scaling ensures that features do not end up as the main predictors merely because their values are big.