Is there a rule-of-thumb for how to divide a dataset into training and validation sets?

There are two competing concerns: with less training data, your parameter estimates have greater variance; with less testing data, your performance statistic will have greater variance. Broadly speaking, you should divide the data such that neither variance is too high, which has more to do with the absolute number of instances in each split than with the percentage.

If you have a total of 100 instances, you're probably stuck with cross-validation, as no single split will give you satisfactorily low variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 or a 90:10 split (indeed, you may choose to use less training data if your method is particularly computationally intensive).
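To make the contrast concrete, here is a minimal sketch (assuming scikit-learn; the dataset is synthetic and purely illustrative): cross-validation when you only have ~100 instances, and a single held-out split once you have plenty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

def make_data(n):
    # Toy binary classification problem, just for illustration.
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

# ~100 instances: no single split gives low-variance estimates, so cross-validate.
X_small, y_small = make_data(100)
scores = cross_val_score(LogisticRegression(), X_small, y_small, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# ~100,000 instances: a single 80:20 (or 90:10) split is fine.
X_big, y_big = make_data(100_000)
X_train, X_test, y_train, y_test = train_test_split(
    X_big, y_big, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```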

Assuming you have enough data to set aside a proper held-out test set (rather than using cross-validation), the following is an instructive way to get a handle on variances (there is a code sketch after the list):

  1. Split your data into training and testing sets (80/20 is indeed a good starting point).
  2. Split the training data into training and validation sets (again, 80/20 is a fair split).
  3. Subsample random selections of your training data, train the classifier on each, and record the performance on the validation set.
  4. Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, and 80%. You should see both better performance with more data and lower variance across the different random samples.
  5. To get a handle on variance due to the size of the test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times and observe performance. You should find that the mean performance on small samples of your validation data is roughly the same as the performance on all of it, but the variance is much higher with smaller numbers of test samples.
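Here is a minimal sketch of that procedure (assuming scikit-learn; the dataset is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=10_000) > 0).astype(int)

# Steps 1-2: 80/20 into train/test, then 80/20 of train into train/validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# Steps 3-4: train on random subsamples of increasing size;
# variance across samples should shrink as the fraction grows.
for frac in (0.2, 0.4, 0.6, 0.8):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_tr), size=int(frac * len(X_tr)), replace=False)
        model = LogisticRegression().fit(X_tr[idx], y_tr[idx])
        scores.append(model.score(X_val, y_val))
    print(f"train frac {frac:.0%}: mean {np.mean(scores):.3f}, std {np.std(scores):.3f}")

# Step 5: train once on all the training data, then score on subsamples of the
# validation set; the mean stays put but the variance grows as samples shrink.
model = LogisticRegression().fit(X_tr, y_tr)
for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_val), size=int(frac * len(X_val)), replace=False)
        scores.append(model.score(X_val[idx], y_val[idx]))
    print(f"val frac {frac:.0%}: mean {np.mean(scores):.3f}, std {np.std(scores):.3f}")
```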

You may be surprised to find out that 80/20 is a commonly occurring ratio, often referred to as the Pareto principle. Using that ratio is usually a safe bet.

However, depending on the training/validation methodology you employ, the ratio may change. For example, if you use 10-fold cross-validation, you end up with a validation set of 10% at each fold.
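For instance, with scikit-learn's KFold (an assumed choice of tooling), each of the 10 folds trains on 90% of the data and holds out the remaining 10% for validation:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # placeholder data

# Each fold: 90 instances for training, 10 held out for validation.
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    print(f"train: {len(train_idx)} (90%), validation: {len(val_idx)} (10%)")
```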

There has been some research into the proper ratio between the training set and the validation set (I. Guyon, 1997):

The fraction of patterns reserved for the validation set should be inversely proportional to the square root of the number of free adjustable parameters.

In their conclusion they specify a formula:

Validation set (v) to training set (t) size ratio, v/t, scales like ln(N/h-max), where N is the number of families of recognizers and h-max is the largest complexity of those families.

What they mean by complexity is:

Each family of recognizer is characterized by its complexity, which may or may not be related to the VC-dimension, the description length, the number of adjustable parameters, or other measures of complexity.

Taking the first rule of thumb (i.e. the validation set should be inversely proportional to the square root of the number of free adjustable parameters): if you have 32 adjustable parameters, the square root of 32 is ~5.65, so the ratio v/t should be 1/5.65, or ~0.177. In other words, the validation set should be about 17.7% the size of the training set, which works out to roughly 15% of the total data reserved for validation and 85% for training.
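The arithmetic, as a worked example:

```python
import math

n_params = 32
v_over_t = 1 / math.sqrt(n_params)    # v/t = 1/sqrt(32) ~ 0.177
val_frac = v_over_t / (1 + v_over_t)  # convert the ratio to a fraction of all data
print(f"v/t = {v_over_t:.3f}")                                       # 0.177
print(f"validation: {val_frac:.1%}, training: {1 - val_frac:.1%}")   # ~15.0% / ~85.0%
```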


Last year, I took Prof. Andrew Ng's online machine learning course. His recommendation was:

  * Training: 60%
  * Cross validation: 20%
  * Testing: 20%
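A minimal sketch of that 60/20/20 split (assuming scikit-learn; the data is a placeholder): hold out 40%, then split the holdout half-and-half into cross-validation and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(1000, 5))  # placeholder features
y = np.zeros(1000)                                   # placeholder labels

# 60% training, 40% held out; then split the holdout 50/50 into CV and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_cv), len(X_test))  # 600, 200, 200
```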