Splitting data into train and test datasets using time-based splitting

Time-series datasets are split differently from other data: the rows are not shuffled, and the test data always comes after the training data in time. See this link for more info. Alternatively, you can use TimeSeriesSplit from the scikit-learn package. The main idea is this: suppose you have 10 data points ordered by timestamp. The splits will then look like this:

Split 1 : 
Train_indices : 1 
Test_indices  : 2


Split 2 : 
Train_indices : 1, 2 
Test_indices  : 3


Split 3 : 
Train_indices : 1, 2, 3 
Test_indices  : 4

Split 4 : 
Train_indices : 1, 2, 3, 4 
Test_indices  : 5

And so on. You can check the example shown in the link above to get a better idea of how TimeSeriesSplit works in scikit-learn.
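As a rough sketch, the pattern listed above can be reproduced with TimeSeriesSplit. It assumes 10 points that are already in time order and n_splits=9 (chosen here so that each test fold holds a single point). Note that scikit-learn reports 0-based indices, so index 0 corresponds to point 1 in the listing.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 data points, assumed to be already ordered by timestamp
X = np.arange(10).reshape(-1, 1)

# n_splits=9 is an illustrative choice: with 10 points it yields one test point per split
tscv = TimeSeriesSplit(n_splits=9)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for i, (train, test) in enumerate(splits, start=1):
    print(f"Split {i}: train={train} test={test}")
```

Each training set contains every point that comes before the test point, so no future data leaks into training.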

Update: If you have a separate time column, you can simply sort the data by that column and apply TimeSeriesSplit as described above to get the splits.
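A minimal sketch of that idea, assuming the data lives in a pandas DataFrame. The column names "time" and "value" and the sample rows are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data whose rows are not in chronological order
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-01-03", "2021-01-01", "2021-01-04",
        "2021-01-02", "2021-01-06", "2021-01-05",
    ]),
    "value": [30, 10, 40, 20, 60, 50],
})

# Sort by the time column first so that row position reflects time order
df_sorted = df.sort_values("time").reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=2)
folds = []
for train_index, test_index in tscv.split(df_sorted):
    # The indices are positional, so use iloc to select the rows
    train, test = df_sorted.iloc[train_index], df_sorted.iloc[test_index]
    folds.append((train, test))
    print("train up to", train["time"].max().date(),
          "| test from", test["time"].min().date())
```

Resetting the index after sorting is optional here, since iloc selects by position, but it keeps the printed frames easier to read.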

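A note on the split ratio: by default, TimeSeriesSplit gives every test fold len(data) // (n_splits + 1) samples, so a final split of roughly 67% training and 33% testing data corresponds to n_splits=2. The formula below instead produces test folds of about three samples each (for example, 9 training and 3 testing samples in the final split of a 12-sample dataset):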

no_of_split = int((len(data)-3)/3)

Example

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [3, 4], [1, 2], [3, 4], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
tscv = TimeSeriesSplit(n_splits=int((len(y) - 3) / 3))
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

    # Use the indices to select the training and test rows for this split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

OUTPUT :

TRAIN: [0 1 2] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]


One easy way to do it:

First: sort the data by time

Second:

import numpy as np
# Cut the time-sorted data at 67% of its length
train_set, test_set = np.split(data, [int(0.67 * len(data))])

This puts the first 67% of the data in train_set and the remaining 33% in test_set.
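Both steps together might look like the sketch below. The timestamps and values are made-up sample data, and the sort uses np.argsort as one possible way to order the rows by time.

```python
import numpy as np

# Hypothetical data: 9 timestamps in shuffled order with one value per timestamp
timestamps = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6])
values = timestamps * 10

# First: sort by time so that earlier observations come first
order = np.argsort(timestamps)
data = values[order]

# Second: cut at 67% of the length; everything before the cut is training data
cut = int(0.67 * len(data))
train_set, test_set = np.split(data, [cut])

print(train_set)
print(test_set)
```

Because the data is sorted before the cut, every test observation is later in time than every training observation.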