Randomly split a numpy array

The error is that randint samples with replacement, so it returns some repeated indices. You can check this by printing len(set(ind)): it will be smaller than 5000.
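A minimal sketch of that check, assuming the indices were drawn with something like np.random.randint(0, n, size=5000). The exact call from the question is not shown here, so `n`, the seed, and the variable names below are illustrative:

```python
import numpy as np

n = 46928           # number of rows (assumed, matching the shape used elsewhere in this thread)
np.random.seed(0)   # arbitrary seed, only so the sketch is repeatable

# randint samples WITH replacement, so some indices repeat
ind_with_repeats = np.random.randint(0, n, size=5000)
n_unique_randint = len(set(ind_with_repeats))

# choice(..., replace=False) samples WITHOUT replacement
ind_unique = np.random.choice(n, size=5000, replace=False)
n_unique_choice = len(set(ind_unique))

print(n_unique_randint, n_unique_choice)
```

The first number should come out below 5000 and the second should be exactly 5000.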

To use the same idea, simply replace the first line with

ind = np.random.choice(input_matrix.shape[0], size=5000, replace=False)

That being said, the second line of your code is pretty slow because it iterates over a Python list. It would be much faster to mark the rows you want with a boolean mask, which lets you select all the other rows with the negation operator ~.

choice = np.random.choice(input_matrix.shape[0], size=5000, replace=False)
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
rest = ~ind
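As a rough illustration of how the two masks are then used to split the array, here is a small self-contained sketch. The array shape and `n_pick` are stand-ins chosen to keep the example small, not the original data:

```python
import numpy as np

# small stand-in array; the shape is illustrative only
input_matrix = np.arange(100 * 2 * 2).reshape((100, 2, 2))
n_pick = 10  # hypothetical sample size (5000 in the original problem)

choice = np.random.choice(input_matrix.shape[0], size=n_pick, replace=False)
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True

picked = input_matrix[ind]      # rows whose mask entry is True
remaining = input_matrix[~ind]  # all other rows

print(picked.shape, remaining.shape)
```

Note that boolean-mask indexing returns rows in their original order. If the picked rows should also be in random order, indexing with input_matrix[choice] keeps the order in which they were drawn.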

On my machine, this method is just as fast as using scikit-learn's train_test_split, which makes me think that the two are doing essentially the same thing.


Another option is to use train_test_split from scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split

# creating matrix
input_matrix = np.arange(46928*28*28).reshape((46928,28,28))
print('Input shape: ', input_matrix.shape)
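# split into two matrices; test_size is the fraction of rows that goes into the second one
# (an integer test_size, e.g. 5000, is also accepted and means an absolute number of rows)
second_size = 5000/46928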

X1, X2 = train_test_split(input_matrix, test_size=second_size)

print('X1 shape: ', X1.shape)
print('X2 shape: ', X2.shape)

Result:

Input shape:  (46928, 28, 28)
X1 shape:  (41928, 28, 28)
X2 shape:  (5000, 28, 28)
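For comparison, the same kind of split can be sketched in plain NumPy by shuffling the row indices once and slicing them. This is an alternative to train_test_split, not a description of its internals. The array size, seed, and `n_second` below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# small stand-in array; the real one has shape (46928, 28, 28)
input_matrix = np.arange(1000 * 2 * 2).reshape((1000, 2, 2))
n_second = 100  # hypothetical size of the second part (5000 in the original)

# shuffle the row indices once, then slice them into two disjoint groups
perm = rng.permutation(input_matrix.shape[0])
X2 = input_matrix[perm[:n_second]]
X1 = input_matrix[perm[n_second:]]

print('X1 shape: ', X1.shape)
print('X2 shape: ', X2.shape)
```

Because both parts come from one permutation, no row can end up in both of them.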
