Sample two pandas dataframes the same way

I like Alexander's answer, but I would add an index reset before sampling. The full code:

# index reset
X.reset_index(inplace=True, drop=True)
y.reset_index(inplace=True, drop=True)
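# sampling: draw 10 rows from X, then pick the rows of y with the same labels
X_sample = X.sample(10)
# .loc selects rows by label; plain y[...] on a DataFrame would look up columns
y_sample = y.loc[X_sample.index]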

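Resetting the index gives X and y the same labels (0 to n-1), so the rows selected from y match the rows sampled from X. This assumes both have the same number of rows in the same order.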
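To illustrate why the reset matters, here is a minimal sketch with made-up data in which X and y describe the same observations but carry different index labels. The column names `feature` and `target` and the values are assumptions for illustration only; `.loc` is used for label-based row selection.

```python
import pandas as pd

# Hypothetical data: row i of X belongs with row i of y,
# but the index labels of the two objects do not match.
X = pd.DataFrame({"feature": [10, 20, 30, 40, 50]}, index=[3, 7, 8, 12, 15])
y = pd.DataFrame({"target": [1, 2, 3, 4, 5]}, index=[100, 101, 102, 103, 104])

# After the reset, row i of X and row i of y share the label i.
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

X_sample = X.sample(3, random_state=0)
y_sample = y.loc[X_sample.index]  # same labels, hence the same observations

# In this toy data feature == 10 * target, so the pairing can be checked.
paired = (X_sample["feature"].to_numpy() == 10 * y_sample["target"].to_numpy()).all()
print(paired)
```

Without the reset, `y.loc[X_sample.index]` would raise a `KeyError` here, because none of the labels of X exist in y.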


Below you can find my solution, which doesn't involve any extra variables.

  1. Use the .sample method to take a sample of the first dataframe
  2. Use the .index attribute of the sample to get its index labels
  3. Select the rows with those index labels from the second dataframe

E.g., say you have X and y and want a sample of 10 rows from each. The same rows should be selected in both, of course:

X_sample = X.sample(10)
y_sample = y.loc[X_sample.index]
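If the two objects do not share index labels, one alternative is to sample by position instead of by label. The sketch below is not from the answers above: `sample_together` is a hypothetical helper name, and it assumes X and y have the same number of rows in the same order.

```python
import numpy as np
import pandas as pd

def sample_together(X, y, n, seed=None):
    """Return n rows of X and the rows of y at the same positions."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)
    # Draw n distinct row positions, then use them for both objects.
    positions = rng.choice(len(X), size=n, replace=False)
    return X.iloc[positions], y.iloc[positions]

# Hypothetical data with deliberately different index labels.
X = pd.DataFrame({"a": range(100)}, index=range(100, 200))
y = pd.Series(range(100), index=range(500, 600), name="label")

X_sample, y_sample = sample_together(X, y, n=10, seed=42)
```

Because `.iloc` works on positions, no index reset is needed, and each object keeps its original labels in the sample.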

If you make rows a boolean array of length len(df), then you can get the True rows with df[rows] and get the False rows with df[~rows]:

import pandas as pd
import numpy as np

np.random.seed(2013)  # fix the seed so the run is repeatable

df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))

rows = np.random.randint(2, size=len(df_source)).astype('bool')

df_source_train = df_source[rows]
df_source_test = df_source[~rows]

print(rows)
# [ True  True False  True False]

# if for some reason you need the integer positions where `rows` is True
# (these are positions, not the index labels of df_source)
print(np.where(rows))
# (array([0, 1, 3]),)

print(df_source)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 4 -1.320541  0.679631
# 6  0.833612  0.492572
# 8  1.555721  1.741279

print(df_source_train)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 6  0.833612  0.492572

print(df_source_test)
#           A         B
# 4 -1.320541  0.679631
# 8  1.555721  1.741279
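The same boolean array can be applied to a second object to split both in the same way. This is a minimal sketch with made-up data; the names `mask`, `X`, `y` and the 0.75 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2013)

# Hypothetical data: 8 observations, features in X and targets in y.
X = pd.DataFrame(rng.standard_normal((8, 2)), columns=list("AB"))
y = pd.DataFrame({"target": np.arange(8)})

# One boolean array of length len(X); each entry is True with probability 0.75.
mask = rng.random(len(X)) < 0.75

# Indexing with a NumPy boolean array selects rows by position,
# so the same mask picks matching rows from both objects.
X_train, X_test = X[mask], X[~mask]
y_train, y_test = y[mask], y[~mask]
```

Every row lands in exactly one of the two parts, and X and y are always split at the same rows.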

Tags:

Python

Pandas