Pandas read_csv() 1.2GB file out of memory on VM with 140GB RAM

This sounds like a job for chunksize. It reads the file in fixed-size pieces, so the parser only has to hold one chunk in memory at a time (note that if you concatenate all chunks back together, you still end up with the full DataFrame at the end).

import pandas as pd

chunks = []
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    chunks.append(chunk)
# Concatenate once at the end; calling pd.concat inside the loop is quadratic.
df = pd.concat(chunks, ignore_index=True)
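If what you ultimately need is an aggregate rather than one big DataFrame, you can skip the final concat entirely and process each chunk as it is read, so only one chunk is ever in memory. A minimal sketch (the in-memory CSV and column names are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_data = io.StringIO(
    "id,text,code\n" + "\n".join(f"{i},row{i},{i % 3}" for i in range(10_000))
)

# Count rows per 'code' value chunk by chunk; only one chunk is held at a time.
counts = pd.Series(dtype="int64")
for chunk in pd.read_csv(csv_data, chunksize=1000):
    counts = counts.add(chunk["code"].value_counts(), fill_value=0)

print(counts.astype(int).sort_index())
```

The same pattern works for sums, filters that keep only matching rows, or writing transformed chunks straight back to disk.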

This is weird.

Actually, I ran into the same situation with:

df_train = pd.read_csv('./train_set.csv')

It failed with the same memory error. After trying a lot of things, I got it to work by specifying the dtypes explicitly:

import numpy as np

# pd.np was deprecated and removed in pandas 2.0; import numpy directly.
dtypes = {'id': np.int8,
          'article': str,
          'word_seg': str,
          'class': np.int8}
df_train = pd.read_csv('./train_set.csv', dtype=dtypes)
df_test = pd.read_csv('./test_set.csv', dtype=dtypes)
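To see why explicit dtypes help, you can compare the in-memory footprint with `DataFrame.memory_usage(deep=True)`. A small sketch with synthetic data (column names and values are illustrative):

```python
import io
import numpy as np
import pandas as pd

# Synthetic CSV whose values all fit in int8 (-128..127).
csv_data = "id,class\n" + "\n".join(f"{i % 100},{i % 4}" for i in range(50_000))

# Default parse: both columns are inferred as int64 (8 bytes per value).
df_default = pd.read_csv(io.StringIO(csv_data))

# Explicit narrow dtypes: int8 uses 1 byte per value.
df_typed = pd.read_csv(io.StringIO(csv_data),
                       dtype={"id": np.int8, "class": np.int8})

print(df_default.memory_usage(deep=True).sum())  # larger
print(df_typed.memory_usage(deep=True).sum())    # roughly 1/8 for these columns
```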

Or this:

chunk_size = 10000
chunks = []
for i, chunk in enumerate(pd.read_csv('./train_set.csv', chunksize=chunk_size), start=1):  # read and merge in chunks
    chunks.append(chunk)
    print('-->Read Chunk...', i)
# Concatenate once at the end instead of inside the loop.
df_train = pd.concat(chunks, ignore_index=True)

BUT! Suddenly the original version started working fine as well!

So all that work may have been unnecessary, and I still have no idea what actually went wrong.

I don't know what to say.

Tags:

Python

Pandas