OverflowError while saving large Pandas df to hdf

Having done some reading on this topic, it seems like the issue is how string-type columns are handled. My string columns contain a mixture of all-number strings and strings with characters. Pandas has the flexible option of keeping strings as an object column, without a declared type, but when serializing to hdf5 or feather the contents of the column are converted to a single type (str or double, say) and cannot be mixed. Both of these libraries fail when confronted with a sufficiently large column of mixed types.

Force-converting my mixed column to strings allowed me to save it in feather, but in HDF5 the file ballooned and the process ended when I ran out of disk space.

Here is an answer to a comparable question, where a commenter notes (2 years ago) "This problem is very standard, but solutions are few".

Some background:

String types in Pandas are called object, but this obscures the fact that they may be either pure strings or a mix of dtypes (numpy has builtin string types, but Pandas never uses them for text). So the first thing to do in a case like this is to enforce all string cols as string type (with df[col].astype(str)). But even so, in a large enough file (16GB, with long strings) this still failed. Why?
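For reference, that enforcement step looks something like this (a minimal sketch with made-up column names and toy data):

    import pandas as pd

    # Hypothetical mixed column: all-number strings alongside strings with characters.
    df = pd.DataFrame({"id": ["0001", 42, "A-17"], "value": [1.0, 2.0, 3.0]})

    # Force every object-dtype column to a uniform str dtype before serializing.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str)

    print(df["id"].map(type).unique())  # every element is now <class 'str'>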

The reason I was encountering this error was that my data contained long, high-entropy strings (many different unique values). (With low-entropy data, it might have been worthwhile to switch to a categorical dtype.) In my case, I realised that I only needed these strings in order to identify rows - so I could replace them with unique integers!

df[col] = df[col].map(dict(zip(df[col].unique(), range(df[col].nunique()))))
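To illustrate on a toy frame (the column name is made up): identical strings collapse to the same integer, which is all that is needed to identify rows.

    import pandas as pd

    df = pd.DataFrame({"event_id": ["abc-123", "xyz-999", "abc-123"]})

    # Build a {string: integer} lookup over the unique values, then remap the column.
    mapping = dict(zip(df["event_id"].unique(), range(df["event_id"].nunique())))
    df["event_id"] = df["event_id"].map(mapping)

    print(df["event_id"].tolist())  # [0, 1, 0]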

Other Solutions:

For text data, there are other recommended solutions besides hdf5/feather, including:

  • json
  • msgpack (note that in Pandas 0.25 read_msgpack is deprecated)
  • pickle (which has known security issues, so be careful - but it should be OK for internal storage/transfer of dataframes)
  • parquet, part of the Apache Arrow ecosystem.

Here is an answer from Matthew Rocklin (one of the dask developers) comparing msgpack and pickle. He wrote a broader comparison on his blog.
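As a rough sketch of what those alternatives look like in practice (file paths are placeholders, and to_parquet needs pyarrow or fastparquet installed):

    import pandas as pd

    df = pd.DataFrame({"id": ["0001", "A-17"], "value": [1.0, 2.0]})

    df.to_json("df.json", orient="records", lines=True)  # plain-text, line-delimited JSON
    df.to_pickle("df.pkl")                               # fast, but only for trusted/internal data
    df.to_parquet("df.parquet")                          # columnar, compressed, Arrow-backed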


HDF5 is not a suitable solution for this use case. HDF5 is a better choice if you have many dataframes that you want to store in a single structure. It has more overhead when opening the file, but it then lets you efficiently load each dataframe and also easily load slices of them. It should be thought of as a file system that stores dataframes.
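A sketch of that "file system of dataframes" idea (keys and column names are made up; reading and writing HDF5 from pandas requires PyTables):

    import pandas as pd

    df_events = pd.DataFrame({"user": ["a", "b", "c"], "value": [1, 2, 3]})
    df_users = pd.DataFrame({"user": ["a", "b"], "country": ["DE", "US"]})

    # Store several dataframes under separate keys in a single file.
    with pd.HDFStore("store.h5") as store:
        store.put("events", df_events, format="table", data_columns=True)
        store.put("users", df_users, format="table")

    # Load just a slice of one dataframe, without reading the whole file.
    subset = pd.read_hdf("store.h5", key="events", where="value > 1")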

In the case of a single dataframe of time-series events, the recommended formats would be one of the Apache Arrow project formats, i.e. feather or parquet. One should think of those as column-based, (compressed) csv files. The particular trade-off between those two is laid out nicely under What are the differences between feather and parquet?.
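Writing a single frame to either format is a one-liner (paths are placeholders; feather and the default parquet engine both rely on pyarrow):

    import pandas as pd

    df = pd.DataFrame({"timestamp": pd.date_range("2020-01-01", periods=3),
                       "event": ["click", "view", "click"]})

    df.to_feather("events.feather")   # no heavy compression
    df.to_parquet("events.parquet")   # geared towards efficient compression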

One particular issue to consider is data types. Since feather is not designed to optimize disk space through compression, it can support a wider variety of data types. Parquet, on the other hand, tries to provide very efficient compression, and so supports only a more limited subset of types that it can compress well.