Timestamp error when transferring and writing Parquet with Python and pandas

pandas has forwarded unknown keyword arguments to the underlying Parquet engine since at least v0.22. Calling table.to_parquet(allow_truncated_timestamps=True) should therefore work; I verified it with pandas v0.25.0 and pyarrow 0.13.0. For more keywords, see the pyarrow docs.

Thanks to @axel for the link to Apache Arrow documentation:

allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to ‘ms’, do not raise an exception.

It seems that modern pandas versions pass extra keyword arguments through to ParquetWriter.

The following code worked properly for me (Pandas 1.1.1, PyArrow 1.0.1):

df.to_parquet(filename, use_deprecated_int96_timestamps=True)

I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.

The solution that worked for me was to cast the timestamp columns to millisecond precision before writing. If you need nanosecond precision, this will ruin your data (everything below a millisecond is discarded)... but if that's the case, it may be the least of your problems.

import pandas as pd

# Load the two source files
table1 = pd.read_parquet('path1.parquet')
table2 = pd.read_parquet('path2.parquet')

# Cast the timestamp column to millisecond precision (drops sub-millisecond data)
table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")

# Concatenate and write the result as a gzip-compressed Parquet file
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')
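A variation on the same workaround, offered as a sketch rather than as the method used above: truncate the values explicitly with Series.dt.floor, which does not depend on how astype handles non-nanosecond units in a given pandas version. The sample frame is hypothetical; the column name "Date" follows the code above.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated table.
table = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-01-01 12:00:00.123456789"])}
)

# Round each timestamp down to a whole millisecond. After this, coercing
# the column to millisecond resolution no longer loses any data.
table["Date"] = table["Date"].dt.floor("ms")
print(table["Date"].iloc[0])
```

The trade-off is the same as with the cast: anything below a millisecond is gone once the column is floored.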
