Updating values in an Apache Parquet file

Let's start with the basics:

Parquet is a file format that needs to be saved in a file system.

Key questions:

  1. Does parquet support append operations?
  2. Does the file system (namely, HDFS) allow append on files?
  3. Can the job framework (Spark) implement append operations?

Answers:

  1. parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE modes; there is no append mode. (I'm not sure, but this could change in other implementations -- the Parquet design itself does support append.)

  2. HDFS allows appending to files when the dfs.support.append property is enabled.

  3. The Spark framework does not support appending to existing parquet files, and there are no plans to add it; see this JIRA.

It is not a good idea to append to an existing file in a distributed system, especially since there might be two writers at the same time.
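To illustrate point 3: Spark's append save mode never modifies existing parquet files; it only adds new files to the target directory. A minimal PySpark sketch (the paths and columns are made up for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-append-demo").getOrCreate()

    # Initial write: creates part files plus footer metadata under /data/events
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df1.write.mode("overwrite").parquet("/data/events")

    # "Append" only adds new part files to the directory; the files written
    # above are never reopened or modified in place.
    df2 = spark.createDataFrame([(3, "c")], ["id", "value"])
    df2.write.mode("append").parquet("/data/events")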

More details are here:

  • http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/

  • http://bytepadding.com/linux/understanding-basics-of-filesystem/


Take a look at this nice blog post, which answers your question and describes a method for performing updates using Spark (Scala):

http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html

Copy & Paste from the blog:

when we need to edit the data in our data structures (Parquet), which are immutable.

You can add partitions to Parquet files, but you can’t edit the data in place.

But ultimately we can mutate the data; we just need to accept that we won't be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
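As a rough illustration of that idea, here is a minimal PySpark sketch that rewrites a dataset with one column corrected through a UDF; the paths, the column names and the fix_value function are assumptions made up for the example:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("parquet-rewrite-demo").getOrCreate()

    # Hypothetical correction for the bad values
    def fix_value(v):
        return v.strip().lower() if v is not None else None

    fix_value_udf = udf(fix_value, StringType())

    # Read the existing (immutable) files, apply the correction,
    # and write a brand new set of parquet files alongside the old ones.
    df = spark.read.parquet("/data/events")
    fixed = df.withColumn("value", fix_value_udf(df["value"]))
    fixed.write.mode("overwrite").parquet("/data/events_fixed")

Note that the corrected data goes to a new location; reading and overwriting the same path in one Spark job is not safe.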

If you want to incrementally append data to Parquet (you didn't ask this, but it may be useful for other readers), refer to this well-written blog post:

http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
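The gist of the incremental approach in that post is to write each new batch into its own partition, leave earlier partitions untouched, and read the whole directory back as one dataset. A hedged PySpark sketch with made-up paths and column names:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.appName("parquet-incremental-demo").getOrCreate()

    # Hypothetical daily batch, tagged with its load date.
    new_batch = (spark.read.json("/staging/2017-03-14.json")
                 .withColumn("load_date", lit("2017-03-14")))

    # mode("append") only adds new files/partitions under the target
    # directory; files from earlier loads are never modified.
    (new_batch.write.mode("append")
        .partitionBy("load_date")
        .parquet("/data/events_incremental"))

    # Readers see every increment as one logical dataset.
    all_events = spark.read.parquet("/data/events_incremental")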

Disclaimer: I haven't written those blog posts; I just read them and found they might be useful for others.


There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.

Best practices:

A. Use row groups to create parquet files. You need to tune how many rows of data go into a row group, since features like data compression and dictionary encoding stop being effective if the row groups are too small.

B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory-efficient to work with one row group's worth of data at a time than with everything in the file.

C. Rebuild the original parquet file by copying the unmodified row groups through and appending the modified row groups generated in step B (one parquet file per amended row group).

It's surprisingly fast to reassemble a parquet file using row groups; a rough sketch of the approach follows.
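For anyone who wants to try this outside Spark, here is a hedged pyarrow sketch of steps B and C; the needs_update and fix_table functions are placeholders you would have to fill in, and the file names are made up. (Row group size on the initial write can be controlled with the row_group_size argument of pyarrow.parquet.write_table.)

    import pyarrow as pa
    import pyarrow.parquet as pq

    def needs_update(table: pa.Table) -> bool:
        # Placeholder predicate: decide from the row group's contents
        # whether it holds data that must be corrected.
        return False

    def fix_table(table: pa.Table) -> pa.Table:
        # Placeholder for whatever correction the dirty rows need.
        return table

    source = pq.ParquetFile("events.parquet")

    # Rebuild the file one row group at a time: copy clean groups
    # through unchanged, rewrite only the ones that need amending.
    with pq.ParquetWriter("events_fixed.parquet", source.schema_arrow) as writer:
        for i in range(source.num_row_groups):
            rg = source.read_row_group(i)
            writer.write_table(fix_table(rg) if needs_update(rg) else rg)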

In theory it should be easy to append to an existing parquet file: just strip the footer (which holds the stats), append the new row groups, and write a new footer with updated stats. But there isn't an API / library that supports it.