Read local Parquet file without Hadoop Path API

Unfortunately the Java Parquet implementation is not independent of some Hadoop libraries. There is an existing issue in their bug tracker to make it easy to read and write Parquet files in Java without depending on Hadoop, but there does not seem to be much progress on it. The InputFile interface was added to provide a bit of decoupling, but a lot of the classes that implement the metadata part of Parquet, and also all the compression codecs, live inside the Hadoop dependency.

I found another implementation of InputFile in the smile library; this might be more efficient than going through the Hadoop filesystem abstraction, but it does not solve the dependency problem.

As other answers already mention, you can create a Hadoop Path for a local file and use that without problems.

java.io.File file = ...
org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(file.toURI());
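
For example, a minimal sketch of reading such a local file into Avro GenericRecords (assuming the parquet-avro module is on the classpath; the file path and class name below are just placeholders):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

import java.io.File;
import java.io.IOException;

public class LocalParquetRead {
    public static void main(String[] args) throws IOException {
        File file = new File("/tmp/example.parquet"); // hypothetical local file
        // Wrap the local file in the InputFile abstraction via a Hadoop Path
        InputFile inputFile = HadoopInputFile.fromPath(new Path(file.toURI()), new Configuration());
        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(inputFile).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}

This still goes through HadoopInputFile, so hadoop-common has to be on the classpath, but it never touches HDFS.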

The dependency tree that is pulled in by Hadoop can be reduced a lot by defining some exclusions. I'm using the following to reduce the bloat (Gradle syntax):

compile("org.apache.hadoop:hadoop-common:3.1.0") {
    exclude(group: 'org.slf4j')
    exclude(group: 'org.mortbay.jetty')
    exclude(group: 'javax.servlet.jsp')
    exclude(group: 'com.sun.jersey')
    exclude(group: 'log4j')
    exclude(group: 'org.apache.curator')
    exclude(group: 'org.apache.zookeeper')
    exclude(group: 'org.apache.kerby')
    exclude(group: 'com.google.protobuf')
}

If you really cannot avoid Hadoop, you can try Spark and run it in local mode. A quick start guide can be found here: https://spark.apache.org/docs/latest/index.html. Downloads are available at this link: https://archive.apache.org/dist/spark/ (pick a version you like; there is always a build without Hadoop, although the compressed download is unfortunately still around 10-15 MB). You will also be able to find some Java examples under examples/src/main.

After that, you can read the file in as a Spark DataFrame like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("Reducing dependency by adding more dependencies")
        .master("local[*]")
        .getOrCreate();
Dataset<Row> parquet = spark.read().parquet("C:/files/myfile.csv.parquet");
parquet.show(20);

This solution does satisfy the original conditions in the question. However, it is admittedly beating around the bush (but hell yeah, it's funny). Still, it might help open up a new way to tackle this.


The parquet-tools utility seems like a good place to start. It does have some Hadoop dependencies, but it works as well with local files as with HDFS (depending on defaultFS in the Configuration). If you have licensing restrictions (the tools are Apache V2, like everything else), you can probably just review the source of one of the content-printing commands (cat, head, or dump) for inspiration.

The closest thing to your Avro example would be using ParquetFileReader, I guess.

// Configuration and Path come from Hadoop; the other classes come from org.apache.parquet.*
Configuration conf = new Configuration();
Path path = new Path("/parquet/file/path");
// Read the footer (metadata) first, without filtering any row groups
ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetFileReader reader = new ParquetFileReader(conf, path, footer);
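
To actually iterate over the rows from that reader, one option is the example Group API that ships with parquet-column. A rough sketch (reusing footer and reader from above; treat it as a starting point rather than a polished implementation):

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

MessageType schema = footer.getFileMetaData().getSchema();
MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);

PageReadStore rowGroup;
while ((rowGroup = reader.readNextRowGroup()) != null) {
    // Materialize each record of the row group as a simple Group object
    RecordReader<Group> recordReader = columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
    for (long i = 0, rows = rowGroup.getRowCount(); i < rows; i++) {
        Group group = recordReader.read();
        System.out.println(group);
    }
}
reader.close();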