Read few parquet files at the same time in Spark

# list of glob patterns under the HDFS base path (hdfs_path), one per partition range
InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
             hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

# read.parquet accepts multiple paths as varargs, so unpack the list
df = spark.read.parquet(*InputPath)
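
Note that those paths glob into Hive-style partition directories (date=..., hour=...). If the partition columns should appear in the resulting DataFrame, Spark needs a basePath hint when you read below the partition root; a minimal sketch, reusing hdfs_path and InputPath from above:

df = (spark.read
      .option("basePath", hdfs_path + "parquets/")
      .parquet(*InputPath))
# date and hour now appear as columns in df's schema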

See this issue on the Spark JIRA; reading multiple parquet files in one call is supported from Spark 1.4 onwards.

Without upgrading to 1.4, you could either point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want and pass them to parquetFile (it accepts varargs), as sketched below.
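
A sketch of that HDFS-API route in PySpark, assuming sc is the active SparkContext (the Hadoop classes come through the py4j gateway via the private _jvm and _jsc attributes; the glob pattern is illustrative):

hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

# expand the glob ourselves, then hand the concrete file paths to parquetFile
statuses = fs.globStatus(hadoop.fs.Path("/path/to/dir/part_*.parquet"))
paths = [status.getPath().toString() for status in statuses]
df = sqlContext.parquetFile(*paths)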


FYI, you can also:

  • read a subset of parquet files using the wildcard symbol *:

    sqlContext.read.parquet("/path/to/dir/part_*.gz")

  • read multiple parquet files by explicitly specifying them:

    sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")


For reading: supply the file path with a '*' wildcard.

Example

pqtDF = sqlContext.read.parquet("Path_*.parquet")