Reading HDF5 files

There's a new product that lets Apache Spark read HDF5 files via Scala:

https://www.hdfgroup.org/downloads/hdf5-enterprise-support/hdf5-connector-for-apache-spark/

With the above product, you can open and read HDF5 files in Scala as shown below:

//
// HOW TO RUN:
//
// $ spark-2.3.0-SNAPSHOT-bin-hdf5s-0.0.1/bin/spark-shell -i demo.scala

import org.hdfgroup.spark.hdf5._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL HDF5 example").getOrCreate()

// We assume that HDF5 files (e.g., GSSTF_NCEP.3.2008.12.31.he5) are 
// under /tmp directory. Change the path name ('/tmp') if necessary.
val df = spark.read.option("extension", "he5").option("recursion", "false")
  .hdf5("/tmp/", "/HDFEOS/GRIDS/NCEP/Data Fields/SST")

// Let's print some values from the dataset.
df.show()

// The output will look like this:
//
//+------+-----+------+
//|FileID|Index| Value|
//+------+-----+------+
//|     0|    0|-999.0|
//|     0|    1|-999.0|
//|     0|    2|-999.0|
//...
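
// The result is an ordinary DataFrame, so the usual Spark SQL operations
// apply. A short sketch (assuming the schema shown above, and that -999.0
// marks fill values in this dataset): drop the fill values, then average
// the remaining readings.
import org.apache.spark.sql.functions.avg
val valid = df.filter(df("Value") =!= -999.0)
valid.agg(avg("Value")).show()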

System.exit(0)

There isn't a Hadoop InputFormat implementation for HDF5 because the format cannot be split at arbitrary byte boundaries:

"Breaking the container into blocks is a bit like taking an axe and chopping it to pieces, severing blindly the content and the smart wiring in the process. The result is a mess, because there's no alignment or correlation between HDFS block boundaries and the internal HDF5 cargo layout or container support structure." (Reference)
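To make that concrete: if someone did write an InputFormat for HDF5, it would have to declare every file non-splittable, so a single task reads each file end to end and intra-file parallelism is lost. A hypothetical sketch of what that would look like (no such class ships with Hadoop or the connector):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Hypothetical: any HDF5 InputFormat would have to opt out of splitting,
// since HDFS block boundaries fall at arbitrary offsets inside the
// container. A concrete subclass would still need a RecordReader that
// parses each file in full.
abstract class WholeFileHdf5InputFormat[K, V] extends FileInputFormat[K, V] {
  override protected def isSplitable(context: JobContext, filename: Path): Boolean = false
}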

The same source discusses the possibility of transforming HDF5 files into Avro files, which Hadoop/Spark can then read natively. The PySpark example you alluded to is probably a simpler way to go, but, as the linked document mentions, there are a number of technical challenges that need to be addressed before HDF5 files can be processed efficiently and effectively in Hadoop/Spark.
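Short of either approach, a common workaround is to skip InputFormats entirely: parallelize the list of file paths and let each task open its file with a local HDF5 reader. Below is a minimal Scala sketch, runnable in spark-shell; the file list, the use of the jHDF library as the reader, and the assumption that every executor can reach the files at the same path are all mine, not from the source.

import java.nio.file.Paths
import io.jhdf.HdfFile

// Hypothetical file list; each path must be readable on every executor
// (e.g., a shared mount), since the file is opened locally rather than
// streamed from HDFS.
val files = Seq("/tmp/GSSTF_NCEP.3.2008.12.31.he5")

// One task per file: each task opens its file locally and reads the
// dataset in full, sidestepping the block-splitting problem at the cost
// of intra-file parallelism. 'spark' is the session spark-shell provides.
val sst = spark.sparkContext.parallelize(files).flatMap { path =>
  val hdf = new HdfFile(Paths.get(path))
  try {
    hdf.getDatasetByPath("/HDFEOS/GRIDS/NCEP/Data Fields/SST")
      .getData().asInstanceOf[Array[Array[Float]]]
      .flatten
  } finally hdf.close()
}
sst.take(3).foreach(println)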