Reading JSON with Apache Spark - `corrupt_record`

To read the multi-line JSON as a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// wholeTextFiles yields (path, content) pairs; .values keeps only the file contents
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)

Reading large files in this manner is not recommended; from the wholeTextFiles docs:

Small files are preferred, large file is also allowable, but may cause bad performance.


Spark cannot read a top-level JSON array into records, so you have to pass one JSON object per line instead:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} 
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} 
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} 
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}

As described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects you to pass a file with many JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSON Lines (JSONL). Basically, it's an alternative to CSV.
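
The CSV analogy is visible from the writer side as well: saving a DataFrame as JSON produces JSON Lines output. A small sketch, assuming df is the DataFrame loaded earlier and the output path is arbitrary:

// Each partition is written as a part file with one JSON object per line,
// just as the CSV writer emits one delimited record per line
df.write.json("points_as_jsonl")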


I ran into the same problem. I used a SparkContext and Spark SQL with the same configuration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")

val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()

Then, using the Spark context, I read the whole JSON file (JSON holds the path to the file):

val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)  // keep only the file contents

You can create a schema for future selects, filters, etc.:

import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))

Create a DataFrame using Spark SQL:

import org.apache.spark.sql.DataFrame

val df: DataFrame = spark.read.schema(schema).json(jsonRDD)

For testing, use show and printSchema:

df.show()
df.printSchema()
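
With the schema applied, the typed selects and filters mentioned above work as expected. A short illustration using the columns defined in the schema (the threshold value is arbitrary):

// Project two columns and filter on the numeric index
df.select("toid", "index")
  .filter(df("index") > 2)
  .show()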

sbt build file:

name := "spark-single"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "2.0.2"

As Spark expects the JSON Lines format rather than typical JSON, we can tell Spark to read typical multi-line JSON by setting the multiline option (available since Spark 2.2):

val df = spark.read.option("multiline", "true").json("<file>")
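
For example, if the original top-level array from the question is saved as points.json (the filename here is just an assumption), it loads without any reformatting; with multiline enabled, Spark reads the whole file and produces one row per array element:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// The file may be a single JSON object or a top-level JSON array
// spanning multiple lines; each array element becomes one row
val multilineDf = spark.read.option("multiline", "true").json("points.json")
multilineDf.show()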