Spark: Why does Python significantly outperform Scala in my use case?

The Scala job takes longer because of a misconfiguration: as a result, the Python and Scala jobs were given unequal resources.

There are two mistakes in the code:

val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
  1. LINE #1. Once this line has executed, the resource configuration of the Spark job is already established and fixed. From that point on there is no way to adjust anything, neither the number of executors nor the number of cores per executor.
  2. LINES #4-5. sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set on the config instance you pass to new SparkContext(config).

Bearing the above in mind, I would propose changing the code of the Scala job to

config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

and re-test it. I bet the Scala version is going to be X times faster now.
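
If in doubt whether the settings actually took effect, the effective configuration can be read back once the context is up (sc.getConf on the Scala side works the same way). A minimal sketch of the Python side, assuming the PySpark job is built directly on SparkContext, so that both jobs end up with identical resources:

from pyspark import SparkConf, SparkContext

# Resource settings must land in the SparkConf before the context is created
conf = (
    SparkConf()
    .set("spark.executor.instances", "4")  # same resources as the Scala job
    .set("spark.executor.cores", "8")
)
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")

# Sanity check: read the effective settings back from the running context
print(sc.getConf().get("spark.executor.instances"))
print(sc.getConf().get("spark.executor.cores"))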


Your basic assumption, that Scala or Java should be faster for this specific task, is simply incorrect. You can easily verify it with minimal local applications. The Scala one:

import scala.io.Source
import java.time.{Duration, Instant}

object App {
  def main(args: Array[String]): Unit = {
    val Array(filename, string) = args

    val start = Instant.now()

    // Count lines starting with the given prefix; the count itself is
    // discarded, we only measure how long the scan takes
    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length

    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}

The Python one:

import datetime
import sys

if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))

    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")

Results (300 repetitions each, Python 3.7.6, Scala 2.11.12) on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:

[boxplots of duration in millis for the above programs]

  • Python 273.50 (258.84, 288.16)
  • Scala 634.13 (533.81, 734.45)

As you can see, Python is not only systematically faster, but also more consistent (lower spread).

The take-away message is: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is X4 faster" or "XYZ is slow as compared to ZYX (..) Approximately, 10x slower", it usually means that someone wrote really bad code to test things.

Edit:

To address some concerns raised in the comments:

  • In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this specific path just passes the bytestring as-is and decodes it as UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization".
  • What is passed back is just a single integer per partition, so the impact in that direction is negligible (see the sketch after this list).
  • Communication is done over local sockets (all communication on the worker beyond the initial connect and auth is performed using the file descriptor returned from local_connect_and_auth, which is nothing other than a file associated with a socket). Again, that is as cheap as it gets when it comes to communication between processes.
  • Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
  • This case is completely different from cases where either simple or complex objects have to be passed to and from the Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (the most notable examples include old-style UDFs and some parts of old-style MLlib).
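
To make the data flow described above concrete, here is a rough sketch of a job with the same shape (the path and prefix are placeholders, not taken from the OP's code). Raw lines stream from the JVM reader into the Python workers, each partition is reduced to a single integer in Python, and only those integers travel back:

from operator import add
from pyspark import SparkContext

sc = SparkContext()

def count_matching(lines, prefix="<row"):  # prefix is a placeholder search string
    # Runs inside the Python worker: consumes the streamed lines and
    # yields exactly one integer for the whole partition
    yield sum(1 for line in lines if line.startswith(prefix))

total = (
    sc.textFile("s3a://some-bucket/Posts.xml")  # JVM: Hadoop input format reads the data
    .mapPartitions(count_matching)              # Python: filtering and counting happen here
    .reduce(add)                                # only one integer per partition crosses back
)
print(total)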

Edit 2:

Since jasper-m was concerned about startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.

Here are the results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which far exceeds anything you can expect in a single Spark task.

[boxplots of duration in millis for the above programs on the larger input]

  • Python 22809.57 (21466.26, 24152.87)
  • Scala 27315.28 (24367.24, 30263.31)

Please note the non-overlapping confidence intervals.

Edit 3:

To address another comment from Jasper-M:

"The bulk of all the processing is still happening inside a JVM in the Spark case."

That is simply incorrect in this particular case:

  • The job in question is a map job with a single global reduce using PySpark RDDs (sketched after this list).
  • PySpark RDDs (unlike, say, DataFrames) implement the bulk of their functionality natively in Python, with the exception of input, output and inter-node communication.
  • Since it is a single-stage job, and the final output is small enough to be ignored, the main responsibility of the JVM (if one were to nitpick, this is implemented mostly in Java, not Scala) is to invoke the Hadoop input format and push data through the socket file to Python.
  • The read part is identical for the JVM and Python APIs, so it can be considered constant overhead. It also doesn't qualify as the bulk of the processing, even for a job as simple as this one.
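
For reference, this is roughly what a job of that shape looks like with the RDD API (again a sketch with placeholder names and path, not the OP's actual code); the comments mark which side executes each step:

from pyspark import SparkContext

sc = SparkContext()

matches = (
    sc.textFile("s3a://some-bucket/Posts.xml")     # JVM: invokes the Hadoop input format
    .filter(lambda line: line.startswith("<row"))  # Python worker: the predicate is evaluated here
    .count()                                       # the single global reduce; only small integers return
)
print(matches)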