Spark Streaming Processing Time vs Total Delay vs Processing Delay

Let's break down each metric. For that, let's define a basic streaming application which reads batches from some arbitrary source at a 4 second interval and computes the classic word count:

inputDStream.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFiles("hdfs://...")
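
For reference, here is a minimal sketch of the setup implied around that snippet, assuming a socket text source (the app name, host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WordCount")
val ssc  = new StreamingContext(conf, Seconds(4)) // the 4 second batch interval
val inputDStream = ssc.socketTextStream("localhost", 9999)

// ... the word count pipeline from above ...

ssc.start()
ssc.awaitTermination()
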
  • Processing Time: The time it takes to compute a given batch, across all of its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, with the prerequisite that the job has already been submitted for execution.

  • Scheduling Delay: The time a batch's jobs spend waiting in the Spark Streaming scheduler queue before being submitted. How is this computed? As we've said, our application reads a batch from the source every 4 seconds. Now let's assume a given batch took 8 seconds to compute. By the time it finishes, the next batch has already been waiting 8 - 4 = 4 seconds, making that batch's scheduling delay 4 seconds long.

  • Total Delay: This is Scheduling Delay + Processing Time. Following the same example: if the scheduling delay is 4 seconds and the batch takes another 8 seconds to compute, the total delay is 4 + 8 = 12 seconds long. (A short sketch of this arithmetic follows the list.)
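
To make that arithmetic concrete, here is a small, self-contained Scala sketch (plain Scala, not Spark; the object name and all values are made up for illustration). Batches become ready every 4 seconds, each takes 8 seconds to compute, and the scheduling delay is the gap between when a batch is ready and when the previous batch has freed the scheduler:

object DelayExample extends App {
  val batchInterval   = 4         // seconds between batches
  val processingTimes = Seq(8, 8) // seconds each batch takes to compute

  var freeAt = 0                  // time at which the previous batch finishes
  processingTimes.zipWithIndex.foreach { case (processing, i) =>
    val readyAt = i * batchInterval          // a new batch arrives every 4 seconds
    val startAt = math.max(readyAt, freeAt)  // it waits if the previous batch is still running
    val schedulingDelay = startAt - readyAt
    val totalDelay      = schedulingDelay + processing
    freeAt = startAt + processing
    println(s"batch $i: scheduling delay = ${schedulingDelay}s, " +
      s"processing time = ${processing}s, total delay = ${totalDelay}s")
  }
}

Running it prints a scheduling delay of 0s and a total delay of 8s for the first batch, and a scheduling delay of 4s and a total delay of 12s for the second, matching the numbers above.
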

A live example from a working Streaming application:

[screenshot: batch rows from a running Streaming application's UI, showing processing time, scheduling delay and total delay]

We see that:

  • The batch in the bottom row took 11 seconds to process, so the next batch's scheduling delay is 11 - 4 = 7 seconds.
  • If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1 second) 7 + 1 = 8.
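
Finally, if you want to read these metrics programmatically rather than from the UI, Spark Streaming reports them per batch through the StreamingListener interface; each completed batch's BatchInfo exposes schedulingDelay, processingDelay (the processing time) and totalDelay, all in milliseconds. A minimal sketch, assuming ssc is the StreamingContext from the setup above:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // All three metrics are Option[Long] values, in milliseconds.
    println(s"batch ${info.batchTime}: " +
      s"scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time = ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"total delay = ${info.totalDelay.getOrElse(-1L)} ms")
  }
})
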