What is the difference between Spark Structured Streaming and DStreams?

Until Spark 2.2, DStream[T] was the abstract data type for streaming data, which can be viewed as RDD[RDD[T]]. From Spark 2.2 onwards, the Dataset is an abstraction on DataFrame that embodies both batch (cold) and streaming data.

From the docs:

Discretized Streams (DStreams)

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.

API using Datasets and DataFrames

Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point SparkSession (Scala/Java/Python/R docs) to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the DataFrame/Dataset Programming Guide.
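For illustration, here is a minimal sketch of that unification (the schema, the /data/people path, and the column names are assumptions made up for this example): the same transformations work whether the DataFrame comes from read (batch) or readStream (streaming).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder.appName("UnifiedApi").master("local[2]").getOrCreate()

// Streaming file sources require an explicit schema
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

// Batch: a bounded DataFrame (path is illustrative)
val batchDf = spark.read.schema(schema).json("/data/people")

// Streaming: an unbounded DataFrame over the same directory
val streamDf = spark.readStream.schema(schema).json("/data/people")

// Identical operations apply to both
val adultsBatch  = batchDf.filter("age >= 18").select("name")
val adultsStream = streamDf.filter("age >= 18").select("name")
```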


Brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame)

Spark Streaming is based on DStream. A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Spark Streaming has the following problems.

Difficult - it was not simple to build streaming pipelines supporting delivery policies: exactly-once guarantees, handling late data arrival, or fault tolerance. Sure, all of them were implementable, but they needed some extra work on the part of programmers.

Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for streaming processing (DStream). Sure, it was nothing that blocked coding, but it's always simpler (especially in maintenance cost) to deal with as few abstractions as possible.

See the example below.
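A minimal DStream word-count sketch in Scala, assuming a socket text source on localhost:9999 (host, port, and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    // Batch interval of 1 second: each micro-batch becomes one RDD in the DStream
    val ssc = new StreamingContext(conf, Seconds(1))

    // Input DStream from a TCP source
    val lines = ssc.socketTextStream("localhost", 9999)

    // Classic RDD-style transformations, applied to every micro-batch RDD
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the job is stopped
  }
}
```

Note how the transformations are plain RDD operations applied independently per micro-batch - ordering, late data, and delivery guarantees are left to the programmer, which is exactly the difficulty described above.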

Spark Streaming flow diagram :- [image]

Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.

More concretely, structured streaming brought some new concepts to Spark.

exactly-once guarantee - structured streaming focuses on that concept. It means that data is processed only once and the output doesn't contain duplicates.
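As a sketch of how that guarantee is wired up in practice: end-to-end exactly-once needs a replayable source, checkpointing of offsets and state, and an idempotent sink. The rate source and the paths below are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ExactlyOnce").master("local[2]").getOrCreate()

// Any replayable source works; the built-in "rate" test source is used for brevity
val df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

// The checkpoint directory stores offsets and state, so after a failure Spark
// replays from the last committed offsets instead of duplicating output
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/exactly-once/out")                 // illustrative path
  .option("checkpointLocation", "/tmp/exactly-once/ckpt")  // illustrative path
  .start()

query.awaitTermination()
```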

event time - one of the observed problems with DStream streaming was processing order, i.e. the case when data generated earlier was processed after data generated later. Structured streaming handles this problem with a concept called event time that, under some conditions, allows late data to be aggregated correctly in processing pipelines.

Sinks, the Result Table, output modes, and watermarks are other features of Spark Structured Streaming.
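A sketch tying event time, watermarking, and an output mode together (the rate source, window sizes, and thresholds are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder.appName("EventTime").master("local[2]").getOrCreate()

// The built-in "rate" source emits rows with `timestamp` and `value` columns
val events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

// Count rows in 5-minute event-time windows; the 10-minute watermark bounds
// how late data may arrive and still be aggregated, so old state can be dropped
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

// "update" mode emits only the Result Table rows changed since the last trigger
val query = windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()
```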

See the example below.
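A minimal Structured Streaming word-count sketch over the same illustrative socket source as the DStream example above - note that it is plain DataFrame/Dataset code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").master("local[2]").getOrCreate()
import spark.implicits._

// Streaming DataFrame from a TCP source (host and port are illustrative)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Same Dataset API as batch; each line splits into words in the column "value"
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// "complete" mode re-emits the whole Result Table on every trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```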

Spark Structured Streaming flow diagram :- [image]