What is the exact difference between transform and map on a DStream in Spark?

A DStream consists of several RDDs, since every batch interval produces a different RDD.
So by using transform(), you get the chance to apply an RDD operation across the entire DStream.

Example from Spark Docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation


map is an element-wise transformation, while transform is an RDD-to-RDD transformation.

map


map(func) : Return a new DStream by passing each element of the source DStream through a function func.

Here is an example that demonstrates both the map and transform operations on a DStream.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTransformExample")
val ssc = new StreamingContext(conf, Seconds(5))

val rdd1 = ssc.sparkContext.parallelize(Array(1, 2, 3))
val rdd2 = ssc.sparkContext.parallelize(Array(4, 5, 6))
val rddQueue = new Queue[RDD[Int]]
rddQueue.enqueue(rdd1)
rddQueue.enqueue(rdd2)

// oneAtATime = true: consume one queued RDD per batch interval
val numsDStream = ssc.queueStream(rddQueue, true)
val plusOneDStream = numsDStream.map(x => x + 1)
plusOneDStream.print()

ssc.start()
ssc.awaitTermination()

The map operation adds 1 to each element in every RDD within the DStream, producing output such as:

-------------------------------------------
Time: 1501135220000 ms
-------------------------------------------
2
3
4

-------------------------------------------
Time: 1501135225000 ms
-------------------------------------------
5
6
7

-------------------------------------------

transform


transform(func) : Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.

val commonRdd = ssc.sparkContext.parallelize(Array(0))
val combinedDStream = numsDStream.transform(rdd => rdd.union(commonRdd))
combinedDStream.print()

transform allows you to perform RDD operations such as join, union, etc. on the RDDs within the DStream; the example code above produces output such as:

-------------------------------------------
Time: 1501135490000 ms
-------------------------------------------
1
2
3
0

-------------------------------------------
Time: 1501135495000 ms
-------------------------------------------
4
5
6
0

-------------------------------------------
Time: 1501135500000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1501135505000 ms
-------------------------------------------
0
-------------------------------------------

Here commonRdd, which contains the single element 0, is unioned with each of the underlying RDDs within the DStream. Note that once the queue is exhausted (the last two batches above), queueStream yields empty RDDs, so each union produces an RDD containing only 0.


The transform function in Spark Streaming allows you to perform any transformation on the underlying RDDs of the stream. For example, you can join two RDDs in streaming using transform, where one RDD is built from a text file or a parallelized collection and the other RDD comes from a stream over a text file, socket, etc.
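As a sketch of that join use case (the socket source, port, and lookup data here are hypothetical, and a running Spark environment is assumed):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("TransformJoinExample")
val ssc = new StreamingContext(conf, Seconds(5))

// Static lookup RDD built from a parallelized collection (hypothetical data)
val lookupRdd = ssc.sparkContext.parallelize(Seq(("spark", 1), ("streaming", 2)))

// Stream of words arriving over a socket (hypothetical host/port)
val words = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))

// transform exposes each batch's RDD, so any pair-RDD operation works, e.g. join
val joined = words.transform(rdd => rdd.join(lookupRdd))
joined.print()

ssc.start()
ssc.awaitTermination()

A plain map on the DStream could not do this, because join is defined on pair RDDs, not on individual elements.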

map works on each element of the RDD in a particular batch and yields the RDD that results from applying the passed function to every element.


The transform function in Spark Streaming allows one to use any of Apache Spark's transformations on the underlying RDDs of the stream. map is used for an element-to-element transformation and could itself be implemented using transform. Essentially, map works on the elements of the DStream, while transform lets you work with the RDDs of the DStream. You may find http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams useful.
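To illustrate that last point, the two lines below produce equivalent DStreams (a sketch, assuming the numsDStream from the earlier queueStream example):

// Element-wise map on the DStream
val viaMap = numsDStream.map(x => x + 1)

// The same result via transform: apply a plain RDD map inside each batch
val viaTransform = numsDStream.transform(rdd => rdd.map(x => x + 1))

The difference is only in what the function sees: map's function receives one element at a time, while transform's function receives the whole RDD for the batch.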