How to fix "java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord" in Spark Streaming Kafka Consumer?

The ConsumerRecord object is received from the DStream. When you try to print it directly, this error is thrown because ConsumerRecord does not implement java.io.Serializable. Instead, you should extract the values from the ConsumerRecord object and print those.

Instead of stream.print(), do:

stream.map(record => record.value().toString).print()

This should solve your problem.
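
For context, here is a minimal end-to-end sketch, assuming the spark-streaming-kafka-0-10 integration; the broker address, group id, topic name, and batch interval are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("KafkaExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("example-topic"), kafkaParams)
    )

    // Map to plain String values before printing; ConsumerRecord itself
    // is not serializable, but the extracted values are.
    stream.map(record => record.value().toString).print()

    ssc.start()
    ssc.awaitTermination()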

GOTCHA

For anyone else seeing this exception: any call to checkpoint on the stream triggers a persist with storageLevel = MEMORY_ONLY_SER, which requires serializable elements, so don't call checkpoint until after you have mapped the ConsumerRecords to serializable values (see the sketch below).
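
A minimal sketch of the safe ordering, assuming stream is the DStream[ConsumerRecord[String, String]] from the sketch above and that a checkpoint directory has already been set with ssc.checkpoint(...):

    // Wrong: checkpointing the raw stream persists ConsumerRecords
    // with MEMORY_ONLY_SER and throws NotSerializableException.
    // stream.checkpoint(Seconds(10))

    // Right: map to serializable values first, then checkpoint.
    val values = stream.map(record => record.value().toString)
    values.checkpoint(Seconds(10))
    values.print()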


KafkaUtils.createDirectStream returns an org.apache.spark.streaming.dstream.DStream. It is not an RDD; Spark Streaming creates RDDs temporarily as it runs. To retrieve an RDD, use stream.foreachRDD() to get each RDD, then RDD.foreach to get each object in the RDD. Those objects are Kafka ConsumerRecords, on which you use the value() method to read the message from the Kafka topic:

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // value() extracts the serializable message payload
    val value = record.value()
    println(value)
  }
}
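
Note that println inside rdd.foreach runs on the executors, so on a cluster the output appears in the executor logs rather than the driver console; for quick local debugging, rdd.take(10).foreach(println) collects a small sample to the driver instead.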

ConsumerRecord does not implement java.io.Serializable, so operations that require serialization, i.e. persist, window, or print, will fail on it. To avoid the error, register the class with Kryo by adding the config below.

    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    sparkConf.registerKryoClasses(new Class<?>[]{ ConsumerRecord.class });
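
If your application is in Scala, as in the earlier answers, a sketch of the equivalent configuration looks like this (the app name is a placeholder):

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName("KafkaExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array[Class[_]](classOf[ConsumerRecord[_, _]]))

Even with Kryo registered, mapping to the record values early (as in the first answer) is usually the cleaner fix, since it avoids shipping whole ConsumerRecords between stages at all.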