How to read streaming data in XML format from Kafka?

.format("kafka") \
.format('com.databricks.spark.xml') \

The last one with com.databricks.spark.xml wins and becomes the streaming source (hiding Kafka as the source).

In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.

As you may have experienced, the Databricks spark-xml package does not support streaming reads, i.e. it cannot act as a streaming source.

Is there any way I can extract XML data from Kafka topic using structured streaming?

You are left with accessing and processing the XML yourself with a standard function or a UDF. There's no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.

That should not be a big deal anyway. Scala code for this could look as follows.

val input = spark.
  readStream.
  format("kafka").
  ...
  load

val values = input.select('value cast "string")

val extractValuesFromXML = udf { (xml: String) => ??? }
val numbersFromXML = values.withColumn("number", extractValuesFromXML('value))

// print XMLs and numbers to the stdout
val q = numbersFromXML.
  writeStream.
  format("console").
  start
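
The body of the extraction function is left open above (the ??? placeholder). Purely as an illustration, here is a minimal Python sketch of what such a function might do. It assumes each message is a small XML document with a hypothetical <number> child element; the element name, the function name extract_number, and the choice to return None for unusable input are all assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_number(xml_text: Optional[str]) -> Optional[int]:
    """Return the integer held in a (hypothetical) <number> child, or None."""
    if xml_text is None:
        # Kafka message values can be null
        return None
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed XML: return None instead of raising
        return None
    # findtext returns None when there is no direct <number> child
    text = root.findtext("number")
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        # The element exists but does not contain an integer
        return None
```

Returning None for bad input is a design choice: one malformed message then yields a null in the output column rather than an exception inside the UDF. In PySpark the function could be wrapped with udf(extract_number, IntegerType()) and applied to the string-cast value column.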

Another possible solution could be to write your own custom streaming Source that would deal with the XML format in def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.


import xml.etree.ElementTree as ET
df = spark \
      .readStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", "localhost:9092") \
      .option("subscribe", "test") \
      .load()

Then I wrote a python UDF

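def parse(s):
  # Parse one message value (an XML document) and return "actor|role"
  xml = ET.fromstring(s)
  ns = {'real_person': 'http://people.example.com',
      'role': 'http://characters.example.com'}
  # Every prefix used in find() must be a key of ns
  actor_el = xml.find("real_person:actor", ns)
  role_el = xml.find('real_person:role', ns)
  # Compare with None: an element without children is falsy
  actor = (actor_el.text or "") if actor_el is not None else ""
  role = (role_el.text or "") if role_el is not None else ""
  return actor + "|" + role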

Register this UDF

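from pyspark.sql.functions import col, split, udf

extractValuesFromXML = udf(parse)

# Kafka delivers value as binary, so cast it to a string before parsing
XML_DF = df.withColumn("mergedCol", extractValuesFromXML(col("value").cast("string")))

# Split the "actor|role" string back into two columns
AllCol_DF = XML_DF.withColumn("actorName", split(col("mergedCol"), "\\|").getItem(0)) \
     .withColumn("Role", split(col("mergedCol"), "\\|").getItem(1))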
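
A parsing function like this can be tried on a plain string before it runs inside the stream. The sketch below does that without Spark. The sample document, the name parse_actor_role, and the assumption that both actor and role live in the http://people.example.com namespace are made up for illustration; a real topic will have its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample message; both elements are in the default namespace
SAMPLE = """<person xmlns="http://people.example.com">
  <actor>John Cleese</actor>
  <role>Lancelot</role>
</person>"""

# Prefix-to-URI map; the prefix is local to this code, only the URI must match
NS = {"real_person": "http://people.example.com"}


def parse_actor_role(xml_text):
    """Return "actor|role", using an empty string for a missing element."""
    root = ET.fromstring(xml_text)
    actor_el = root.find("real_person:actor", NS)
    role_el = root.find("real_person:role", NS)
    # find() returns None when nothing matches, so test against None
    actor = (actor_el.text or "") if actor_el is not None else ""
    role = (role_el.text or "") if role_el is not None else ""
    return actor + "|" + role


print(parse_actor_role(SAMPLE))  # prints: John Cleese|Lancelot
```

Once the function behaves as expected on sample strings, it can be wrapped with udf and applied to the value column as shown above.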