Apache Spark and NiFi Integration

You can't send data directly to Spark unless you're using Spark Streaming. With traditional batch execution, Spark needs to read the data from some type of storage such as HDFS. The purpose of ExecuteSparkInteractive is to trigger a Spark job to run on data that has already been delivered to HDFS.
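For the batch case, the Spark job itself is just an ordinary read-from-storage job. A minimal sketch (the paths and column name are placeholders for your environment):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the batch case: Spark reads input that NiFi (e.g. via
// PutHDFS) has already landed in HDFS. Paths and column name are placeholders.
object BatchOverNiFiDelivery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BatchOverNiFiDelivery").getOrCreate()

    // Read the files NiFi delivered to the landing directory
    val df = spark.read.option("header", "true").csv("hdfs:///data/landing/events")

    // Example aggregation, written back to HDFS for downstream consumers
    df.groupBy("event_type").count()
      .write.mode("overwrite")
      .parquet("hdfs:///data/output/event_counts")

    spark.stop()
  }
}
```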

If you want to go the streaming route, there are two options:

1) Directly integrate NiFi with Spark Streaming

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
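
That post integrates the two via NiFi's site-to-site protocol and the nifi-spark-receiver library. A rough Scala sketch of the Spark Streaming side, assuming a NiFi output port named "Data For Spark" (the URL, port name, and batch interval are placeholders, not the blog's exact code):

```scala
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Requires the nifi-spark-receiver artifact on the classpath.
object NiFiStreaming {
  def main(args: Array[String]): Unit = {
    // Site-to-site config pointing at a NiFi output port; flow files routed
    // to that port are pulled into Spark as NiFiDataPacket objects.
    val clientConfig = new SiteToSiteClient.Builder()
      .url("http://localhost:8080/nifi")
      .portName("Data For Spark")
      .buildConfig()

    val ssc = new StreamingContext(new SparkConf().setAppName("NiFiStreaming"), Seconds(10))

    val packets = ssc.receiverStream(new NiFiReceiver(clientConfig, StorageLevel.MEMORY_ONLY))

    // Each packet carries the flow file's content (bytes) plus its attributes.
    packets.map(p => new String(p.getContent, "UTF-8")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```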

2) Use Kafka to integrate NiFi and Spark

NiFi writes to a Kafka topic, Spark reads from that topic, Spark writes its results back to a second topic, and NiFi reads from that one. This approach is probably the best option.
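
On the Spark side this is a standard Kafka source/sink job. A minimal Structured Streaming sketch (requires the spark-sql-kafka connector; the broker address, topic names, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object NiFiKafkaSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("NiFiKafkaSpark").getOrCreate()

    // Read the records NiFi publishes (e.g. via PublishKafka) to "nifi-input".
    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "nifi-input")
      .load()

    // Kafka values arrive as binary; cast to string before transforming.
    val out = in.selectExpr("CAST(value AS STRING) AS value")

    // Write results to "spark-output", which NiFi reads (e.g. via ConsumeKafka).
    val query = out.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "spark-output")
      .option("checkpointLocation", "/tmp/spark-kafka-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```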


This might help:

You can do everything in NiFi by following the steps below:

  1. Use ListSFTP to list files from the landing location.
  2. Use an UpdateAttribute processor to assign the absolute file path to an attribute. You can use this attribute in your Spark code, since the processor in the next step supports Expression Language.
  3. Use the ExecuteSparkInteractive processor. Here you can write Spark code (in Python, Scala, or Java) that reads your input file from the landing location (using the absolute-path attribute from step 2) without it flowing through NiFi as a flow file, and performs operations/transformations on that file (use spark.read... to load it into a DataFrame). You can write your output to either a Hive external table or a temp HDFS location; see the sketch after this list.
  4. Use a FetchHDFS processor to read the file from the temp HDFS location and continue with your further NiFi operations.
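
As referenced in step 3, here is a hypothetical Scala code body for the ExecuteSparkInteractive processor, assuming the attribute from step 2 is named absolute.file.path (a name chosen for illustration). NiFi evaluates the Expression Language before handing the code to Livy, and the Livy interactive session already provides the `spark` variable:

```scala
// ${absolute.file.path} is the NiFi attribute set in step 2 (hypothetical name);
// it is substituted by Expression Language before the code reaches Livy.
// `spark` is predefined in the Livy interactive session.
val df = spark.read.option("header", "true").csv("${absolute.file.path}")

// Example transformation: drop rows with missing values.
val transformed = df.na.drop()

// Write to a temp HDFS location (or a Hive external table) per step 3;
// step 4 then picks this output up.
transformed.write.mode("overwrite").parquet("/tmp/nifi/spark-output")
```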

To run Spark code from NiFi this way (through ExecuteSparkInteractive), you need a Livy setup. Look into how to set up Livy and the NiFi controller services needed to use Livy within NiFi.

Good Luck!!