Kafka -> Flink DataStream -> MongoDB

There is currently no Streaming MongoDB sink available in Flink.

However, there are two ways to write data into MongoDB:

  • Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the Batch API) with streaming. Using the HadoopOutputFormatWrapper of Flink, you can use the official MongoDB Hadoop connector.

  • Implement the sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library; a minimal sketch follows this list.
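
For the second option, here is a minimal sketch of a custom sink, assuming the MongoDB Java sync driver (`MongoClients`) and a `DataStream<Document>`; the connection URI, database, and collection names are placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.bson.Document;

/** One MongoDB connection per parallel sink instance, one insert per record. */
public class MongoDBSink extends RichSinkFunction<Document> {

    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public void open(Configuration parameters) {
        // Called once per parallel subtask, not per record.
        client = MongoClients.create("mongodb://localhost:27017"); // placeholder URI
        collection = client.getDatabase("mydb").getCollection("events"); // placeholder names
    }

    @Override
    public void invoke(Document value) {
        collection.insertOne(value);
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}
```

You attach it with `stream.addSink(new MongoDBSink());`. Anything more robust (batching, retries, error handling) would need additional work on top of this.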

Neither approach provides any sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled) you'll have at-least-once semantics: in case of a failure, the data is streamed again to the MongoDB sink. If you're doing idempotent updates, replaying these updates shouldn't cause any inconsistencies.
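
For example, the `insertOne()` in the sketch above could be replaced by an upsert keyed on a deterministic `_id` (a business key, or the Kafka partition/offset), so a replayed record simply overwrites the document it wrote before instead of creating a duplicate. `ReplaceOptions` and `Filters` are part of the MongoDB Java driver; the key field here is just an assumption for illustration:

```java
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

// Drop-in replacement for invoke() in the sink sketch above:
// upsert on a deterministic key so retries are idempotent.
@Override
public void invoke(Document value) {
    Object key = value.get("_id"); // e.g. a business key or Kafka partition/offset
    collection.replaceOne(
            Filters.eq("_id", key),
            value,
            new ReplaceOptions().upsert(true));
}
```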

If you really need exactly-once semantics for MongoDB, you should probably file a JIRA issue with the Flink project and discuss with the community how to implement it.


As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors (for example a Kafka Connect MongoDB sink) to drop the contents of a topic into your MongoDB database.

Kafka -> Flink -> Kafka -> Mongo/Anything

With this approach you can maintain the at-least-once semantics behaviour; a sketch of the Flink-to-Kafka step follows.
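
A rough sketch of the Flink part of such a pipeline, assuming the Flink Kafka connector (`FlinkKafkaConsumer` / `FlinkKafkaProducer`) and string records; broker addresses, topic names, and the group id are placeholders, and the exact connector class names vary between Flink versions:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaToKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // checkpointing is needed for the at-least-once guarantee

        Properties consumerProps = new Properties();
        consumerProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        consumerProps.setProperty("group.id", "flink-job");               // placeholder group

        Properties producerProps = new Properties();
        producerProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker

        DataStream<String> input = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), consumerProps));

        // ... your transformations ...
        DataStream<String> results = input; // placeholder for the real pipeline

        // Write the results back to Kafka; a Kafka Connect MongoDB sink connector
        // (or similar) then moves them from "output-topic" into MongoDB.
        results.addSink(new FlinkKafkaProducer<>(
                "output-topic", new SimpleStringSchema(), producerProps));

        env.execute("Kafka -> Flink -> Kafka");
    }
}
```

The Kafka-to-MongoDB step itself is then configuration of the connector rather than Flink code.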