Apache Spark vs Apache Ignite

Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time.Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability.

Apache Spark(cluster computing framework) is a fast, in-memory data processing engine with expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited for high-performance computing and machine learning algorithms.

Some conceptual differences:

Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.

Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)

Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.

Conclusion:

Ignite and Spark are both in-memory computing solutions but they target different use cases.

In many cases, they are used together to achieve superior results:

Ignite can provide shared storage, so the state can be passed from one Spark application or job to another.

Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x (spark doesn’t index the data)

When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications


I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations.

Another common use for Ignite is distributed caching, which is often used to improve performance of applications that interact with relational databases or any other data sources.


Apache Spark is a processing framework. You tell it where to get data, provide some code about how to process that data, and then tell it where to put the results. It's a way to easily reliably run computing logic across a bunch of nodes in a cluster on data from any source (which is then kept in-memory during processing). It's primarily meant for large-scale analysis on data from various sources (even from multiple databases at once), or from streaming sources like Kafka. It can also be used for ETL, like transforming and joining data together before putting the final results in some other database system.

Apache Ignite is more of an in-memory distributed database, at least that's how it started. It has a key/value and SQL API, so you can store and read data in various ways, and run queries like you would any other SQL database. It also supports running your own code (similar to Spark) so you can do processing that wouldn't really work with SQL, while also reading and writing the data all in the same system. It also can read/write data to other database systems while acting as a cache layer in the middle. Eventually, as of 2018, it also supports on-disk storage so now you can use it as an all-in-one distributed database, cache, and processing framework.


Apache Spark is still better for more complex analytics, and you can have Spark read data from Apache Ignite, but for many scenarios it's now possible to consolidate processing and storage into a single system with Apache Ignite.


Does Spark and Ignite works together?

Yes, Spark and Ignite works together.

Spark and Ignite

In short

Ignite vs. Spark

Ignite is an in-memory distributed database more focused on data storage and handles transnational updates on data, then serves client requests. Apache Spark is an MPP compute engine which is more inclined towards analytics, ML, Graph, and ETL specific payloads.


In detail

Apache Spark is an OLAP tool

Apache Spark is a general-purpose cluster computing system. It's an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark with other components

Sparks with other tools

Deployment topology

Spark deployment topology

Spark on YARN typology is discussed here.


Apache Ignite is an OLTP tool

Ignite is a memory-centric distributed database, caching, and processing platform for transnational, analytical, and streaming workloads delivering in-memory speeds at the petabyte scale. Ignite also includes first-class level support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources.

SQL Overview

Ignite SQL Overview

Deployment topology

Ignite SQL Overview