When to prefer Hadoop MapReduce over Spark?

You should prefer Hadoop MapReduce over Spark if:

  1. You have to query historical data that sits in huge volumes (terabytes to petabytes) on a large cluster.
  2. You are not bothered about job completion time - whether a job finishes in hours versus minutes does not matter to you.
  3. Your data does not fit in memory. Hadoop MapReduce is designed for on-disk processing, whereas Apache Spark performs best when the working set fits in memory, particularly on dedicated clusters.
  4. Hadoop MapReduce can be the more economical option, thanks to Hadoop-as-a-Service (HaaS) offerings and a larger pool of experienced personnel.
  5. Both Apache Spark and Hadoop MapReduce are fault tolerant, but MapReduce is comparatively more so: it persists intermediate results to disk, so a failed task is simply re-run, whereas Spark may have to recompute a longer lineage of in-memory transformations.

On the other hand, Spark's major use cases over Hadoop MapReduce are:

  1. Iterative Algorithms in Machine Learning (see the sketch after this list)
  2. Interactive Data Mining and Data Processing
  3. Data warehousing: Spark SQL is largely compatible with Apache Hive and can run queries up to 100x faster than Hive for workloads that fit in memory.
  4. Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis
  5. Sensor data processing: Where data is fetched and joined from multiple sources
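
The first use case is where Spark's in-memory caching really pays off. As a rough sketch (the input path, the three-feature line format, the step size and the iteration count are all made-up assumptions for illustration), an iterative gradient-descent-style job in Spark parses its input once, caches it, and then loops over the cached data, while MapReduce would re-read the input from HDFS on every pass:

```scala
// Sketch only: the path, "label f1 f2 f3" line format and step size are assumptions.
// Parse once, then keep the parsed points in memory across iterations.
val points = sc.textFile("hdfs:///data/points.txt")
  .map { line =>
    val parts = line.split(" ").map(_.toDouble)
    (parts.head, parts.tail)                       // (label, features)
  }
  .cache()                                         // reused by every iteration below

var weights = Array.fill(3)(0.0)                   // assumes 3 features per point
for (i <- 1 to 10) {                               // each pass reuses the cached RDD
  val gradient = points.map { case (label, features) =>
    val prediction = features.zip(weights).map { case (x, w) => x * w }.sum
    features.map(_ * (prediction - label))         // per-point gradient contribution
  }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  weights = weights.zip(gradient).map { case (w, g) => w - 0.1 * g }
}

println(weights.mkString(", "))
```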



Spark is a great improvement over traditional MapReduce.

When would you use MapReduce over Spark?

When you have a legacy program written in the MapReduce paradigm that is so complex that you do not want to reprogram it. Also, if your problem is not about analyzing data, then Spark might not be right for you. One example I can think of is web crawling: there is a great Apache project called Apache Nutch that is built on Hadoop, not Spark.

When would I use Spark over MapReduce?

Ever since 2012... Ever since I started using Spark I haven't wanted to go back. It has also been a great motivation to expand my knowledge beyond Java and to learn Scala. A lot of the operations in Spark take fewer characters to complete, and the Scala REPL makes it much faster to produce working code. Hadoop has Pig, but then you have to learn "Pig Latin", which will never be useful anywhere else...
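
To give a rough sense of that brevity, here is a word count typed straight into spark-shell (the input path is just a placeholder); the classic MapReduce equivalent needs a mapper class, a reducer class and a driver:

```scala
// In spark-shell, `sc` is already provided; the HDFS path is a placeholder.
val counts = sc.textFile("hdfs:///data/sample.txt")
  .flatMap(_.split("\\s+"))        // split lines into words
  .map(word => (word, 1))          // pair each word with a count of 1
  .reduceByKey(_ + _)              // sum counts per word

counts.take(10).foreach(println)
```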

If you want to use Python libraries in your data analysis, I find it easier to get Python working with Spark than with MapReduce. I also REALLY like using something like IPython Notebook. Just as Spark pushed me to learn Scala when I started, using IPython Notebook with Spark motivated me to learn PySpark. It doesn't have all the functionality, but most of the gap can be made up with Python packages.

Spark also now features Spark SQL, which is backward compatible with Hive. This lets you use Spark to run SQL-like queries. I think this is much better than trying to learn HiveQL, which is just different enough that everything you find is specific to it. With Spark SQL, you can usually get away with applying general SQL advice to solve issues.
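
As a minimal sketch of what that looks like in spark-shell, assuming a Spark 1.x-style HiveContext and a pre-existing Hive table named web_logs with a page column (both are assumptions for illustration):

```scala
import org.apache.spark.sql.hive.HiveContext

// Query an existing Hive table through Spark SQL; `sc` is the shell's SparkContext.
// Table name `web_logs` and its `page` column are illustrative assumptions.
val hiveContext = new HiveContext(sc)

val topPages = hiveContext.sql(
  """SELECT page, COUNT(*) AS hits
    |FROM web_logs
    |GROUP BY page
    |ORDER BY hits DESC
    |LIMIT 10""".stripMargin)

topPages.collect().foreach(println)
```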

Lastly, Spark also has MLlib for machine learning, which is a great improvement over Apache Mahout.
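For example, a k-means clustering run in MLlib is only a few lines (the input path, k = 3 and 20 iterations below are illustrative assumptions, not recommended settings):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// k-means over space-separated numeric feature files; path and parameters are assumptions.
val data = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)   // (data, k, maxIterations)
model.clusterCenters.foreach(println)
```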

Biggest Spark issue: the internet is not yet full of troubleshooting tips. Since Spark is new, documentation about common problems is a little lacking... It's a good idea to buddy up with someone from AMPLab/Databricks (AMPLab created Spark at UC Berkeley, and Databricks is the company its creators founded) and use their forums for support.