Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

Here is an answer to "How does Impala compare to Shark?" from Reynold Xin, the leader of the Shark development effort at UC Berkeley AMPLab.


Comparing Hive with Impala, Spark or Drill sometimes seems inappropriate to me. The goals behind Hive and these tools were different. Hive was never built for real-time, in-memory processing. It is based on MapReduce and was built for offline batch processing. It is best suited to long-running jobs that perform data-heavy operations, such as joins on very large datasets.

On the other hand, these tools were developed with real-time querying in mind. Go for them when you need to query, in real time, data that is not very large and can fit into memory. I'm not saying you can't run queries on your big data using these tools, but you would be pushing the limits if you ran real-time queries on PBs of data, IMHO.

Quite often you will have seen (or read) that a particular company has several PBs of data and successfully serves the real-time needs of its customers. But these companies are not actually querying their entire data most of the time. So the important thing is proper planning: knowing when to use what. I hope you get the point I'm trying to make.
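To make the "fits into memory" point a bit more concrete, here is a rough back-of-the-envelope check in Python. It is only a sketch: the function name, the cluster numbers and the `usable_fraction` of 0.5 are all my own assumptions for illustration, not properties of Impala, Shark or Drill. The idea is simply to compare the part of the data you actually query (the working set) with the memory the cluster can realistically give to queries.

```python
GB = 1024 ** 3
PB = 1024 ** 5


def fits_in_memory(working_set_bytes, nodes, mem_per_node_bytes, usable_fraction=0.5):
    """Return True if the working set fits in the memory available for queries.

    usable_fraction is a hypothetical share of each node's RAM that is left
    for query processing after the OS and other services take theirs.
    """
    if nodes <= 0 or mem_per_node_bytes <= 0:
        raise ValueError("nodes and mem_per_node_bytes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_bytes = nodes * mem_per_node_bytes * usable_fraction
    return working_set_bytes <= usable_bytes


# Hypothetical cluster: 20 nodes with 64 GB each, half of it usable for queries.
recent_slice_ok = fits_in_memory(500 * GB, nodes=20, mem_per_node_bytes=64 * GB)
whole_warehouse_ok = fits_in_memory(2 * PB, nodes=20, mem_per_node_bytes=64 * GB)
```

With these made-up numbers, a 500 GB slice of recent data fits, while the full 2 PB warehouse does not, which is the same "they are not querying their entire data" situation described above.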

Coming back to your actual question: in my view it is hard to provide a reasonable comparison at this time, since most of these projects are far from complete. They are not production-ready yet, unless you are willing to do some (or maybe a lot of) work on your own. And each of these projects has certain goals that are very specific to that particular project.

For example, Impala was developed to take advantage of existing Hive infrastructure so that you don't have to start from scratch. It uses the same metadata that Hive uses. Its goal was to run real-time queries on top of your existing Hadoop warehouse. Drill, by contrast, was developed to be a "not only Hadoop" project and to provide distributed query capabilities across multiple big data platforms, including MongoDB, Cassandra, Riak and Splunk. Shark is compatible with Apache Hive, which means that you can query it using the same HiveQL statements as you would through Hive. The difference is that Shark can return results up to 30 times faster than the same queries run on Hive.

Impala is doing well at present and some folks have been using it, but I'm not that confident about the other two. All these tools are good, but a fair comparison can be made only after you try them on your data and for your processing needs. In my experience, Impala would be the best bet at this moment. I am not saying the other tools are not good, but they are not yet mature enough. However, if you wish to use Impala with your already running Hadoop cluster (Apache's Hadoop, for example), you might have to do some additional work, as almost everybody uses Impala as a CDH feature.

Note: All of this is based solely on my experience. If you find something wrong or inappropriate, please do let me know. Comments and suggestions are welcome. And I hope this answers some of your queries.