Why Presto is faster than Spark SQL

In general, it is hard to say whether Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you are executing, the environment, and the engine tuning parameters. However, what I see in the industry (Uber and Netflix, for example) is that Presto is used for ad-hoc SQL analytics whereas Spark is used for ETL/ML pipelines.

One possible explanation is that there is not much overhead in scheduling a query in Presto: the Presto coordinator is always up and waiting for queries. Spark, on the other hand, takes a lazy approach: it takes time for the driver to negotiate resources with the cluster manager, copy JARs, and start processing.
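To make that difference concrete, here is a minimal sketch of the two submission paths, assuming the presto-python-client and PySpark packages; the host, catalog, schema, and table names are placeholders, not anything from the original post. The Presto query only needs a connection to the already-running coordinator, while the Spark query first pays the cost of starting a session against the cluster manager.

```python
# Minimal sketch; host, catalog, schema, and table names are placeholders.

import prestodb                        # pip install presto-python-client
from pyspark.sql import SparkSession

# Presto: the coordinator is already running, so issuing a query is just a
# connection plus an HTTP round trip -- no per-query cluster startup cost.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")   # hypothetical table
print(cur.fetchall())

# Spark SQL: before the first query can run, the driver has to negotiate
# executors with the cluster manager, ship JARs, and start executor JVMs.
spark = (SparkSession.builder
         .appName("adhoc-query")
         .getOrCreate())                      # resource negotiation happens here
spark.sql("SELECT count(*) FROM events").show()
```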

Another one is that the Presto architecture is quite straightforward: it has a coordinator that does SQL parsing, planning, and scheduling, and a set of workers that execute the physical plan.


Spark core, on the other hand, has many more layers in between. Besides the stages that Presto also has, Spark SQL has to cope with the resiliency built into RDDs and do resource management and negotiation for its jobs.


Please also note that Spark SQL has a cost-based optimizer (CBO) that performs better on complex queries, while Presto (0.199) has a legacy rule-based optimizer. There is an ongoing effort to bring a CBO to Presto, which might potentially beat Spark SQL performance.
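For reference, enabling Spark's CBO is just configuration plus statistics collection. A minimal sketch follows; the table and column names are placeholders, and `explain(mode="cost")` assumes Spark 3.x.

```python
# Minimal sketch of turning on Spark SQL's cost-based optimizer.
# Table and column names are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cbo-demo")
         .config("spark.sql.cbo.enabled", "true")              # enable the CBO
         .config("spark.sql.cbo.joinReorder.enabled", "true")  # cost-based join reordering
         .getOrCreate())

# The CBO can only cost plans if table/column statistics have been collected.
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS order_id, customer_id")

# With statistics in place, multi-join queries get cost-based plans.
spark.sql("""
    SELECT c.customer_id, count(*) AS n
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
""").explain(mode="cost")   # prints the plan with row-count/size estimates (Spark 3.x)
```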


I think the key difference is that the architecture of Presto is very similar to an MPP SQL engine. That means it is highly optimized just for SQL query execution, whereas Spark is a general-purpose execution framework that is able to run multiple different workloads such as ETL, machine learning, etc.

In addition, one trade-off Presto makes to achieve lower latency for SQL queries is not to care about mid-query fault tolerance. If one of the Presto worker nodes experiences a failure (say, shuts down), in most cases queries that are in progress will abort and need to be restarted. Spark, on the other hand, supports mid-query fault tolerance and can recover from such a situation, but in order to do that it needs to do some extra bookkeeping and essentially "plan for failure". That overhead results in slower performance when your cluster does not experience any faults.


Position: Presto emphasizes querying, whereas Spark emphasizes computation.

Memory storage: Both store data and compute in memory, but Spark will spill data to disk when it cannot get enough memory, whereas Presto runs out of memory and the query fails (see the sketch after this list).

Tasks, resources: Spark submits tasks and applies for resources in real time at each stage (this strategy can result in slightly slower processing compared to Presto); Presto applies for all required resources and submits all tasks at once.

Data processing: In Spark, a stage's data needs to be fully processed before it is passed to the next stage. Presto uses a pipelined, page-based processing model: as soon as a page is finished, it can be sent to the next task (this approach greatly reduces the end-to-end response time of queries).

Data fault tolerance: If Spark fails or loses data, it will be recomputed from the RDD lineage, whereas in Presto the query simply fails.
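To illustrate the memory point above in configuration terms (memory sizes and the table name below are placeholders, and the Presto property values are only examples): Spark's executors spill shuffle and aggregation data to local disk when memory runs short, while Presto bounds each query with cluster-wide memory properties and fails queries that exceed them.

```python
# Minimal sketch of the memory behaviour described above.
# Memory sizes and the table name are placeholders.

from pyspark.sql import SparkSession

# Spark: if an aggregation or shuffle does not fit in executor memory,
# it spills to local disk instead of failing the query.
spark = (SparkSession.builder
         .appName("spill-demo")
         .config("spark.executor.memory", "2g")   # deliberately small
         .getOrCreate())
spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id").show()

# Presto: the same query is bounded by cluster-wide limits such as
#   query.max-memory=5GB
#   query.max-memory-per-node=1GB
# in the coordinator's etc/config.properties; a query that exceeds them is
# killed with a memory-limit-exceeded error rather than spilling to disk.
```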