Spark: What is the time complexity of the connected components algorithm used in GraphX?

I submit that the primary goal of a graph vs. a non-graph solution is to reduce the number of sequential steps required to solve the problem. This is different from complexity -- in fact, a graph solution may take more total CPU instructions to perform, yet still be the right solution if it reduces the number of sequential steps.

In terms of finding connected components, both the breadth- and depth-first approaches have the same number of sequential steps -- i.e. some multiple of the number of vertices in the graph. The same logic has to be applied sequentially to each vertex. That's the whole solution.
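To make the sequential cost concrete, here is a minimal sketch of a breadth-first connected-components pass in plain Python. The function name `bfs_components`, the adjacency-dict representation, and the `steps` counter are all assumptions for illustration; nothing here comes from Spark.

```python
from collections import deque

def bfs_components(adjacency):
    """Label every vertex with the smallest vertex id in its component.

    `adjacency` is a hypothetical dict mapping vertex id -> list of neighbors.
    Returns the labels and the number of vertices processed one after another.
    """
    labels = {}
    steps = 0  # one step per vertex taken off the queue
    for start in sorted(adjacency):
        if start in labels:
            continue  # already reached from a smaller vertex id
        labels[start] = start
        queue = deque([start])
        while queue:
            v = queue.popleft()
            steps += 1
            for n in adjacency[v]:
                if n not in labels:
                    labels[n] = start
                    queue.append(n)
    return labels, steps

# Two components: 1-2-3 and 4-5
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, steps = bfs_components(graph)
```

Every vertex is dequeued exactly once, so `steps` always equals the number of vertices, whatever the shape of the graph.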

Even if your graph has two more-or-less equal-sized clusters, you can't divide the work between two workers and have them start at opposite ends and meet in the middle. You don't know where the ends are. You don't know where the middle is.

If you knew going in what you know coming out, your total number of sequential steps could be reduced to half. If it helps, you can think about this as the theoretical best you can do in terms of sequential steps. And it is completely dependent on the shape of your graph.

If you have lots of discrete, unattached clusters, and no cluster is bigger than 10 people, then the theoretical best you could do is 10 sequential steps. No matter how much parallel processing power you had, the best you can do is 10 sequential steps.

A graph algorithm doesn't just get you closer to the theoretical minimum -- depending on the shape of your clusters, it actually beats it.

So how does the Spark algorithm work? It's fairly simple -- each node just broadcasts its VertexId to its neighbors, and its neighbors do the same. Any node that receives a VertexId lower than its own adopts it and broadcasts it in the next round; otherwise the node goes silent.
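The rounds described above can be simulated in a single Python process. This is only a sketch of the message-passing idea, not GraphX's actual code; `propagate_min_ids`, the adjacency dict, and the `inbox` structure are hypothetical names chosen for the example.

```python
def propagate_min_ids(adjacency):
    """Simulate rounds of 'send the lowest id you know to your neighbors'.

    `adjacency` is a hypothetical dict mapping vertex id -> list of neighbors.
    """
    labels = {v: v for v in adjacency}  # each vertex starts with its own id
    active = set(adjacency)             # in the first round everyone broadcasts
    while active:
        # Collect the lowest id sent to each vertex during this round.
        inbox = {}
        for v in active:
            for n in adjacency[v]:
                inbox[n] = min(inbox.get(n, labels[v]), labels[v])
        # Only vertices that learned a lower id speak in the next round.
        active = set()
        for v, received in inbox.items():
            if received < labels[v]:
                labels[v] = received
                active.add(v)
    return labels

# Two components: 1-2-3 and 4-5
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels = propagate_min_ids(graph)
```

Within a round, every vertex's work is independent of the others, which is what lets a distributed engine spread it across workers.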

If you have a cluster where each of the vertices is connected to every other vertex, then after one round of messages each one knows what the lowest VertexId is, and they all go silent the next round. One sequential step for the entire cluster.

If, on the other hand, each vertex in the cluster is only connected to at most 2 other vertices, then it could take N sequential steps (where N is the number of vertices in the cluster) before all the vertices know what the minimum VertexId is.
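A small experiment illustrates the two extremes. The sketch below uses a simplified synchronous variant, in which every vertex takes the minimum of its own label and its neighbors' labels each round, and counts the rounds in which at least one label changes. The helper names `rounds_to_converge`, `complete_graph`, and `path_graph` are made up for this example, and the path puts the lowest id at one end, which is the worst case.

```python
def rounds_to_converge(adjacency):
    """Count the rounds in which at least one vertex lowers its label."""
    labels = {v: v for v in adjacency}
    rounds = 0
    while True:
        updated = {
            v: min([labels[v]] + [labels[n] for n in adjacency[v]])
            for v in adjacency
        }
        if updated == labels:
            return rounds  # nothing changed, so every vertex is silent
        labels = updated
        rounds += 1

def complete_graph(n):
    # Every vertex is connected to every other vertex.
    return {v: [u for u in range(1, n + 1) if u != v] for v in range(1, n + 1)}

def path_graph(n):
    # Vertices 1..n in a line; each has at most two neighbors.
    return {v: [u for u in (v - 1, v + 1) if 1 <= u <= n] for v in range(1, n + 1)}

complete_rounds = rounds_to_converge(complete_graph(8))
path_rounds = rounds_to_converge(path_graph(8))
```

With 8 vertices, the fully connected cluster settles after a single round, while the chain needs 7, because the lowest id can only travel one hop per round.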

Obviously the sequential steps are of a different nature in the graph algorithm, and even different from graph to graph. A well-connected graph will generate a lot of messages and spend more time merging them, etc. But it won't take as many sequential steps as a less well-connected graph.

Long story short, the performance of the graph solution is completely dependent on the shape of the graph, but it should parallelize much, much better than a breadth- or depth-first solution.