How is pagerank calculated in a distributed way?

The state-of-the-art way of calculating PageRank is with Google's Pregel framework. They almost certainly have something more sophisticated by now, but that is the latest published effort.

You can read more details about it on the Google research blog, or in the published Pregel paper.

I'm working on an open source implementation of the Bulk Synchronous Parallel paradigm called Apache Hama. There is also Apache Giraph, which focuses solely on graph use cases, and plenty of others.
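
To make the vertex-centric BSP model concrete, here is a minimal, self-contained Python simulation: every vertex holds its current rank, sends rank/out-degree messages along its out-links, and a global barrier separates supersteps. The names and structure are my own illustration, not the actual Giraph or Hama API, and d = 0.85 is the usual textbook damping factor.

def bsp_pagerank(outlinks, supersteps=30, d=0.85):
    n = len(outlinks)
    value = {v: 1.0 / n for v in outlinks}        # every vertex starts uniform
    inbox = {v: [] for v in outlinks}
    for step in range(supersteps):
        outbox = {v: [] for v in outlinks}
        for v, targets in outlinks.items():       # each vertex's compute() step
            if step > 0:                          # superstep 0 only scatters
                value[v] = (1 - d) / n + d * sum(inbox[v])
            for t in targets:
                outbox[t].append(value[v] / len(targets))
        inbox = outbox                            # barrier: messages are
    return value                                  # delivered next superstep

print(bsp_pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))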

As mfrankli mentioned, the MapReduce framework (Apache Hadoop, for example) can also be used to calculate PageRank, but it is not efficient for iterative algorithms.

The noteworthy thing to add is that both solutions (MapReduce and BSP) are batch solutions, so they can be used to recalculate PageRank for the complete web graph. Since Google's index changes much faster than batch algorithms can run over the whole graph, you can expect that they frequently recalculate PageRank on subgraphs.


Let

| 0 0 0 1 0 |
| 0 0 0 1 0 |
| 0 0 0 1 1 |
| 1 1 1 0 0 |
| 0 0 1 0 0 |

be the adjacency matrix of a graph. Then the transition matrix M in PageRank, obtained by dividing each column by the out-degree of the corresponding node (i.e., by the column sum), will be

| 0 0   0 1/3 0 |
| 0 0   0 1/3 0 |
| 0 0   0 1/3 1 |
| 1 1 1/2   0 0 |
| 0 0 1/2   0 0 |

which is column stochastic and irreducible. (Strictly speaking, this particular graph is bipartite between {1, 2, 3} and {4, 5}, so M is not aperiodic; full PageRank mixes in a damping factor precisely to make the matrix aperiodic and guarantee that the power iteration converges.)
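
As a quick sanity check (and a preview of what the reducers below will compute), here is a short Python snippet that multiplies M, using exact fractions, with the uniform starting vector (1/5, ..., 1/5):

from fractions import Fraction as F

# Transition matrix M from above; M[i][j] is the probability of moving
# from node j+1 to node i+1 (0-based indices, 1-based node labels).
M = [
    [F(0), F(0), F(0),    F(1, 3), F(0)],
    [F(0), F(0), F(0),    F(1, 3), F(0)],
    [F(0), F(0), F(0),    F(1, 3), F(1)],
    [F(1), F(1), F(1, 2), F(0),    F(0)],
    [F(0), F(0), F(1, 2), F(0),    F(0)],
]

v = [F(1, 5)] * 5  # uniform initial PageRank
v = [sum(M[i][j] * v[j] for j in range(5)) for i in range(5)]
print(v)  # [1/15, 1/15, 4/15, 1/2, 1/10] = [2/30, 2/30, 8/30, 15/30, 3/30]

The result matches the reducer outputs of the first iteration below.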

MapReduce starts from here. The serialized input for the mappers will look like

1 -> 4
2 -> 4
3 -> 4 , 5
4 -> 1 , 2 , 3
5 -> 3

and the mappers will emit the following (a structure record carrying the out-link list, plus one contribution of 1/out-degree per out-link):

< 1 , [4] >
< 4 , 1 >

< 2 , [4] >
< 4 , 1 >

< 3 , [4 , 5] >
< 4 , 1/2 >
< 5 , 1/2 >

< 4 , [1, 2, 3] >
< 1 , 1/3 >
< 2 , 1/3 >
< 3 , 1/3 >

< 5 , [3] >
< 3 , 1 >
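
In code, such a mapper could look like the following sketch (Python; the exact record encoding is up to you). One deliberate difference: instead of emitting bare 1/out-degree fractions and letting the reducer multiply by the initial 1/5, this version always emits rank/out-degree, which yields the same sums and gives every iteration the same shape.

def pagerank_map(node, adjacency, rank):
    # Emit the graph structure so the reducer can pass it to the next iteration.
    yield node, adjacency                       # < node , [out-links] >
    # Split this node's current rank evenly over its out-links.
    for target in adjacency:
        yield target, rank / len(adjacency)     # < target , contribution >

# e.g. for node 3 with the initial rank 1/5:
# list(pagerank_map(3, [4, 5], 1/5)) == [(3, [4, 5]), (4, 0.1), (5, 0.1)]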

Mapper outputs will be grouped by key and consumed by the reducers. If we have 5 reducers, it will look like this (the factor 1/5 is the uniform initial PageRank of each of the 5 nodes):

R1 takes [4]       , 1/3           then computes 1/5*(1/3)           =  2/30
R2 takes [4]       , 1/3           then computes 1/5*(1/3)           =  2/30
R3 takes [4, 5]    , 1/3 , 1       then computes 1/5*(1/3 + 1)       =  8/30
R4 takes [1, 2, 3] ,   1 , 1 , 1/2 then computes 1/5*(  1 + 1 + 1/2) = 15/30
R5 takes [3]       , 1/2           then computes 1/5*(1/2)           =  3/30
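
A matching reducer sketch: it separates the structure record (the out-link list) from the numeric contributions and sums the latter. With the rank/out-degree convention from the mapper sketch above, the 1/5 factor is already folded into the contributions, so the reducer only needs to sum.

def pagerank_reduce(node, values):
    adjacency, total = [], 0.0
    for value in values:
        if isinstance(value, list):   # the structure record: [out-links]
            adjacency = value
        else:                         # a rank contribution from some in-link
            total += value
    # Emit the new rank together with the out-links for the next iteration.
    yield node, adjacency, total

# e.g. for node 4, with contributions 1/5, 1/5 and 1/10 from nodes 1, 2 and 3:
# list(pagerank_reduce(4, [[1, 2, 3], 0.2, 0.2, 0.1])) yields (4, [1, 2, 3], 0.5),
# i.e. 15/30, matching R4 above.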

Now the first (power) iteration is over. In the following map-reduce jobs, the mappers emit the same kinds of records; however, the newly computed PR values, split over each node's out-links, are emitted instead of the bare 1/out-degree fractions:

< 1 , [4] >
< 4 , 2/30 >

< 2 , [4] >
< 4 , 2/30 >

< 3 , [4 , 5] >
< 4 , 4/30 >
< 5 , 4/30 >

< 4 , [1, 2, 3] >
< 1 , 5/30 >
< 2 , 5/30 >
< 3 , 5/30 >

< 5 , [3] >
< 3 , 3/30 >

Repeat these map-reduce jobs until the values converge enough, or until you are satisfied.
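
Finally, a local end-to-end sketch of the whole loop, with the shuffle stage simulated by a dictionary. As noted under the matrix, this particular graph is bipartite, so the undamped iteration above oscillates between two vectors instead of converging; the sketch therefore includes the standard damping factor d = 0.85 (the ingredient the worked example leaves out) and stops when the L1 change falls below a tolerance.

from collections import defaultdict

def pagerank(outlinks, d=0.85, tol=1e-10):
    n = len(outlinks)
    rank = {v: 1.0 / n for v in outlinks}
    while True:
        contribs = defaultdict(float)            # shuffle: group by target node
        for v, targets in outlinks.items():      # the map phase
            for t in targets:
                contribs[t] += rank[v] / len(targets)
        # The reduce phase, with damping: teleport term + damped contributions.
        new_rank = {v: (1 - d) / n + d * contribs[v] for v in outlinks}
        if sum(abs(new_rank[v] - rank[v]) for v in outlinks) < tol:
            return new_rank
        rank = new_rank

graph = {1: [4], 2: [4], 3: [4, 5], 4: [1, 2, 3], 5: [3]}
print(pagerank(graph))  # close to (1/8, 1/8, 1/4, 3/8, 1/8), nudged toward
                        # uniform by the (1 - d) / n teleport term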