Representation of Large Graph with 100 million nodes in C++

Preliminary remarks

You could think of using vectors of vectors instead of using dynamic memory allocation:

vector<vector<int>> AdjList(V);

In any case, you'll have V different vector<int> in your adjacency list. Every vector needs some space overhead to manage the size and the location of its items. Unfortunately, you double this overhead (and the associated hidden memory management when adding new links) by keeping the weights in a separate vector/array.

So why not combine the adjacency list and the weights?

struct Link {  
   int target;   // node number that was in adj list.  Hope none is negative!!
   int weight;   
};
vector<vector<Link>> AdjList(V);

Is the structure sparse ?

If the large majority of nodes have at least one link, this is quite fine.

If, on the contrary, many nodes have no outgoing link (or if you have large unused node-id ranges), then you could consider:

map<int, vector<Link>> AdjList;  

The map is an associative array: there would be a vector only for the nodes that have outgoing links. By the way, you could use any numbering scheme you want for your nodes, even negative numbers.

You could even go a step further, and use a double map. The first map gives you the outgoing nodes. The second map maps the target node to the weight:

map<int, map<int, int>> Oulala; 

But this risks being much more memory-intensive.

Big volumes ?

map and vector manage memory dynamically using the default allocator. But you have lots of small objects of predetermined size, so you could consider using your own allocator. This could reduce memory-management overhead significantly.

Also, if you use vectors, when you load the adjacency list of a new node it can be worthwhile to reserve the vector's size immediately (if you know it). This avoids several successive reallocations as the vector grows, which, with millions of nodes, could be very expensive.

Libraries ?

The search for third-party libraries is out of scope on SO. But if the above tips are not sufficient, you could consider using an existing graph library, such as:

  • Boost Graph Library: the boost advantage
  • SNAP: Stanford Network Analysis Platform: a library that was built (and is used) for huge graphs with millions of nodes. ("Network" here means a graph with data on nodes and on edges.)

There are a couple of other graph libraries around, but many seem either no longer maintained or not designed for large volumes.


You should implement the graph as a binary decision diagram (BDD) data structure.

Briefly, the idea is that a graph can be represented as a binary function by using the characteristic function of the graph.

There are multiple ways to encode a graph as a binary function via its characteristic function. The article and video linked at the end of this post show one way to do it.

BDDs encode binary functions compactly, with fast operations. It is arguably one of the most powerful data structures around.

The idea of a BDD is almost the same as in a trie, but at each node we do not dispatch on the next input symbol; instead, each node has an attribute X, which is the index of a variable: if F(..X=true..) is true, continue on the high branch of the node down to the leaf representing true; if F(..X=false..) is true, continue on the low branch down to that same leaf. This is called the Shannon expansion of the boolean function (the same expansion formula is also a way to compute the hardware design of a boolean function, using multiplexers).

In general, for each possible combination of input values X_i for which the function is true, there is a unique branch from the root node to the true leaf, branching at each node depending on the input variable X_i (we take the low or high direction depending on whether X_i is false or true). The same diagram can be used to store multiple functions (each node is a different function).

There are two optimizations that turn a binary decision tree into a binary decision diagram, and they are what make it compact. The idea is identical to the optimizations in the minimization algorithm for a finite automaton. As with automata, the minimal BDD is unique for a given function, so to check whether two arbitrary functions are the same it is enough to convert both to BDDs and see whether the node representing one function is the same as the root node of the other (complexity O(1), a comparison of two pointer values).

The first optimization says: if a node has all its edges going to the same physical nodes as another node, we unify both nodes into a single one (this can be done at creation time by keeping a hash table of all created nodes).

The other optimization says: if both the low edge and the high edge of a node for variable X go to the same physical node for a variable Y, the X node disappears, because the function has the same value for F(...X=true...) = F(...X=false...).

There are thousands of articles about BDDs and their derivatives (by changing the interpretation of the dispatching at each node we get, for example, ZDDs, for compact representation of unordered sets). A typical article on the topic is What graphs can be efficiently represented by BDDs? by C. Dong and P. Molitor.

After you understand the basics of BDDs, if you have patience for a longer presentation, this video is excellent and summarizes how to encode graphs as BDDs.

BDDs are what professional software uses nowadays when it needs to manage millions of nodes.