Algorithm for topological sorting if cycles exist

This is the NP-hard (and APX-hard) problem known as minimum feedback vertex set. An approximation algorithm due to Demetrescu and Finocchi (pdf, Combinatorial Algorithms for Feedback Problems in Directed Graphs (2003)") works well in practice when there are no long simple cycles, as I would expect for your application.


Here is how to do it in Python:

from collections import defaultdict

def topological_sort(dependency_pairs):
    'Sort values subject to dependency constraints'
    num_heads = defaultdict(int)   # num arrows pointing in
    tails = defaultdict(list)      # list of arrows going out
    for h, t in dependency_pairs:
        num_heads[t] += 1
        tails[h].append(t)

    ordered = [h for h in tails if h not in num_heads]
    for h in ordered:
        for t in tails[h]:
            num_heads[t] -= 1
            if not num_heads[t]:
                ordered.append(t)
    cyclic = [n for n, heads in num_heads.iteritems() if heads]
    return ordered, cyclic

def is_toposorted(ordered, dependency_pairs):
    '''Return True if all dependencies have been honored.
       Raise KeyError for missing tasks.
    '''
    rank = {t: i for i, t in enumerate(ordered)}
    return all(rank[h] < rank[t] for h, t in dependency_pairs)

print topological_sort('aa'.split())
ordered, cyclic = topological_sort('ah bg cf ch di ed fb fg hd he ib'.split())
print ordered, cyclic
print is_toposorted(ordered, 'ah bg cf ch di ed fb fg hd he ib'.split())
print topological_sort('ah bg cf ch di ed fb fg hd he ib ba xx'.split())

The runtime is linearly proportional to the number of edges (dependency pairs).

The algorithm is organized around a lookup table called num_heads that keeps a count the number of predecessors (incoming arrows). In the ah bg cf ch di ed fb fg hd he ib example, the counts are:

node  number of incoming edges
----  ------------------------
 a       0
 b       2
 c       0
 d       2
 e       1
 f       1  
 g       2
 h       2
 i       1

The algorithm works by "visting" nodes with no predecessors. For example, nodes a and c have no incoming edges, so they are visited first.

Visiting means that the nodes are output and removed from the graph. When a node is visited, we loop over its successors and decrement their incoming count by one.

For example, in visiting node a, we go to its successor h to decrement its incoming count by one (so that h 2 becomes h 1.

Likewise, when visiting node c, we loop over its successors f and h, decrementing their counts by one (so that f 1 becomes f 0 and h 1 becomes h 0).

The nodes f and h no longer have incoming edges, so we repeat the process of outputting them and removing them from the graph until all the nodes have been visited. In the example, the visitation order (the topological sort is):

a c f h e d i b g

If num_heads ever arrives at a state when there are no nodes without incoming edges, then it means there is a cycle that cannot be topologically sorted and the algorithm exits.