Talk:Borůvka's algorithm: Difference between revisions

Content deleted Content added
TinucherianBot (talk | contribs)
Autoassessment for WP:COMP ! ( FAQ ) :
Line 38:
----
Why is there no mention of Yao's improvement to make this algorithm O(log log n)? [[User:129.59.32.191|129.59.32.191]] 01:19, 11 February 2007 (UTC)
 
== Comments by 24.202.168.27 ==
 
The following addition was recently made to this article:
 
:The minimal spanning tree problem is simple to explain: how to connect a set of n cities vertices) with pipes/roads/etc (a spanning tree), choosing from a set of m (m < n^2) possible links (the original graph's edges) for the least cost. It's also one of the earliest studied problems among those we still encounter to this day. Two extremely well known algorithms to solve the problem are Prim's and Kruskal's algorithms (note the prevalence of Eastern European researchers; I think it might just be because they had greater territory to cover than the rest of the modern world). If one squints just so, they look somewhat similar: both iterate through the cheapest edges that connect disjoint trees.
 
:This isn't a coincidence. The correctness of most MST algorithms may be derived from the following lemma:
 
 
:Tarjan's Blue Rule:
 
:Let {X, X-bar} be a cut of G (a non-trivial partition of G's vertices), and e an edge of minimal weight among those crossing the cut (linking one vertex in X and one in V(G) \ X). There exists a minimal spanning tree (MST) containing e.
 
:It is easy to show that it is true. Let T be an MST. There is exactly one edge e' that crosses the {X, X-bar} cut. e' may be removed, yielding 2 components, and those components connected back with e, yielding a new tree T'. Since e is of minimal weight among those crossing the cut, c(e) <= c(e'), so the new tree's weight is at most that of T (c(T') <= c(T)). T is of minimal weight, c(T') = C(T).
 
:The dual to that rule is the Red Rule, which says that, given a cycle and a maximal weight edge on that cycle, there exists an MST without that edge. Since MSTs have exactly n-1 edges, it's generally preferable to build the trees up than to trim edges away from a potentially nearly-complete graph.
 
 
:Since both Prim's and Kruskal's algorithms iterate through up to all the edges in order, it's easy to show that, regardless of the datastructures we may use, there is a (tight) lower bound on their complexity of O(m lg n) (simply observe that it's impossible to sort a sequence of n elements by comparison in less than n lg n comparisons). Of course, this assume the lack of special structure, e.g., a small set of arc costs.
 
:Boruvka's algorithm dodges this problem by going through much fewer elements. The basic idea relies on the fact that the Blue rule applies to any pair of trees. Boruvka's algorithm thus attempts to build subtrees of similar size, merging them until a single one is left. By ensuring a sort of balance on subtree size, the algorithm also makes sure there aren't too many disconnected subtrees (since there's a fixed number of vertices).
 
:To find whether two nodes are in the same subtree, we can use Tarjan's Union-Find data structure, which gives us extremely efficient (amortised) operations on disjoint sets: Creation of n singletons in O(n) time, n Union of sets and m ``same set'' queries in total O(m alpha(m, n)) time, where alpha is the 2-parameter inverse Ackermann function (extremely slow growing; pretty much constant for all realistic usage).
 
:We must still decide how to balance the subtrees. Intuitively, we could go for the smallest-tree criterion: always connect the smallest tree in the forest (set of tree) with the least number of vertices. Interestingly, this can even be implemented with a linear time O(n) for *all* the selections: it suffices to notice that the keys are integers in [1, n], and that keys follow a non-decreasing progression. Thus, a linear forward scan in a list of buckets suffices.
 
:However, we can also use a weaker criterion while still keeping the same properties: place the trees in a FIFO queue, pop from the front and push the result of fusions to the back (uniform selection). Initially, isolated vertices are considered singleton trees and pushed on the queue in an arbitrary order. This obviously takes O(n) time for all the selections. We can guess that since newly-fused trees aren't processed too early, the size of subtrees will tend to equalize over time.
 
:Thus, the algorithm is simply to pop the next tree in the queue, find the cheapest edge that connects it to another tree, fuse them, remove the other tree from the queue, and push the newly fused tree at the back of the queue.
 
:Since the uniform selection criterion is simpler I will focus on that one. See, e.g., Cheriton & Tarjan 1976 for more details.
 
:As an analysis trick, we will associate a Phase to each tree ever generated in the algorithm. The Phase of singleton trees is 1, and that of the fusion of T1 and T2 1+min{Phase(T1), Phase(T2)}.
 
:It is simple to show (by induction on i) that a tree in Phase i contains at least 2^i vertices. Since there are n vertices, the algorithm will always stop by the time we reach the (lg n)th Phase.
 
:It is slightly more complicated to show that two trees in the same phase are disjoint (see the paper). Taking this result for granted, we can see that, in each phase, we only have to process each edge at most twice (once for each end), giving a total number of edge inspection (2m lg n).
 
:Given that we are performing many fusions and have a nice bound on the total number of edges through which we iterate, it makes sense to represent our set of edges as unsorted lists, and iterate through the lists of indicent edges to find the minimal one (and delete those that connect vertices in the same tree). Interestingly, this already gives us a O(m lg n) time algorithm with three extremely simple data structures (Union-Find, a FIFO and regular linked lists), and nearly trivial logic.
 
:However, it seems obvious that we can do better by accelerating the search for minimal edges. We can do that by representing our sets of edges as unsorted list of sorted subsets. Each subset is of size at most k. Obviously, initialising the list of subsets for m elements takes O(m lg k) time [~m/k sorts in time k lg n]. Note that there might be an overflow subset with less than k elements. However, fusions still happen in constant time since we just leave these not-quite-filled subsets by themselves. Finally, as advertised, searching for the minimal element is much faster (we only have to look at the min element in each subset). Since we also have to delete some edges that link vertices in the same tree, it's a bit slower than just looking at the min element of each sorted subset, but the number of deleted edges is bounded.
 
:With some algebra, we can see that, if we use lists of subsets of k elements from Phase p until Phase q:
 
:Initialisation takes O(m lg k) time
 
:Each phase takes takes O(m/k + n/2^p + l): each tree contains at least 2^p vertices, thus there are at most n/2^p non-full subsets, and l is the number of skipped edges (on which we can put a useful bound).
 
:Thus for the Phases p to q, we have a total time
:O([m/k + n/2^p](q-p) + m lg k).
 
:With some more algebra, we can find this interesting schedule:
 
:use k = 1 from Phase 1 to (lg lg lg n):
:O(m) initialisation, O(m lg lg lg n) execution
 
:use k = lg lg n from Phase (lg lg lg n) to (lg lg n)
:O(m lg lg lg n) initialisation, O(m) execution
 
:use k = lg n from Phase (lg lg n) to (lg n)
:O(m lg lg n) for initialisation, O(m) execution.
 
:The last initialisation phase completely dominates the total runtime, O(m lg lg n), which is much better than Prim's and Kruskal's O(m lg n)! If one were lazy, it would be possible to use k = 1 until Phase (lg lg n) without affecting the complexity. However, it is more interesting to observe that the initialisation phases are trivially parallelisable (a multitude of completely independent set sorts), and that there are very few comparisons (O(m lg lg lg n)), which may be interesting when cost comparisons/computations are expensive.
 
:With some cunning, Boruvka managed to combine a very simple edge selection rule, with simple, well-known, data structures (Union-Find, a FIFO queue and maybe regular linked lists) and a surprisingly effective simple, but more obscure, data structure (unsorted list of sorted subsets) to build an extremely effective algorithm. This rare combination of all-around simplicity and extreme efficiency (and parallelism, since the initialisation steps are trivially parallelisable) makes Boruvka's algorithm worthy of much more interest than it typically receives.
 
Some of this can definitely be used to expand the article. However, there are some consistent problems in tone that I think prevent it from being immediately useful. I've moved the contribution here to keep it visible, but I will remove it from the article. [[User:Michael Slone|Michael Slone]] ([[User talk:Michael Slone|talk]]) 02:35, 15 January 2009 (UTC)