Expected linear time MST algorithm: Difference between revisions

Content deleted Content added
m Reverted edits by 128.193.8.92 (talk) to last version by Yobot
Further reading: The link should be refreshed. Isn't valid anymore
 
(19 intermediate revisions by 16 users not shown)
Line 1:
AThe '''expected linear time MST algorithm''' is a [[randomized algorithm]] for computing the [[minimum spanning forest]] of a [[weighted graph]] with no [[isolated vertex|isolated vertices]]. It was developed by [[David Karger]], Philip Klein, and [[Robert Tarjan]].<ref>{{Cite journal|doi=10.1145/201019.201022|title=A randomized linear-time algorithm to find minimum spanning trees|journal=Journal of the ACM|volume=42|issue=2|pages=321|year=1995|last1=Karger|first1=David R.|last2=Klein|first2=Philip N.|last3=Tarjan|first3=Robert E.|citeseerx=10.1.1.39.9012|s2cid=832583 }}</ref> The algorithm relies on techniques from [[Borůvka's algorithm]] along with an algorithm for verifying a minimum spanning tree in linear time.<ref name=MST-V1>{{Cite journal|doi=10.1137/0221070|title=Verification and Sensitivity Analysis of Minimum Spanning Trees in Linear Time|journal=SIAM Journal on Computing|volume=21|issue=6|pages=1184|year=1992|last1=Dixon|first1=Brandon|last2=Rauch|first2=Monika|last3=Tarjan|first3=Robert E.|citeseerx=10.1.1.49.25}}</ref><ref name=MST-V2>{{cite conference
| last1 = KargerKing | first1 = David R.Valerie | author1-link = DavidValerie KargerKing
.<ref>{{citation
| last1 = Karger | first1 = David R. | author1-link = David Karger
| last2 = Klein | first2 = Philip N.
| last3 = Tarjan | first3 = Robert E. | author3-link = Robert Tarjan
| doi = 10.1145/201019.201022
| mr = 1409738
| issue = 2
| journal = [[Journal of the Association for Computing Machinery]]
| pages = 321–328
| title = A randomized linear-time algorithm to find minimum spanning trees
| volume = 42
| year = 1995}}</ref> The algorithm relies on techniques from [[Borůvka's algorithm]] along with an algorithm for [[MST verification algorithm|verifying a minimum spanning tree in linear time]]<ref name=MST-V1>{{cite journal
| last1 = Dixon | first1 = Brandon
| last2 = Rauch | first2 = Monika
| last3 = Tarjan | first3 = Robert
| author3-link = Robert Tarjan
| title = Verification and sensitivity analysis of minimum spanning trees in linear time
| journal = SIAM J. on Computing
| volume = 21
| pages = 1184–1192
| year = 1992
| url = http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.25
}}</ref>
.<ref name=MST-V2>
{{cite conference
| last1 = King | first1 = Valerie
| title = A Simpler Minimum Spanning Tree Verification Algorithm
| booktitleconference = Proceedings of the 4th International Workshop on Algorithms and Data Structures
| year = 1995
| pages = 440–448
Line 36 ⟶ 11:
</ref> It combines the design paradigms of [[divide and conquer algorithms]], [[greedy algorithms]], and [[randomized algorithms]] to achieve [[Expected value|expected]] [[linear time|linear performance]].
 
Deterministic algorithms that find the minimum spanning tree include [[Prim's algorithm]], [[Kruskal's algorithm]], [[Reversereverse-Deletedelete algorithm]], and [[Borůvka's algorithm]].
 
==Overview==
The key insight to the algorithm is a random sampling step which partitions a graph into two [[Glossary_of_graph_theory#Subgraphs|subgraphs]] by randomly selecting edges to include in each subgraph. The algorithm recursively finds the [[minimum spanning tree|minimum spanning forest]] of the first subproblem and uses the solution in conjunction with thea [[MST verification algorithm|linear time verification algorithm]] to discard edges in the graph that cannot be in the minimum spanning tree. A procedure taken from [[Borůvka's algorithm]] is also used to reduce the size of the graph at each [[Recursion (computer science)|recursion]].
 
===Borůvka Step===
Each iteration of the algorithm relies on an adaptation of Borůvka's Algorithmalgorithm referred to as a '''Borůvka Step'step'':
 
<br />
Input: A graph ''G'' with no isolated vertices
1 For each vertex ''v'', select the lightest edge incident on ''v''
Line 49 ⟶ 24:
3 Remove all isolated vertices, self-loops, and non-minimal repetitive edges from ''G'''
Output: The edges selected in step 1 and the contracted graph ''G'''
A Borůvka Step is equivalent to the inner loop of [[Borůvka's algorithm]] which runs in ''O''(''m'') time where ''m'' is the number of edges in ''G''. Furthermore, since each edge can be selected at most twice (once by each incident vertex) the maximum number of disconnected components after step 1 is equal to half the number of vertices. Thus, a Borůvka step reduces the number of vertices in the graph by at least a factor of two and deletes at least ''n''/2 edges where ''n'' is the number of vertices in ''G''.
 
A Borůvka Stepstep is equivalent to the inner loop of [[Borůvka's algorithm]], which runs in ''O''(''m'') time where ''m'' is the number of edges in ''G''. Furthermore, since each edge can be selected at most twice (once by each incident vertex) the maximum number of disconnected components after step 1 is equal to half the number of vertices. Thus, a Borůvka step reduces the number of vertices in the graph by at least a factor of two and deletes at least ''n''/2 edges where ''n'' is the number of vertices in ''G''.
Example Execution of a Borůvka Step
 
{| border=1 cellspacing=2 cellpadding=5 class="wikitable"
Example Executionexecution of a Borůvka Stepstep
{| class="wikitable"
! Image !! Description
|-
Line 71 ⟶ 47:
|}
 
===F-heavy and F-light Edgesedges===
In each iteration the algorithm removes edges with particular properties that exclude them from the [[minimum spanning tree]]. These are called '''F-heavy edges''' and are defined as follows. Let ''F'' be a forest on the [[Graph (discrete mathematics)|graph]] ''H''. An F-heavy edge is an edge ''e'' connecting vertices ''u'',''v'' whose weight is strictly greater than the weight of the heaviest edge on the path from ''u'' to ''v'' in ''F''. (If a path does not exist in ''F'' it is considered to have infinite weight). Any edge that is not F-heavy is '''F-light'''. If ''HF'' is a [[Glossary_of_graph_theory#Subgraphs|subgraph]] of ''G'' then any F-heavy edge in ''G'' cannot be in the minimum spanning tree of ''G'' by the [[Minimum_spanning_tree#Cycle_property|cycle property]]. Given a forest, F-heavy edges can be computed in [[linear time]] using a [[MST verification algorithm|minimum spanning tree verification algorithm]].<ref name=MST-V1/><ref name=MST-V2/>
 
==Algorithm==
Input: A graph ''G'' with no isolated vertices
1# If ''G'' is empty return an empty forest
2# Create a contracted graph ''G''' by running two successive Borůvka steps on ''G''
3# Create a subgraph ''H'' by selecting each edge in ''G''' with probability 1/2. Recursively apply the algorithm to ''H'' to get its minimum spanning forest ''F''.
4# Remove all F-heavy edges from ''G''' (where ''F'' is the forest from step 3) using a [[MST verification algorithm|linear time minimum spanning tree verification algorithm]].<ref name=MST-V1/><ref name=MST-V2/>
5# Recursively apply the algorithm to ''G''' to get its minimum spanning forest.
Output: The minimum spanning forest of ''G''' and the contracted edges from the Borůvka steps
 
==Correctness==
Line 89 ⟶ 65:
The expected performance is a result of the random sampling step. The effectiveness of the random sampling step is described by the following lemma which places a bound on the number of '''[[#F-heavy and F-light Edges|F-light]]''' edges in ''G'' thereby restricting the size of the second subproblem.
 
===Random Samplingsampling Lemmalemma===
'''Lemma'''- Let ''H'' be a subgraph of ''G'' formed by including each edge of ''G'' independently with probability ''p'' and let ''F'' be the minimum spanning forest of ''H''. The [[Expected value|expected number]] of '''[[#F-heavy and F-light Edges|F-light]]''' edges in ''G'' is at most ''n/p'' where ''n'' is the number of vertices in ''G'' .
 
To prove the lemma examine the edges of ''G'' as they are being added to ''H''. The number of [[#F-heavy and F-light Edges|F-light]] edges in ''G'' is independent of the order in which the edges of ''H'' are selected since the minimum spanning forest of ''H'' is the same for all selection orders. For the sake of the proof consider selecting edges for ''H'' by taking the edges of ''G'' one at a time in order of edge weight from lightest to heaviest. Let ''e'' be the current edge being considered. If the endpoints of ''e'' are in two disconnected components of ''H'' then ''e'' is the lightest edge connecting those components and if it is added to ''H'' it will be in ''F'' by the [[Minimum_spanning_tree#Cut_property|cut property]]. This also means ''e'' is [[#F-heavy and F-light Edges|F-light]] regardless of whether or not it is added to ''H'' since only heavier edges are subsequently considered. If both endpoints of ''e'' are in the same component of ''H'' then it is (and always will be) F-heavy by the [[Minimum_spanning_tree#Cycle_property|cycle property]]. Edge ''e'' is then added to ''H'' with probability ''p''.
Line 96 ⟶ 72:
The maximum number of [[#F-heavy and F-light Edges|F-light]] edges added to ''H'' is ''n''-1 since any minimum spanning tree of ''H'' has ''n''-1 edges. Once ''n''-1 F-light edges have been added to ''H'' none of the subsequent edges considered are F-light by the [[Minimum_spanning_tree#Cycle_property|cycle property]]. Thus, the number of F-light edges in ''G'' is bounded by the number of F-light edges considered for ''H'' before ''n''-1 F-light edges are actually added to ''H''. Since any F-light edge is added with probability ''p'' this is equivalent to flipping a coin with probability ''p'' of coming up heads until ''n''-1 heads have appeared. The total number of coin flips is equal to the number of F-light edges in ''G''. The distribution of the number of coin flips is given by the [[negative binomial distribution|inverse binomial distribution]] with parameters ''n''-1 and ''p''. For these parameters the expected value of this distribution is (''n''-1)/''p''.
 
===Expected Analysisanalysis===
Ignoring work done in recursive subproblems the total amount of work done in a single invocation of the algorithm is [[linear time|linear]] in the number of edges in the input graph. Step 1 takes constant time. Borůvka steps can be executed in time linear in the number of edges as mentioned in the [[#Borůvka Stepstep|Borůvka step]] section. Step 3 iterates through the edges and flips a single coin for each one so it is linear in the number of edges. Step 4 can be executed in linear time using a modified [[MST verification algorithm|linear time minimum spanning tree verification algorithm]].<ref name=MST-V1/><ref name=MST-V2/> Since the work done in one iteration of the algorithm is linear in the number of edges the work done in one complete run of the algorithm (including all recursive calls) is bounded by a constant factor times the total number of edges in the original problem and all recursive subproblems.
 
Each invocation of the algorithm produces at most two subproblems so the set of subproblems forms a [[binary tree]]. Each [[#Borůvka Step|Borůvka step]] reduces the number of vertices by at least a factor of two so after two Borůvka steps the number of vertices has been reduced by a factor of four. Thus, if the original graph has ''n'' vertices and ''m'' edges then at depth ''d'' of the tree each subproblem is on a graph of at most ''n''/4<sup>''d''</sup> vertices. Also the tree has at most log<sub>4</sub>''n'' levels.
Line 107 ⟶ 83:
Each edge in a left child problem is selected from the edges of its parent problem (less the edges contracted in the [[#Borůvka Step|Borůvka steps]]) with probability 1/2. If a parent problem has ''x'' edges then the [[Expected value|expected number]] of edges in the left child problem is at most ''x''/2. If ''x'' is replaced by a random variable ''X'' then by the [[Linearity_of_expectation#Linearity|linearity of expectation]] the expected number of edges in the left child problem ''Y'' is given by <math>E[Y] \leq E[X]/2</math>. Thus if the expected number of edges in a problem at the top of a left path is ''k'' then the sum of the expected number of edges in each subproblem in the left path is at most <math>\sum_{d=0}^{\infty} \frac{k}{2^d}=2k</math> (see [[Geometric series]]). The root has ''m'' edges so the [[Expected value|expected number]] of edges is equal to 2''m'' plus twice the expected number of edges in each right subproblem.
 
The expected number of edges in each right subproblem is equal to the number of [[#F-heavy and F-light Edgesedges|F-light]] edges in the parent problem where ''F'' is the minimum spanning tree of the left subproblem. The number of F-light edges is less than or equal to twice the number of vertices in the subproblem by the [[#Random Sampling Lemma|sampling lemma]]. The number of vertices in a subproblem at depth ''d'' is ''n''/4<sup>''d''</sup> so the total number of vertices in all right subproblems is given by <math>\sum_{d=1}^{\infty}\frac{2^{d-1}n}{4^d}=n/2</math>. Thus, the expected number of edges in the original problem and all subproblems is at most 2''m''+''n''. Since ''n'' at most 2''m'' for a graph with no isolated vertices the algorithm runs in expected time ''O''(''m'').
 
===Worst Casecase Analysisanalysis===
The worst case runtime is equivalent to the runtime of [[Borůvka's algorithm]]. This occurs if all edges are added to either the left or right subproblem on each invocation. In this case the algorithm is identical to [[Borůvka's algorithm]] which runs in ''O''(min{''n''<sup>2</sup>, ''m''log''n''}) on a graph with ''n'' vertices and ''m'' edges.
 
==See also==
*[[Kruskal's Algorithm]]
*[[Borůvka's algorithm]]
*[[Prim's algorithm]]
*[[Reverse-Delete algorithm]]
 
==Further reading==
[http://www.cs.technion.ac.il/~idddo/mstverif.pdf Minimum Spanning Tree Verification in Linear Time]
 
==References==