Cartesian tree: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Altered doi-broken-date. | Use this bot. Report bugs. | #UCB_CommandLine
 
(31 intermediate revisions by 8 users not shown)
Line 1:
{{good article}}
{{Short description|Binary tree derived from a sequence of numbers}}
[[File:Cartesian tree.svg|thumb|240px|A sequence of numbers and the Cartesian tree derived from them.]]
Line 14 ⟶ 15:
#The tree has the [[Binary heap|min-heap property]]: the parent of any non-root node has a smaller value than the node itself.<ref name=minmax/>
These two definitions are equivalent: the tree defined recursively as described above is the unique tree that has the properties listed above. If a sequence of numbers contains repetitions, a Cartesian tree can be determined for it by following a consistent tie-breaking rule before applying the above construction. For instance, the first of two equal elements can be treated as the smaller of the two.{{sfnp|Vuillemin|1980}}
 
==History==
Cartesian trees were introduced and named by {{harvtxt|Vuillemin|1980}}, who used them as an example of the interaction between [[geometric combinatorics]] and the design and analysis of [[data structure]]s. In particular, Vuillemin used these structures to analyze the [[average-case complexity]] of concatenation and splitting operations on [[binary search tree]]s. The name is derived from the [[Cartesian coordinate]] system for the plane: in one version of this structure, as in the two-dimensional range searching application discussed abovebelow, a Cartesian tree for a point set has the sorted order of the points by their <math>x</math>-coordinates as its symmetric traversal order, and it has the heap property according to the <math>y</math>-coordinates of the points. {{harvtxt|Vuillemin|1980}} described both this geometric version of the structure, and the definition here in which a Cartesian tree is defined from a sequence. Using sequences instead of point coordinates provides a more general setting that allows the Cartesian tree to be applied to non-geometric problems as well.{{sfnp|Vuillemin|1980}}
 
==Efficient construction==
Line 22 ⟶ 26:
 
Another linear-time algorithm for Cartesian tree construction is based on [[Divide-and-conquer algorithm|divide-and-conquer]]. The algorithm recursively constructs the tree on each half of the input, and then merges the two trees. The merger process involves only the nodes on the left and right ''spines'' of these trees: the left spine is the path obtained by following left child edges from the root until reaching a node with no left child, and the right spine is defined symmetrically. As with any path in a min-heap, both spines have their values in sorted order, from the smallest value at their root to their largest value at the end of the path. To merge the two trees, apply a [[merge algorithm]] to the right spine of the left tree and the left spine of the right tree, replacing these two paths in two trees by a single path that contains the same nodes. In the merged path, the successor in the sorted order of each node from the left tree is placed in its right child, and the successor of each node from the right tree is placed in its left child, the same position that was previously used for its successor in the spine. The left children of nodes from the left tree and right children of nodes from the right tree remain unchanged. The algorithm is parallelizable since on each level of recursion, each of the two sub-problems can be computed in parallel, and the merging operation can be [[Merge algorithm#Parallel merge|efficiently parallelized]] as well.{{sfnp|Shun|Blelloch|2014}}
 
Yet another linear-time algorithm, using a linked list representation of the input sequence, is based on ''locally maximum linking'': the algorithm repeatedly identifies a ''local maximum'' element, i.e., one that is larger than both its neighbors (or than its only neighbor, in case it is the first or last element in the list). This element is then removed from the list, and attached as the right child of its left neighbor, or the left child of its right neighbor, depending on which of the two neighbors has a larger value, breaking ties arbitrarily. This process can be implemented in a single left-to-right pass of the input, and it is easy to see that each element can gain at most one left-, and at most one right child, and that the resulting binary tree is a Cartesian tree of the input sequence. {{sfnp|Kozma|Saranurak|2020}} {{sfnp|Hartmann|Kozma|Sinnamon|Tarjan|2021}}
 
It is possible to maintain the Cartesian tree of a dynamic input, subject to insertions of elements and [[lazy deletion]] of elements, in logarithmic [[amortized time]] per operation. Here, lazy deletion means that a deletion operation is performed by marking an element in the tree as being a deleted element, but not actually removing it from the tree. When the number of marked elements reaches a constant fraction of the size of the whole tree, it is rebuilt, keeping only its unmarked elements.{{sfnp|Bialynicka-Birula|Grossi|2006}}
Line 38 ⟶ 44:
===As a binary search tree===
{{main article|Treap}}
Because a Cartesian tree is a binary tree, it is natural to use it as a [[binary search tree]] for an ordered sequence of values. However, defining a Cartesian tree based on the same values that form the search keys of a binary search tree does not work well: theThe Cartesian tree of a sorted sequence is just a [[path graph]], rooted at its leftmost endpoint,. and binaryBinary searching in this tree degenerates to [[sequential search]] in the path. However, ita isdifferent possibleconstruction uses Cartesian trees to generate more-balanced[[binary search treestree]]s byof generatinglogarithmic ''priority''depth valuesfrom forsorted eachsequences searchof keyvalues. thatThis arecan differentbe thandone theby keygenerating itself,''priority'' sortingnumbers thefor inputseach by their key valuesvalue, and using the corresponding sequence of priorities to generate a Cartesian tree. This construction may equivalently be viewed in the geometric framework described above, in which the <math>x</math>-coordinates of a set of points are the searchvalues keysin a sorted sequence and the <math>y</math>-coordinates are thetheir priorities.{{sfnp|Seidel|Aragon|1996}}
 
This idea was applied by {{harvtxt|Seidel|Aragon|1996}}, who suggested the use of random numbers as priorities. The data[[self-balancing structurebinary search tree]] resulting from this random choice is called a [[treap]], due to its combination of binary search tree and binary min-heap features. An insertion into a treap can be performed by inserting the new key as a leaf of an existing tree, choosing a priority for it, and then performing [[tree rotation]] operations along a path from the node to the root of the tree to repair any violations of the heap property caused by this insertion; a deletion can similarly be performed by a constant amount of change to the tree followed by a sequence of rotations along a single path in the tree.{{sfnp|Seidel|Aragon|1996}} A variation on this data structure called a zip tree uses the same idea of random priorities, but simplifies the random generation of the priorities, and performs insertions and deletions in a different way, by splitting the sequence and its associated Cartesian tree into two subsequences and two trees and then recombining them.{{sfnp|Tarjan|Levy|Timmel|2021}}
 
If the priorities of each key are chosen randomly and independently once whenever the key is inserted into the tree, the resulting Cartesian tree will have the same properties as a [[random binary search tree]], a tree computed by inserting the keys in a randomly chosen [[permutation]] starting from an empty tree, with each insertion leaving the previous tree structure unchanged and inserting the new node as a leaf of the tree. Random binary search trees hadhave been studied for much longer than treaps, and are known to behave well as search trees. (theyThe have[[expected value|expected]] length of the search path to any given value is at most <math>2\ln n</math>, and the whole tree has [[logarithm]]ic depth (its maximum root-to-leaf distance) [[with high probability);]]. More formally, there exists a constant <math>C</math> such that the depth is <math>\le C\ln n</math> with probability tending to one as the number of nodes tends to infinity. The same good behavior carries over to treaps. It is also possible, as suggested by Aragon and Seidel, to reprioritize frequently-accessed nodes, causing them to move towards the root of the treap and speeding up future accesses for the same keys.{{sfnp|Seidel|Aragon|1996}}
 
===In sorting===
Line 56 ⟶ 62:
#* Add the Cartesian tree children of the removed value to the priority queue
 
As Levcopoulos and Petersson show, for input sequences that are already nearly sorted, the size of the priority queue will remain small, allowing this method to take advantage of the nearly-sorted input and run more quickly. Specifically, the worst-case running time of this algorithm is <math>O(n\log k)</math>, where <math>n</math> is the sequence length and <math>k</math> is the average, over all values in the sequence, of the number of consecutive pairs of sequence values that bracket the given value (meaning that the given value is between the two sequence values). They also prove a lower bound stating that, for any <math>n</math> and (non-constant) <math>k</math>, any comparison-based sorting algorithm must use <math>\Omega(n\log k)</math> comparisons for some inputs.{{sfnp|Levcopoulos|Petersson|1989}}
 
===In pattern matching===
The problem of ''Cartesian tree matching'' has been defined as a generalized form of [[string matching]] in which one seeks a [[substring]] (or in some cases, a [[subsequence]]) of a given string that has a Cartesian tree of the same form as a given pattern. Fast algorithms for variations of the problem with a single pattern or multiple patterns have been developed, as well as data structures analogous to the [[suffix tree]] and other text indexing structures.<ref>{{harvtxt|Park|Amir|Landau|Park|2019}}; {{harvtxt|Park|Bataa|Amir|Landau|2020}}; {{harvtxt|Song|Gu|Ryu|Faro|2021}}; {{harvtxt|Kim|Cho|2021}}; {{harvtxt|Nishimoto|Fujisato|Nakashima|Inenaga|2021}}; {{harvtxt|Oizumi|Kai|Mieno|Inenaga|2022}}</ref>
 
==History==
Cartesian trees were introduced and named by {{harvtxt|Vuillemin|1980}}. The name is derived from the [[Cartesian coordinate]] system for the plane: in one version of this structure, as in the two-dimensional range searching application discussed above, a Cartesian tree for a point set has the sorted order of the points by their <math>x</math>-coordinates as its symmetric traversal order, and it has the heap property according to the <math>y</math>-coordinates of the points. {{harvtxt|Vuillemin|1980}} described both this geometric version of the structure, and the definition here in which a Cartesian tree is defined from a sequence. Using sequences instead of point coordinates provides a more general setting that allows the Cartesian tree to be applied to non-geometric problems as well.{{sfnp|Vuillemin|1980}}
 
==Notes==
Line 101 ⟶ 104:
| title = STACS 2006, 23rd Annual Symposium on Theoretical Aspects of Computer Science, Marseille, France, February 23-25, 2006, Proceedings
| volume = 3884
| year = 2006}}| isbn = 978-3-540-32301-3
}}
*{{citation|contribution=On Cartesian trees and range minimum queries|first1=Erik D.|last1=Demaine|author1-link=Erik Demaine|first2=Gad M.|last2=Landau|first3=Oren|last3=Weimann|series=Lecture Notes in Computer Science|volume=5555|year=2009|pages=341–353|doi=10.1007/978-3-642-02927-1_29|title=Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5-12, 2009|isbn=978-3-642-02926-4|hdl=1721.1/61963|hdl-access=free}}
*{{citation
Line 150 ⟶ 154:
| volume = 13
| year = 1984}}
*{{citation
*{{citation|title=The maximum capacity route problem|first=T. C.|last=Hu|journal=Operations Research|volume=9|issue=6|year=1961|pages=898–900|jstor=167055|doi=10.1287/opre.9.6.898|doi-access=free}}
| last1 = Hartmann | first1 = Maria
| last2 = Kozma | first2 = László
| last3 = Sinnamon | first3 = Corwin
| last4 = Tarjan | first4 = Robert E. | author4-link = Robert Tarjan
| doi =10.4230/LIPIcs.ICALP.2021.79
| journal = [[ICALP]]
| pages = 79:1–79:20
| title = Analysis of Smooth Heaps and Slim Heaps
| series = Leibniz International Proceedings in Informatics (LIPIcs)
| year = 2021| volume = 198
| doi-access = free
| isbn = 978-3-95977-195-5
}}
*{{citation|title=The maximum capacity route problem|first=T. C.|last=Hu|author-link=T. C. Hu|journal=Operations Research|volume=9|issue=6|year=1961|pages=898–900|jstor=167055|doi=10.1287/opre.9.6.898|doi-access=free}}
*{{citation
| last1 = Kim | first1 = Sung-Hwan
Line 163 ⟶ 181:
| title = 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland
| volume = 191
| year = 2021| doi-access = free
| isbn = 9783959771863
}}
*{{citation
| last1 = Kozma | first1 = László
| last2 = Saranurak | first2 = Thatchaphol
| doi = 10.1137/18M1195188
| number = 5
| journal = [[SIAM J. Comput.]]
| publisher = SIAM
| title = Smooth Heaps and a Dual View of Self-Adjusting Data Structures
| volume = 49
| year = 2020
| arxiv = 1802.05471
}}
*{{citation
Line 185 ⟶ 216:
| title = WADS '89: Proceedings of the Workshop on Algorithms and Data Structures
| volume = 382
| year = 1989}}| isbn = 978-3-540-51542-5
}}
*{{citation
| last1 = Nishimoto | first1 = Akio
Line 201 ⟶ 233:
| title = String Processing and Information Retrieval - 28th International Symposium, SPIRE 2021, Lille, France, October 4-6, 2021, Proceedings
| volume = 12944
| year = 2021| s2cidisbn = 235313506978-3-030-86691-4
| s2cid = 235313506
}}
*{{citation
Line 218 ⟶ 251:
| title = 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic
| volume = 223
| year = 2022| doi-access = free
| isbn = 9783959772341
| s2cid = 246679910
}}
Line 236 ⟶ 270:
| title = 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy
| volume = 128
| year = 2019| doi-access = free
| isbn = 9783959771030
| s2cid = 162168587
}}
Line 262 ⟶ 297:
| url = http://citeseer.ist.psu.edu/seidel96randomized.html
| volume = 16
| year = 1996| doi-broken-date = 1 July 2025 }}
*{{citation
| last1 = Schieber | first1 = Baruch