Range query (computer science): Difference between revisions

Citation bot (talk | contribs)
Altered template type. Add: isbn, pages, volume, date, series, url, chapter, title, chapter-url, authors 1-4. Removed or converted URL. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar
m Typo/quotemark fixes, replaced: ’s → 's, horizonal → horizontal
 
(8 intermediate revisions by 6 users not shown)
Line 52:
===Median===
 
This particular case is of special interest since finding the [[median]] has several applications.<ref name=heriel>{{cite book |arxiv=0807.0222 |doi=10.1007/978-3-540-87744-8_42 |chapter=Range Medians |title=Algorithms - ESA 2008 |series=Lecture Notes in Computer Science |date=2008 |last1=Har-Peled |first1=Sariel |last2=Muthukrishnan |first2=S. |volume=5193 |pages=503–514 |isbn=978-3-540-87743-1 }}</ref> On the other hand, the median problem, a special case of the [[selection problem]], is solvable in ''O''(''n'') time using the [[median of medians]] algorithm.<ref name=tarjanmedian>{{Cite journal | last1 = Blum | first1 = M. | authorlink1 = Manuel Blum| last2 = Floyd | first2 = R. W. | authorlink2 = Robert W. Floyd| last3 = Pratt | first3 = V. R. | authorlink3 = Vaughan Pratt| last4 = Rivest | first4 = R. L. | authorlink4 = Ron Rivest| last5 = Tarjan | first5 = R. E. | authorlink5 = Robert Tarjan | title = Time bounds for selection | doi = 10.1016/S0022-0000(73)80033-9 | journal = Journal of Computer and System Sciences | volume = 7 | issue = 4 | pages = 448–461 | date =August 1973 | url = https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf| doi-access = free }}</ref> However, its generalization through range median queries is recent.<ref name=ethpaper /> A range median query <math>\operatorname{median}(A,i,j)</math>, where ''A'', ''i'' and ''j'' have the usual meanings, returns the median element of <math>A[i,j]</math>. Equivalently, <math>\operatorname{median}(A,i,j)</math> should return the element of <math>A[i,j]</math> of rank <math>\frac{j-i}{2}</math>. Range median queries cannot be solved by any of the methods discussed above, including Yao's approach for semigroup operators.<ref name="morin kranakis" />
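
As a point of reference, the definition can be checked against a naive baseline that sorts a copy of the queried range. This is only an illustrative sketch, not one of the data structures discussed in this article; the function name <code>range_median</code> is hypothetical, and reading the rank as a 0-based index rounded down is an assumption.

```python
def range_median(A, i, j):
    """Naive range median of A[i..j] (both ends inclusive).

    Sorts a copy of the range, so each query costs O(m log m) time
    for m = j - i + 1. Assumption: the rank (j - i) / 2 is read as a
    0-based index rounded down, i.e. the lower median.
    """
    if not (0 <= i <= j < len(A)):
        raise IndexError("expected 0 <= i <= j < len(A)")
    window = sorted(A[i:j + 1])
    return window[(j - i) // 2]


A = [5, 1, 4, 2, 3]
print(range_median(A, 0, 4))  # median of the whole array
print(range_median(A, 1, 3))  # median of [1, 4, 2]
```

A baseline of this kind is mainly useful for testing a faster structure against known answers.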
 
Two variants of this problem have been studied: the [[offline algorithm|offline]] version, where all the ''k'' queries of interest are given in a batch, and a version where all the pre-processing is done up front. The offline version can be solved in <math>O(n\log k + k \log n)</math> time and <math>O(n\log k)</math> space.
Line 82:
 
===Majority===
Finding frequent elements in a given set of items is one of the most important tasks in data mining. Finding frequent elements can be difficult when most items have similar frequencies, so it can be more useful to detect such items against some threshold of significance. One of the most famous algorithms for finding the majority of an array was proposed by Boyer and Moore<ref>{{cite book |last1=Boyer|first1=Robert S.|date=1991|chapter-url=http://dx.doi.org/10.1007/978-94-011-3488-0_5|pages=105–117|place=Dordrecht|publisher=Springer Netherlands|access-date=2021-12-18|last2=Moore|first2=J. Strother|title=Automated Reasoning |chapter=MJRTY—A Fast Majority Vote Algorithm |series=Automated Reasoning Series |volume=1 |doi=10.1007/978-94-011-3488-0_5 |isbn=978-94-010-5542-0 }}</ref> and is also known as the [[Boyer–Moore majority vote algorithm]]. Boyer and Moore proposed an algorithm to find the majority element of a string (if it has one) in <math>O(n)</math> time and using <math>O(1)</math> space. In the context of Boyer and Moore's work and generally speaking, a majority element in a set of items (for example, a string or an array) is one whose number of instances is more than half of the size of that set. A few years later, Misra and Gries<ref>{{Cite journal|last1=Misra|first1=J.|last2=Gries|first2=David|date=November 1982|title=Finding repeated elements|journal=Science of Computer Programming|volume=2|issue=2|pages=143–152|doi=10.1016/0167-6423(82)90012-0|issn=0167-6423|doi-access=free|hdl=1813/6345|hdl-access=free}}</ref> proposed a more general version of Boyer and Moore's algorithm using <math>O \left ( n \log \left ( \frac{1}{\tau} \right ) \right )</math> comparisons to find all items in an array whose relative frequencies are greater than some threshold <math>0<\tau<1</math>.
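
The two algorithms mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' original formulations: the function names are hypothetical, the input is assumed to be a Python list of hashable items, and the counter-based generalization uses a dictionary, so it does not reproduce the comparison bound quoted above.

```python
import math
from collections import Counter


def boyer_moore_majority(items):
    """Return the element occurring more than len(items) / 2 times, or None."""
    candidate, count = None, 0
    for x in items:                      # first pass: pairwise cancellation
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Second pass: the surviving candidate still has to be verified.
    occurrences = sum(1 for x in items if x == candidate)
    if len(items) > 0 and 2 * occurrences > len(items):
        return candidate
    return None


def frequent_items(items, tau):
    """Return all items with relative frequency greater than tau (0 < tau < 1).

    Counter-based generalization in the style of Misra and Gries: at most
    ceil(1 / tau) - 1 counters are kept, then candidates are verified.
    """
    k = math.ceil(1 / tau)
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    exact = Counter(items)               # verification pass
    return {x for x in counters if exact[x] > tau * len(items)}


print(boyer_moore_majority([1, 2, 1, 1, 3]))
print(frequent_items([1, 1, 2, 2, 3, 1, 2, 4], 0.3))
```

In both functions the first pass only produces candidates; the verification pass is what makes the reported answer exact.
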
A range <math>\tau</math>-majority query is one that, given a subrange of a data structure (for example, an array) of size <math>|R|</math>, returns the set of all distinct items that appear more than (or in some publications equal to) <math>\tau |R|</math> times in that given range. In different structures that support range <math>\tau</math>-majority queries, <math>\tau </math> can be either static (specified during pre-processing) or dynamic (specified at query time). Many such approaches are based on the fact that, regardless of the size of the range, for a given <math>\tau</math> there can be at most <math>O(1/\tau)</math> distinct ''candidates'' with relative frequencies at least <math>\tau</math>. By verifying each of these candidates in constant time, <math>O(1/\tau)</math> query time is achieved. A range <math>\tau</math>-majority query is decomposable<ref name=":1">{{Cite book|author=Karpiński, Marek|url=https://worldcat.org/oclc/277046650|title=Searching for frequent colors in rectangles|oclc=277046650}}</ref> in the sense that a <math>\tau</math>-majority in a range <math>R</math> with partitions <math>R_1</math> and <math>R_2</math> must be a <math>\tau</math>-majority in either <math>R_1</math> or <math>R_2</math>. Due to this decomposability, some data structures answer <math>\tau</math>-majority queries on one-dimensional arrays by finding the [[lowest common ancestor]] (LCA) of the endpoints of the query range in a [[range tree]] and validating two sets of candidates (of size <math>O(1/\tau)</math>) from each endpoint to the lowest common ancestor in constant time, resulting in <math>O(1/\tau)</math> query time.
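
The decomposability property can be illustrated with a brute-force sketch: the majorities of the two parts serve as candidates, and each candidate is then verified against the whole range. The function names and the split index <code>m</code> are hypothetical, the strict "more than" definition is assumed, and nothing here reflects the constant-time verification used by the actual data structures.

```python
from collections import Counter


def tau_majorities(seq, tau):
    """All items occurring more than tau * len(seq) times in seq (brute force)."""
    counts = Counter(seq)
    return {x for x, c in counts.items() if c > tau * len(seq)}


def tau_majorities_by_parts(A, i, m, j, tau):
    """tau-majorities of A[i..j], using A[i..m] and A[m+1..j] as candidate sources."""
    left, right = A[i:m + 1], A[m + 1:j + 1]
    # Decomposability: a tau-majority of the whole range must be a
    # tau-majority of at least one part, so these candidates suffice.
    candidates = tau_majorities(left, tau) | tau_majorities(right, tau)
    whole = A[i:j + 1]
    return {x for x in candidates if whole.count(x) > tau * len(whole)}


A = [1, 2, 1, 3, 1, 2, 2, 4]
print(tau_majorities_by_parts(A, 0, 3, 7, 0.25))
```

The property holds because an item that occurs at most <math>\tau |R_1|</math> times in <math>R_1</math> and at most <math>\tau |R_2|</math> times in <math>R_2</math> occurs at most <math>\tau |R|</math> times in <math>R</math>.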
 
==== Two-dimensional arrays ====
Line 93:
 
==== One-dimensional arrays ====
Chan et al.<ref name=":0">{{cite book |last1=Chan|first1=Timothy M. |date=2012|chapter-url=http://dx.doi.org/10.1007/978-3-642-31155-0_26 |pages=295–306|place=Berlin, Heidelberg|publisher=Springer Berlin Heidelberg|isbn=978-3-642-31154-3|access-date=2021-12-20|last2=Durocher|first2=Stephane|last3=Skala|first3=Matthew|last4=Wilkinson|first4=Bryan T.|title=Algorithm Theory – SWAT 2012 |chapter=Linear-Space Data Structures for Range Minority Query in Arrays |series=Lecture Notes in Computer Science |volume=7357 |doi=10.1007/978-3-642-31155-0_26 }}</ref> proposed a data structure that, given a one-dimensional array <math>A</math>, a subrange <math>R</math> of <math>A</math> (specified at query time) and a threshold <math>\tau</math> (specified at query time), is able to return the list of all <math>\tau</math>-majorities in <math>O(1/\tau)</math> time requiring <math>O(n \log n)</math> words of space. To answer such queries, Chan et al.<ref name=":0" /> begin by noting that there exists a data structure capable of returning the ''top-k'' most frequent items in a range in <math>O(k)</math> time requiring <math>O(n)</math> words of space. For a one-dimensional array <math>A[0..n-1]</math>, let a one-sided top-k range query be of the form <math>A[0..i] \text { for } 0 \leq i \leq n-1</math>. For each maximal run of prefixes <math>A[0..i] \text { through } A[0..j]</math> over which the frequency of a distinct element <math>e</math> in <math>A</math> remains unchanged (and equal to <math>f</math>), a horizontal line segment is constructed. The <math>x</math>-interval of this line segment corresponds to <math>[i,j]</math> and it has a <math>y</math>-value equal to <math>f</math>. Since adding each element to <math>A</math> changes the frequency of exactly one distinct element, the aforementioned process creates <math>O(n)</math> line segments.
Moreover, for a vertical line <math>x=i</math>, all horizontal line segments intersecting it are sorted according to their frequencies. Note that each horizontal line segment with <math>x</math>-interval <math>[\ell,r]</math> corresponds to exactly one distinct element <math>e</math> in <math>A</math>, such that <math>A[\ell]=e</math>. A top-k query can then be answered by shooting a vertical ray <math>x=i</math> and reporting the first <math>k</math> horizontal line segments that intersect it (recall that these line segments are already sorted according to their frequencies) in <math>O(k)</math> time.
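
The segment construction can be sketched as below. This is a simplified illustration only: the names <code>build_segments</code> and <code>top_k_prefix</code> are hypothetical, and the query scans and sorts all segments instead of keeping them pre-sorted along each vertical line, so it does not achieve the <math>O(k)</math> query time of the structure described above.

```python
def build_segments(A):
    """Return one horizontal segment (left, right, freq, element) per position.

    A segment records that `element` has frequency `freq` in every prefix
    A[0..x] with left <= x <= right.
    """
    n = len(A)
    open_segment = {}   # element -> index of its currently open segment
    freq = {}           # element -> frequency in the prefix seen so far
    segments = []
    for pos, e in enumerate(A):
        if e in open_segment:
            # The frequency of e changes here, so close its previous segment.
            left, _, f, _ = segments[open_segment[e]]
            segments[open_segment[e]] = (left, pos - 1, f, e)
        freq[e] = freq.get(e, 0) + 1
        open_segment[e] = len(segments)
        segments.append((pos, n - 1, freq[e], e))
    return segments


def top_k_prefix(segments, i, k):
    """The k most frequent elements of the prefix A[0..i], as (element, freq)."""
    hit = [(f, e) for (left, right, f, e) in segments if left <= i <= right]
    hit.sort(key=lambda fe: -fe[0])      # highest frequency first
    return [(e, f) for f, e in hit[:k]]


A = ["a", "b", "a", "c", "a", "b"]
segments = build_segments(A)
print(len(segments))                     # one segment per array position
print(top_k_prefix(segments, 3, 2))
```

Each array position opens exactly one segment, which matches the linear bound on the number of segments stated above.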
 
Chan et al.<ref name=":0" /> first construct a [[range tree]] in which each branching node stores one copy of the data structure described above for one-sided range top-k queries and each leaf represents an element from <math>A</math>. The top-k data structure at each node is constructed based on the values existing in the subtrees of that node and is meant to answer one-sided range top-k queries. For a one-dimensional array <math>A</math>, a range tree can be constructed by dividing <math>A</math> into two halves and recursing on both halves; therefore, each node of the resulting range tree represents a range. This range tree requires <math>O(n \log n)</math> words of space, because there are <math>O(\log n)</math> levels and each level <math>\ell</math> has <math>2^{\ell}</math> nodes. Moreover, since at each level <math>\ell</math> of a range tree all nodes have a total of <math>n</math> elements of <math>A</math> in their subtrees and since there are <math>O(\log n)</math> levels, the space complexity of this range tree is <math>O(n \log n)</math>.
Line 109:
for <math>0\leq i \leq \log(depth(x))</math> where <math>\operatorname{par}(x)</math> returns the label of the direct parent of node <math>x</math>. Put another way, for each marked node, the set of all paths with a power-of-two length (plus one for the node itself) towards the root is stored. Moreover, for each <math>P_i(x)</math>, the set of all majority ''candidates'' <math>C_i(x)</math> is stored. More specifically, <math>C_i(x)</math> contains the set of all <math>(\tau/2)</math>-majorities in <math>P_i(x)</math>, that is, labels that appear more than <math>(\tau/2)\cdot(2^i+1)</math> times in <math>P_i(x)</math>. The set of candidates <math>C_i(x)</math> can have at most <math>2/\tau</math> distinct labels for each <math>i</math>. Gagie et al.<ref name=":2"/> then note that the set of all <math>\tau</math>-majorities in the path from any marked node <math>x</math> to one of its ancestors <math>z</math> is included in some <math>C_i(x)</math> (Lemma 2 in <ref name=":2"/>): since the length of <math>P_i(x)</math> is equal to <math>2^i+1</math>, there exists a <math>P_i(x)</math> for <math>0\leq i \leq \log(depth(x))</math> whose length is between <math>d_{xz} \text{ and } 2 d_{xz}</math>, where <math>d_{xz}</math> is the distance between <math>x</math> and <math>z</math>. The existence of such <math>P_i(x)</math> implies that a <math>\tau</math>-majority in the path from <math>x</math> to <math>z</math> must be a <math>(\tau/2)</math>-majority in <math>P_i(x)</math>, and thus must appear in <math>C_i(x)</math>. This data structure requires <math>O(n \log n)</math> words of space, because, as mentioned above, in the construction phase <math>O(\tau n)</math> nodes are marked and for each marked node some candidate sets are stored. By definition, for each marked node <math>O(\log n)</math> such sets are stored, each of which contains <math>O(1/\tau)</math> candidates.
Therefore, this data structure requires <math>O(\log n \times (1/\tau) \times \tau n)=O(n \log n)</math> words of space. Each node <math>x</math> also stores <math>count(x)</math>, the number of instances of <math>label(x)</math> on the path from <math>x</math> to the root of <math>T</math>; this does not increase the space complexity since it only adds a constant number of words per node.
 
Each query between two nodes <math>u</math> and <math>v</math> can be answered by using the decomposability property (as explained above) of range <math>\tau</math>-majority queries and by breaking the query path between <math>u</math> and <math>v</math> into four subpaths. Let <math>z</math> be the lowest common ancestor of <math>u</math> and <math>v</math>, with <math>x</math> and <math>y</math> being the nearest marked ancestors of <math>u</math> and <math>v</math> respectively. The path from <math>u</math> to <math>v</math> is decomposed into the paths from <math>u</math> and <math>v</math> to <math>x</math> and <math>y</math> respectively (the lengths of these paths are smaller than <math>2\lceil 1 / \tau\rceil</math> by definition, so all of their labels are considered as candidates), and the paths from <math>x</math> and <math>y</math> to <math>z</math> (by finding the suitable <math>C_i(x)</math> as explained above and considering all of its labels as candidates). Boundary nodes have to be handled so that all of these subpaths are disjoint and a set of <math>O(1/\tau)</math> candidates is derived from them. Each of these candidates is then verified using a combination of the <math>labelanc (x, \ell)</math> query, which returns the lowest ancestor of node <math>x</math> that has label <math>\ell</math>, and the <math>count(x)</math> fields of each node. On a <math>w</math>-bit RAM and an alphabet of size <math>\sigma</math>, the <math>labelanc (x, \ell)</math> query can be answered in <math>O\left(\log \log _{w} \sigma\right) </math> time whilst having linear space requirements.<ref>{{Cite journal|last1=He|first1=Meng|last2=Munro|first2=J. Ian|last3=Zhou|first3=Gelin|date=2014-07-08|title=A Framework for Succinct Labeled Ordinal Trees over Large Alphabets|url=http://dx.doi.org/10.1007/s00453-014-9894-4|journal=Algorithmica|volume=70|issue=4|pages=696–717|doi=10.1007/s00453-014-9894-4|s2cid=253977813 |issn=0178-4617|url-access=subscription}}</ref> Therefore, verifying each of the <math>O(1/\tau)</math> candidates in <math>O\left(\log \log _{w} \sigma\right) </math> time results in <math>O\left((1/\tau)\log \log _{w} \sigma\right) </math> total query time for returning the set of all <math>\tau </math>-majorities on the path from <math>u </math> to <math>v </math>.
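
The verification step can be sketched as follows, assuming a tree given as a hypothetical <code>parent</code> array (with <code>-1</code> for the root) and a <code>label</code> array. Here <code>labelanc</code> and the lowest common ancestor are computed by naive upward walks, which is an assumption made for brevity and does not reflect the succinct structures cited above; <code>labelanc</code> is taken to include the node itself.

```python
def ancestors(parent, x):
    """Yield x, its parent, and so on up to the root."""
    while x != -1:
        yield x
        x = parent[x]


def build_counts(parent, label):
    """count[x] = number of nodes labelled label[x] on the path from x to the root."""
    return [sum(1 for a in ancestors(parent, x) if label[a] == label[x])
            for x in range(len(parent))]


def labelanc(parent, label, x, l):
    """Lowest ancestor of x (x included) with label l, or -1 if there is none."""
    for a in ancestors(parent, x):
        if label[a] == l:
            return a
    return -1


def lca(parent, u, v):
    """Lowest common ancestor of u and v, found by walking up from both."""
    seen = set(ancestors(parent, u))
    return next(a for a in ancestors(parent, v) if a in seen)


def occurrences_on_path(parent, label, count, u, v, l):
    """Number of nodes labelled l on the path from u to v."""
    def to_root(x):
        a = labelanc(parent, label, x, l)
        return count[a] if a != -1 else 0
    z = lca(parent, u, v)
    # Both root paths contain the path root..z, and z itself lies on u..v.
    return to_root(u) + to_root(v) - 2 * to_root(z) + (1 if label[z] == l else 0)


def verify_candidates(parent, label, count, u, v, candidates, tau):
    """Keep the candidates that are tau-majorities of the path from u to v."""
    depth = lambda x: sum(1 for _ in ancestors(parent, x))
    length = depth(u) + depth(v) - 2 * depth(lca(parent, u, v)) + 1
    return {l for l in candidates
            if occurrences_on_path(parent, label, count, u, v, l) > tau * length}


parent = [-1, 0, 0, 1, 1, 2]
label = ["a", "b", "a", "a", "b", "a"]
count = build_counts(parent, label)
print(verify_candidates(parent, label, count, 3, 5, {"a", "b"}, 0.5))
```

With the stored <code>count</code> fields, each candidate needs only three <code>labelanc</code> lookups, which is why the cost of that query dominates the verification time.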
 
==Related problems==
All the problems described above have been studied for higher dimensions as well as their dynamic versions. On the other hand, range queries might be extended to other data structures like [[Tree (data structure)|trees]],<ref name="morin kranakis">{{cite book |doi=10.1007/978-3-540-31856-9_31 |chapter-url=https://cglab.ca/~morin/publications/ds/rmq2-stacs.pdf |chapter=Approximate Range Mode and Range Median Queries |title=Stacs 2005 |series=Lecture Notes in Computer Science |date=2005 |last1=Bose |first1=Prosenjit |last2=Kranakis |first2=Evangelos |last3=Morin |first3=Pat |last4=Tang |first4=Yihui |volume=3404 |pages=377–388 |isbn=978-3-540-24998-6 }}</ref> such as the [[level ancestor problem]]. A similar family of problems is that of [[Range searching|orthogonal range]] queries, also known as counting queries.
 
== See also ==
Line 122:
 
==External links==
*[https://opendatastructures.org/versions/edition-0.1c/ods-java/node64.html Open Data Structure - Chapter 13 - Data Structures for Integers]
*[https://www.cs.au.dk/~gerth/papers/isaac09median.pdf Data Structures for Range Median Queries - Gerth Stolting Brodal and Allan Gronlund Jorgensen]
 
{{CS-Trees}}