Range query (computer science): Difference between revisions

Content deleted Content added
Headers and math formatting.
Examples: Added tree paths too
Line 91:
 
Using this structure, a range <math>\tau</math>-majority query <math>A[i..j]</math> on <math>A[0..n-1]</math> with <math>0\leq i\leq j \leq n</math> is answered as follows. First, the [[lowest common ancestor]] (LCA) of leaf nodes <math>i</math> and <math>j</math> is found in constant time. Note that there exists a data structure requiring <math>O(n)</math> bits of space that is capable of answering the LCA queries in <math>O(1)</math> time <ref>{{Cite journal|last=Sadakane|first=Kunihiko|last2=Navarro|first2=Gonzalo|date=2010-01-17|title=Fully-Functional Succinct Trees|url=http://dx.doi.org/10.1137/1.9781611973075.13|journal=Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms|___location=Philadelphia, PA|publisher=Society for Industrial and Applied Mathematics|doi=10.1137/1.9781611973075.13}}</ref>. Let <math>z</math> denote the LCA of <math>i </math> and <math>j</math>, using <math>z</math> and according to the decomposability of range <math>\tau</math>-majority queries (as described above and in <ref name=":1" />), the two-sided range query <math>A[i..j]</math> can be converted into two one-sided range top-k queries (from <math>z</math> to <math>i</math> and <math>j</math>). These two one-sided range top-k queries return the top-(<math>1/\tau</math>) most frequent elements in each of their respective ranges in <math>O(1/\tau)</math> time. These frequent elements make up the set of ''candidates'' for <math>\tau</math>-majorities in <math>A[i..j]</math> in which there are <math>O(1/\tau)</math> candidates some of which might be false positives. Each candidate is then assessed in constant time using a linear-space data structure (as described in Lemma 3 in <ref>{{Cite journal|last=Chan|first=Timothy M.|last2=Durocher|first2=Stephane|last3=Larsen|first3=Kasper Green|last4=Morrison|first4=Jason|last5=Wilkinson|first5=Bryan T.|date=2013-03-08|title=Linear-Space Data Structures for Range Mode Query in Arrays|url=http://dx.doi.org/10.1007/s00224-013-9455-2|journal=Theory of Computing Systems|volume=55|issue=4|pages=719–741|doi=10.1007/s00224-013-9455-2|issn=1432-4350}}</ref>) that is able to determine in <math>O(1)</math> time whether or not a given subrange of an array <math>A</math> contains at least <math>q</math> instances of a particular element <math>e</math>.
 
==== Tree paths ====
Gagie et al. <ref name=":2">{{Cite journal|last=Gagie|first=Travis|last2=He|first2=Meng|last3=Navarro|first3=Gonzalo|last4=Ochoa|first4=Carlos|date=2020-09|title=Tree path majority data structures|url=http://dx.doi.org/10.1016/j.tcs.2020.05.039|journal=Theoretical Computer Science|volume=833|pages=107–119|doi=10.1016/j.tcs.2020.05.039|issn=0304-3975}}</ref> proposed a data structure which supports queries such that, given two nodes <math>u</math> and <math>v</math> in a tree, are able to report the list of elements that have a greater relative frequency than <math>\tau</math> on the path from <math>u</math> to <math>v</math>. More formally, let <math>T</math> be a labelled tree in which each node has a label from an alphabet of size <math>\sigma</math>. Let <math>label(u)\in [1,\dots,\sigma]</math> denote the label of node <math>u</math> in <math>T</math>. Let <math>P_{uv}</math> denote the unique path from <math>u</math> to <math>v</math> in <math>T</math> in which middle nodes are listed in the order they are visited. Given <math>T</math>, and a fixed (specified during preprocessing) threshold <math>0<\tau<1</math>, a query <math>Q(u,v)</math> must return the set of all labels that appear more than <math>\tau|P_{uv}|</math> times in <math>P_{uv}</math>.
 
To construct this data structure, first <math>{O}(\tau n)</math> nodes are ''marked''. This can be done by marking any node that has distance at least <math>\lceil 1 / \tau\rceil</math> from the bottom of the three (height) and whose depth is divisible by <math>\lceil 1 / \tau\rceil</math>. After doing this, it can be observed that the distance between each node and its nearest marked ancestor is less than <math>2\lceil 1 / \tau\rceil</math>. For a marked node <math>x</math>, <math>\log(depth(x))</math> different sequences (paths towards the root) <math>P_i(x)</math> are stored,
 
<math>P_{i}(x)=\left\langle \operatorname{label}(x), \operatorname{par}(x), \operatorname{par}^{2}(x), \ldots, \operatorname{par}^{2^i}(x)\right\rangle
</math>
 
for <math>0\leq i \leq \log(depth(x))</math> where <math>\operatorname{par}(x)</math> returns the label of the direct parent of node <math>x</math>. Put another way, for each marked node, the set of all paths with a power of two length (plus one for the node itself) towards the root is stored. Moreover, for each <math>P_i(x)</math>, the set of all majority ''candidates'' <math>C_i(x)</math> are stored. More specifically, <math>C_i(x)</math> contains the set of all <math>(\tau/2)</math>-majorities in <math>P_i(x)</math> or labels that appear more than <math>(\tau/2).(2^i+1)</math> times in <math>P_i(x)</math>. It is easy to see that the set of candidates <math>C_i(x)</math> can have at most <math>2/\tau</math> distinct labels for each <math>i</math>. Gagie et al. <ref name=":2">{{Cite journal|last=Gagie|first=Travis|last2=He|first2=Meng|last3=Navarro|first3=Gonzalo|last4=Ochoa|first4=Carlos|date=2020-09|title=Tree path majority data structures|url=http://dx.doi.org/10.1016/j.tcs.2020.05.039|journal=Theoretical Computer Science|volume=833|pages=107–119|doi=10.1016/j.tcs.2020.05.039|issn=0304-3975}}</ref> then note that the set of all <math>\tau</math>-majorities in the path from any marked node <math>x</math> to one of its ancestors <math>z</math> is included in some <math>C_i(x)</math> (Lemma 2 in <ref name=":2">{{Cite journal|last=Gagie|first=Travis|last2=He|first2=Meng|last3=Navarro|first3=Gonzalo|last4=Ochoa|first4=Carlos|date=2020-09|title=Tree path majority data structures|url=http://dx.doi.org/10.1016/j.tcs.2020.05.039|journal=Theoretical Computer Science|volume=833|pages=107–119|doi=10.1016/j.tcs.2020.05.039|issn=0304-3975}}</ref>) since the length of <math>P_i(x)</math> is equal to <math>(2^i+1)</math> thus there exists a <math>P_i(x)</math> for <math>0\leq i \leq \log(depth(x))</math> whose length is between <math>d_{xz} \text{ and } 2 d_{xz}</math> where <math>d_{xz}</math> is the distance between x and z. The existence of such <math>P_i(x)</math> implies that a <math>\tau</math>-majority in the path from <math>x</math> to <math>z</math> must be a <math>(\tau/2)</math>-majority in <math>P_i(x)</math>, and thus must appear in <math>C_i(x)</math>. It is easy to see that this data structure require <math>O(n \log n)</math> words of space, because as mentioned above in the construction phase <math>O(\tau n)</math> nodes are marked and for each marked node some candidate sets are stored. By definition, for each marked node <math>O(\log n)</math> of such sets are stores, each of which contains <math>O(1/\tau)</math> candidates. Therefore, this data structure requires <math>O(\log n \times (1/\tau) \times \tau n)=O(n \log n)</math> words of space. Please note that each node <math>x</math> also stores <math>count(x)</math> which is equal to the number of instances of <math>label(x)</math> on the path from <math>x</math> to the root of <math>T</math>, this does not increase the space complexity since it only adds a constant number of words per node.
 
Each query between two nodes <math>u</math> and <math>v</math> can be answered by using the decomposability property (as explained above) of range <math>\tau</math>-majority queries and by breaking the query path between <math>u</math> and <math>v</math> into four subpaths. Let <math>z</math> be the lowest common ancestor of <math>u</math> and <math>v</math>, with <math>x</math> and <math>y</math> being the nearest marked ancestors of <math>u</math> and <math>v</math> respectively. The path from <math>u</math> to <math>v</math> is decomposed into the paths from <math>u</math> and <math>v</math> to <math>x</math> and <math>y</math> respectively (the size of these paths are smaller than <math>2\lceil 1 / \tau\rceil</math> by definition, all of which are considered as candidates), and the paths from <math>x</math> and <math>y</math> to <math>z</math> (by finding the suitable <math>C_i(x)</math> as explained above and considering all of its labels as candidates). Please note that, boundary nodes have to be handled accordingly so that all of these subpaths are disjoint and from all of them a set of <math>O(1/\tau)</math> candidates is derived. Each of these candidates is then verified using a combination of the <math>labelanc (x, \ell)</math> query which returns the lowest ancestor of node <math>x</math> that has label <math>\ell</math> and the <math>count(x)</math> fields of each node. On a <math>w</math>-bit RAM and an alphabet of size <math>\sigma</math>, the <math>labelanc (x, \ell)</math> query can be answered in <math>O\left(\log \log _{w} \sigma\right) </math> time whilst having linear space requirements <ref>{{Cite journal|last=He|first=Meng|last2=Munro|first2=J. Ian|last3=Zhou|first3=Gelin|date=2014-07-08|title=A Framework for Succinct Labeled Ordinal Trees over Large Alphabets|url=http://dx.doi.org/10.1007/s00453-014-9894-4|journal=Algorithmica|volume=70|issue=4|pages=696–717|doi=10.1007/s00453-014-9894-4|issn=0178-4617}}</ref>. Therefore, verifying each of the <math>O(1/\tau)</math> candidates in <math>O\left(\log \log _{w} \sigma\right) </math> time results in <math>O\left((1/\tau)\log \log _{w} \sigma\right) </math> total query time for returning the set of all <math>\tau </math>-majorities on the path from <math>u </math> to <math>v </math>.
 
==Related problems==