→Examples: I added range majority queries as examples of range queries and discussed range tau-majority queries on 2D arrays. I will add more in the coming days. |
→Range Majority Queries on Two-Dimensional Arrays: Added range majority queries on one-dimensional arrays. Will add more soon. |
Line 74:
If a linear algorithm to find the medians is used, the total cost of preprocessing for {{mvar|k}} range median queries is <math> n\log k</math>. The algorithm can also be modified to solve the [[online algorithm|online]] version of the problem.<ref name=ethpaper />
===Majority===
Finding frequent elements in a given set of items is one of the most important tasks in data mining. Finding frequent elements can be difficult when most items have similar frequencies. Therefore, it can be more useful to apply some threshold of significance when detecting such items. One of the most famous algorithms for finding the majority of an array was proposed by Boyer and Moore<ref>{{Citation|last=Boyer|first=Robert S.|title=MJRTY—A Fast Majority Vote Algorithm|date=1991|url=http://dx.doi.org/10.1007/978-94-011-3488-0_5|work=Automated Reasoning Series|pages=105–117|place=Dordrecht|publisher=Springer Netherlands|access-date=2021-12-18|last2=Moore|first2=J. Strother}}</ref> and is also known as the [[Boyer–Moore majority vote algorithm]]. It finds the majority element of a string (if it has one) in <math>O(n)</math> time using <math>O(1)</math> space. In the context of Boyer and Moore’s work, and generally speaking, a majority element in a set of items (for example, a string or an array) is one whose number of instances is more than half of the size of that set. A few years later, Misra and Gries<ref>{{Cite journal|last=Misra|first=J.|last2=Gries|first2=David|date=1982-11|title=Finding repeated elements|url=http://dx.doi.org/10.1016/0167-6423(82)90012-0|journal=Science of Computer Programming|volume=2|issue=2|pages=143–152|doi=10.1016/0167-6423(82)90012-0|issn=0167-6423}}</ref> proposed a more general version of Boyer and Moore's algorithm that uses <math>O(n \log (1 / \tau))</math> comparisons to find all items in an array whose relative frequencies are greater than some threshold <math>0<\tau<1</math>. A range <math>\tau</math>-majority query is one that, given a subrange of a data structure (for example, an array) of size <math>|R|</math>, returns the set of all distinct items that appear more than (or, in some publications, at least) <math>\tau |R|</math> times in that range.
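The two algorithms mentioned above can be sketched as follows. This is an illustrative Python sketch, not the authors' original formulation: the function names are hypothetical, and the second function uses the common counter-based presentation of the Misra–Gries idea (backed by a dictionary) rather than the comparison-based implementation that achieves the <math>O(n \log (1 / \tau))</math> bound.

```python
import math
from collections import Counter

def boyer_moore_majority(items):
    """Return the element occurring more than len(items)/2 times, or None."""
    candidate, count = None, 0
    for x in items:                      # pass 1: isolate the only possible candidate
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # pass 2: the candidate is only a guess and must be verified
    if items and items.count(candidate) > len(items) / 2:
        return candidate
    return None

def misra_gries_tau_majorities(items, tau):
    """Return all elements occurring more than tau * len(items) times."""
    k = math.ceil(1 / tau)               # at most k - 1 counters are kept
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:                            # no free counter: decrement all of them
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # surviving keys are candidates only; count them exactly in a second pass
    exact = Counter(x for x in items if x in counters)
    threshold = tau * len(items)
    return {x for x, c in exact.items() if c > threshold}
```

With <math>\tau = 1/2</math> the second function keeps a single counter and behaves like the majority vote algorithm.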
In different structures that support range <math>\tau</math>-majority queries, <math>\tau </math> can be either static (specified during preprocessing) or dynamic (specified at query time). Many such approaches are based on the fact that, regardless of the size of the range, for a given <math>\tau</math> there can be at most <math>O(1/\tau)</math> distinct ''candidates'' with relative frequencies at least <math>\tau</math>. By verifying each of these candidates in constant time, <math>O(1/\tau)</math> query time is achieved. A range <math>\tau</math>-majority query is decomposable<ref name=":1">{{Cite book|last=Karpiński|first=Marek|url=http://worldcat.org/oclc/277046650|title=Searching for frequent colors in rectangles|oclc=277046650}}</ref> in the sense that a <math>\tau</math>-majority in a range <math>R</math> with partitions <math>R_1</math> and <math>R_2</math> must be a <math>\tau</math>-majority in either <math>R_1</math> or <math>R_2</math>. Due to this decomposability, some data structures answer <math>\tau</math>-majority queries on one-dimensional arrays by finding the [[lowest common ancestor]] (LCA) of the endpoints of the query range in a [[range tree]] and validating, in constant time per candidate, two sets of candidates (each of size <math>O(1/\tau)</math>), one for each path from an endpoint to the lowest common ancestor, which results in <math>O(1/\tau)</math> query time.
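The decomposability property can be checked on a small example. The following Python sketch is a brute-force illustration under the strict ("more than <math>\tau |R|</math>") definition; the function names are hypothetical and no preprocessing is performed.

```python
from collections import Counter

def tau_majorities(items, tau):
    """Brute force: elements occurring more than tau * len(items) times."""
    threshold = tau * len(items)
    return {x for x, c in Counter(items).items() if c > threshold}

def candidates_by_decomposition(items, split, tau):
    """Union of the tau-majorities of the two parts items[:split] and items[split:]."""
    left, right = items[:split], items[split:]
    return tau_majorities(left, tau) | tau_majorities(right, tau)

A = [1, 1, 2, 3, 1, 2, 2, 2]
tau = 0.25
answer = tau_majorities(A, tau)          # elements occurring more than 2 times
# Every tau-majority of the whole range is a tau-majority of at least one part,
# whichever split point is chosen, so the union is a valid candidate set.
for split in range(len(A) + 1):
    assert answer <= candidates_by_decomposition(A, split, tau)
```

The candidate set may contain false positives (elements that are frequent in one part only), which is why each candidate still has to be verified against the whole range.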
Gagie et al. <ref>{{Citation|last=Gagie|first=Travis|title=Finding Frequent Elements in Compressed 2D Arrays and Strings|date=2011|url=http://dx.doi.org/10.1007/978-3-642-24583-1_29|work=String Processing and Information Retrieval|pages=295–300|place=Berlin, Heidelberg|publisher=Springer Berlin Heidelberg|isbn=978-3-642-24582-4|access-date=2021-12-18|last2=He|first2=Meng|last3=Munro|first3=J. Ian|last4=Nicholson|first4=Patrick K.}}</ref> proposed a data structure that supports range <math>\tau</math>-majority
<math>\beta=2^{-i}, \;\; i\in \left\{1,\dots,\log\left(\frac{1}{\alpha}\right)\right\}</math>
Line 83:
where <math>\beta</math> is the preprocessing threshold of the <math>i</math>-th instance. Thus, for query blocks smaller than <math>1/\alpha</math> the <math>\lceil\log (1 / \tau)\rceil</math>-th instance is queried. As mentioned above, this data structure has query time <math>O(1/\tau)</math> and requires <math>\mathcal{O}(m n(H+1) \log^2 (1 / \alpha))</math> bits of space by storing a Huffman-encoded copy of it (note the <math>\log(1/\alpha)</math> factor and also see [[Huffman coding]]).
==== Range Majority Queries on One-Dimensional Arrays ====
Chan et al.<ref name=":0">{{Citation|last=Chan|first=Timothy M.|title=Linear-Space Data Structures for Range Minority Query in Arrays|date=2012|url=http://dx.doi.org/10.1007/978-3-642-31155-0_26|work=Algorithm Theory – SWAT 2012|pages=295–306|place=Berlin, Heidelberg|publisher=Springer Berlin Heidelberg|isbn=978-3-642-31154-3|access-date=2021-12-20|last2=Durocher|first2=Stephane|last3=Skala|first3=Matthew|last4=Wilkinson|first4=Bryan T.}}</ref> proposed a data structure that, given a one-dimensional array <math>A</math>, a subrange <math>R</math> of <math>A</math> (specified at query time) and a threshold <math>\tau</math> (specified at query time), returns the list of all <math>\tau</math>-majorities in <math>O(1/\tau)</math> time while requiring <math>O(n \log n)</math> words of space. To answer such queries, Chan et al.<ref name=":0" /> begin by noting that there exists a data structure capable of returning the ''top-k'' most frequent items in a range in <math>O(k)</math> time requiring <math>O(n)</math> words of space. For a one-dimensional array <math>A[0..n-1]</math>, let a one-sided top-k range query be of the form <math>A[0..i] \text { for } 0 \leq i \leq n-1</math>. For each maximal sequence of ranges <math>A[0..i] \text { through } A[0..j]</math> in which the frequency of a distinct element <math>e</math> in <math>A</math> remains unchanged (and equal to <math>f</math>), a horizontal line segment is constructed. The <math>x</math>-interval of this line segment is <math>[i,j]</math> and its <math>y</math>-value is <math>f</math>. Since adding each element to <math>A</math> changes the frequency of exactly one distinct element, this process creates <math>O(n)</math> line segments. Moreover, for a vertical line <math>x=i</math>, all horizontal line segments intersecting it are sorted according to their frequencies.
Note that each horizontal line segment with <math>x</math>-interval <math>[\ell,r]</math> corresponds to exactly one distinct element <math>e</math> in <math>A</math>, such that <math>A[\ell]=e</math>. A top-k query can then be answered by shooting a vertical ray <math>x=i</math> and reporting the <math>k</math> highest horizontal line segments that intersect it (recall from above that these line segments are already sorted according to their frequencies) in <math>O(k)</math> time.
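The segment construction described above can be illustrated with a short Python sketch. It builds the same horizontal line segments, but answers a query by scanning and sorting all segments, so it does not achieve the <math>O(k)</math> query time of the actual structure; the function names and the tuple layout are assumptions made for illustration.

```python
def build_segments(A):
    """Return segments (element, frequency, x_start, x_end) for prefixes of A."""
    segments = []
    open_seg = {}                         # element -> (frequency, x_start) of its open segment
    for x, e in enumerate(A):
        if e in open_seg:
            freq, start = open_seg[e]
            segments.append((e, freq, start, x - 1))   # close the old segment
            open_seg[e] = (freq + 1, x)                # frequency rises at x
        else:
            open_seg[e] = (1, x)
    for e, (freq, start) in open_seg.items():          # close segments still open
        segments.append((e, freq, start, len(A) - 1))
    return segments

def top_k_prefix(segments, i, k):
    """Top-k most frequent elements of A[0..i]: segments crossing the line x = i."""
    hit = [(freq, e) for e, freq, lo, hi in segments if lo <= i <= hi]
    hit.sort(key=lambda t: -t[0])         # highest segments first
    return [e for _, e in hit[:k]]

A = ['a', 'b', 'a', 'c', 'a', 'b']
segments = build_segments(A)
```

Each array position starts exactly one segment, which is why the number of segments equals the length of the array.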
Chan et al.<ref name=":0" /> first construct a [[range tree]] in which each branching node stores one copy of the data structure described above for one-sided range top-k queries and each leaf represents an element from <math>A</math>. The top-k data structure at each node is constructed from the values in the subtree of that node and is meant to answer one-sided range top-k queries. Note that for a one-dimensional array <math>A</math>, a range tree can be constructed by dividing <math>A</math> into two halves and recursing on both halves; therefore, each node of the resulting range tree represents a range. This range tree requires <math>O(n \log n)</math> words of space: there are <math>O(\log n)</math> levels, level <math>\ell</math> has <math>2^{\ell}</math> nodes, and the nodes of any single level have a total of <math>n</math> elements of <math>A</math> in their subtrees, so each level contributes <math>O(n)</math> words.
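The space argument can be checked with a small sketch that builds the halving range tree over index ranges and sums, per level, the number of array elements covered by the nodes of that level. The dictionary-based node layout and the function names are assumptions for illustration; the per-node top-k structures themselves are not built here.

```python
def build_range_tree(lo, hi):
    """Node covering A[lo..hi] (inclusive); leaves cover a single element."""
    node = {"lo": lo, "hi": hi, "left": None, "right": None}
    if lo < hi:
        mid = (lo + hi) // 2
        node["left"] = build_range_tree(lo, mid)
        node["right"] = build_range_tree(mid + 1, hi)
    return node

def elements_per_level(node, depth=0, totals=None):
    """Map each level to the total number of elements covered by its nodes."""
    if totals is None:
        totals = {}
    totals[depth] = totals.get(depth, 0) + (node["hi"] - node["lo"] + 1)
    for child in (node["left"], node["right"]):
        if child is not None:
            elements_per_level(child, depth + 1, totals)
    return totals

n = 8
totals = elements_per_level(build_range_tree(0, n - 1))
total_stored = sum(totals.values())      # n elements on each of log2(n) + 1 levels
```

For <math>n = 8</math> every one of the four levels covers all eight elements, giving 32 stored entries in total, in line with the <math>O(n \log n)</math> bound.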
Using this structure, a range <math>\tau</math>-majority query <math>A[i..j]</math> on <math>A[0..n-1]</math> with <math>0\leq i\leq j \leq n-1</math> is answered as follows. First, the [[lowest common ancestor]] (LCA) of leaf nodes <math>i</math> and <math>j</math> is found in constant time. Note that there exists a data structure requiring <math>O(n)</math> bits of space that is capable of answering LCA queries in <math>O(1)</math> time.<ref>{{Cite journal|last=Sadakane|first=Kunihiko|last2=Navarro|first2=Gonzalo|date=2010-01-17|title=Fully-Functional Succinct Trees|url=http://dx.doi.org/10.1137/1.9781611973075.13|journal=Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms|location=Philadelphia, PA|publisher=Society for Industrial and Applied Mathematics|doi=10.1137/1.9781611973075.13}}</ref> Let <math>z</math> denote the LCA of <math>i </math> and <math>j</math>. Using <math>z</math> and the decomposability of range <math>\tau</math>-majority queries (as described above and in <ref name=":1" />), the two-sided range query <math>A[i..j]</math> can be converted into two one-sided range top-k queries (from <math>z</math> to <math>i</math> and <math>j</math>). These two one-sided range top-k queries return the top-(<math>1/\tau</math>) most frequent elements in each of their respective ranges in <math>O(1/\tau)</math> time. Together, these frequent elements form the set of ''candidates'' for <math>\tau</math>-majorities in <math>A[i..j]</math>; there are <math>O(1/\tau)</math> candidates, some of which might be false positives.
Each candidate is then assessed in constant time using a linear-space data structure (as described in Lemma 3 in <ref>{{Cite journal|last=Chan|first=Timothy M.|last2=Durocher|first2=Stephane|last3=Larsen|first3=Kasper Green|last4=Morrison|first4=Jason|last5=Wilkinson|first5=Bryan T.|date=2013-03-08|title=Linear-Space Data Structures for Range Mode Query in Arrays|url=http://dx.doi.org/10.1007/s00224-013-9455-2|journal=Theory of Computing Systems|volume=55|issue=4|pages=719–741|doi=10.1007/s00224-013-9455-2|issn=1432-4350}}</ref>) that is able to determine in <math>O(1)</math> time whether or not a given subrange of an array <math>A</math> contains at least <math>q</math> instances of a particular element <math>e</math>.
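The overall query procedure can be sketched in Python as follows. This is a simplified illustration, not the data structure of Chan et al.: the split point is found by descending the halving range tree instead of a constant-time LCA structure, the candidates are obtained by counting directly instead of querying stored top-k structures, and each candidate is verified by binary search over sorted position lists in <math>O(\log n)</math> time rather than <math>O(1)</math>. All function names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict

def build_positions(A):
    """Map each element to the sorted list of indices where it occurs."""
    pos = defaultdict(list)
    for idx, e in enumerate(A):
        pos[e].append(idx)
    return pos

def count_in_range(pos, e, i, j):
    """Number of occurrences of e in A[i..j], by binary search."""
    p = pos.get(e, [])
    return bisect_right(p, j) - bisect_left(p, i)

def split_point(n, i, j):
    """Midpoint of the LCA node of leaves i and j in the halving range tree."""
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if j <= mid:
            hi = mid
        elif i > mid:
            lo = mid + 1
        else:
            return mid                    # i <= mid < j: the query range is split here
    return lo

def range_tau_majorities(A, pos, i, j, tau):
    """Elements occurring more than tau * (j - i + 1) times in A[i..j]."""
    k = math.ceil(1 / tau)
    m = split_point(len(A), i, j)
    left = Counter(A[i:m + 1]).most_common(k)        # stand-in for a one-sided top-k query
    right = Counter(A[m + 1:j + 1]).most_common(k)   # stand-in for the other one-sided query
    candidates = {e for e, _ in left} | {e for e, _ in right}
    threshold = tau * (j - i + 1)
    return {e for e in candidates if count_in_range(pos, e, i, j) > threshold}

A = [1, 2, 1, 1, 3, 2, 2, 2]
pos = build_positions(A)
result = range_tau_majorities(A, pos, 0, 3, 0.25)    # query on A[0..3] = [1, 2, 1, 1]
```

The verification step is what removes false positives: an element that is frequent in only one of the two parts is discarded when its count over the whole query range does not exceed the threshold.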