Count-distinct problem: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Add: url. | You can use this bot yourself. Report bugs here. | Suggested by SemperIocundus | via #UCB_webform
m Min/max sketches: authorlink
Line 34:
 
=== Min/max sketches ===
Min/max sketches <ref name=cosma2011>{{cite journal | last1=Cosma | first1=Ioana A. | last2 = Clifford | first2 = Peter | year=2011 | title=A statistical analysis of probabilistic counting algorithms | journal=Scandinavian Journal of Statistics| arxiv=0801.3552 }} </ref><ref>{{cite book | last1=Giroire | first1=Frederic | title=2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO) | pages=223–231 | last2 = Fusy | first2 = Eric | year=2007 |citeseerx=10.1.1.214.270 |doi=10.1137/1.9781611972979.9| isbn=978-1-61197-297-9 }} </ref> store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. <ref>{{cite journal | last1=Chassaing | first1=Philippe | last2=Gerin | first2=Lucas | year=2006| title=Efficient estimation of the cardinality of large data sets | journal=Proceedings of the 4th Colloquium on Mathematics and Computer Science| bibcode=2007math......1347C | arxiv=math/0701347 }} </ref> presents max sketch which is the [[minimum-variance unbiased estimator]] for the problem. The continuous max sketches estimator <ref name=edithCohen>{{cite journal | last1=Cohen | first1=Edith |authorlink=Edith Cohen| year=1997| title=Size-estimation framework with applications to transitive closure and reachability | journal=J. Comput. Syst. Sci.| volume=55 | issue=3 | pages=441–453 | doi=10.1006/jcss.1997.1534 }} </ref> is the [[maximum likelihood]] estimator. The estimator of choice in practice is the [[HyperLogLog]] algorithm.<ref name=hyperloglog>{{cite journal | last1=Flajolet | first1=Philippe|author1-link=Philippe Flajolet | last2 = Fusy | first2 = Eric | last3=Gandouet | first3=Olivier | last4=Meunier | first4=Frederic | title=HyperLoglog: the analysis of a near-optimal cardinality estimation algorithm | journal=Analysis of Algorithms |year=2007|url=https://hal.archives-ouvertes.fr/docs/00/40/61/66/PDF/FlFuGaMe07.pdf}} </ref>
 
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element <math> e_j </math> is associated with a uniform [[Random variable|RV]], <math> h(e_j) \sim U(0,1) </math>, the expected minimum value of <math> h(e_1),h(e_2), \ldots, h(e_n) </math> is <math> 1/(n+1) </math>. The hash function guarantees that <math> h(e_j) </math> is identical for all the appearances of <math> e_j </math>. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
 
There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation by [[Philippe Flajolet|Flajolet]] et al. <ref>{{cite journal | last1=Flajolet | first1=Philippe|author1-link=Philippe Flajolet | last2 = Martin | first2 = G. Nigel | year=1985 | title=Probabilistic counting algorithms for data base applications | journal=J. Comput. Syst. Sci. | volume=31| issue=2| pages=182–209| doi=10.1016/0022-0000(85)90041-8| url=https://hal.inria.fr/inria-00076244/file/RR-0313.pdf}} </ref> describes a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by [[Daniel Kane (mathematician)|Daniel M. Kane]], Jelani Nelson, and David P. Woodruff.<ref name=optimalf0>{{cite journal | last1=Kane | first1=Daniel M. | last2 = Nelson | first2 = Jelani | last3=Woodruff | first3=David P. | year=2010 | authorlink1=Daniel_Kane_(mathematician) | title=An Optimal Algorithm for the Distinct Elements Problem | journal=Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS)|url=https://dash.harvard.edu/bitstream/handle/1/13820438/f0.pdf;sequence=1}} </ref>
 
===Bottom-''m'' sketches===