Perfect hash function

{{Short description|Hash function without any collisions}}
[[File:Hash table 4 1 1 0 0 0 0 LL.svg|thumb|240px|right|A perfect hash function for the four names shown]]
[[File:Hash table 4 1 0 0 0 0 0 LL.svg|thumb|240px|right|A minimal perfect hash function for the four names shown]]
In [[computer science]], a '''perfect hash function''' {{mvar|h}} for a set {{mvar|S}} is a [[hash function]] that maps distinct elements in {{mvar|S}} to a set of {{mvar|m}} integers, with no [[hash collision|collisions]]. In mathematical terms, it is an [[injective function]].

Perfect hash functions may be used to implement a [[lookup table]] with constant worst-case access time. Like any [[hash function]], a perfect hash function can be used to implement [[hash table]]s, with the advantage that no [[Hash table#Collision resolution|collision resolution]] has to be implemented. In addition, if the keys are not part of the data and it is known that queried keys will be valid, then the keys do not need to be stored in the lookup table, saving space.

The main disadvantage of perfect hash functions is that {{mvar|S}} needs to be known in order to construct them. Non-dynamic perfect hash functions need to be re-constructed if {{mvar|S}} changes. For frequently changing {{mvar|S}}, [[dynamic perfect hashing|dynamic perfect hash functions]] may be used at the cost of additional space.<ref name="DynamicPerfectHashing" /> The space requirement to store the perfect hash function is in {{math|''O''(''n'')}}, where {{mvar|n}} is the number of keys in the structure.

The important performance parameters for perfect hash functions are the evaluation time, which should be constant, the construction time, and the representation size.
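The idea of a collision-free lookup table that stores no keys can be illustrated with a minimal sketch. It is not a method described in this article: it simply tries seeds of a salted hash until one is injective on a small, fixed key set, which is only practical for very small sets. The key set, the names <code>seeded_hash</code>, <code>find_perfect_seed</code> and <code>lookup</code>, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def seeded_hash(key: str, seed: int, m: int) -> int:
    """Hash `key` into range(m); `seed` selects one function from a family."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % m


def find_perfect_seed(keys, m, max_tries=100_000):
    """Return the first seed whose hash is injective on `keys` (brute force)."""
    for seed in range(max_tries):
        slots = {seeded_hash(k, seed, m) for k in keys}
        if len(slots) == len(keys):  # no two keys share a slot
            return seed
    raise RuntimeError("no collision-free seed found")


# Hypothetical static key set with associated values.
ages = {"alice": 31, "bob": 27, "carol": 45, "dave": 38}
m = len(ages)  # m == n, so the result is a minimal perfect hash function
seed = find_perfect_seed(list(ages), m)

# The table stores only values: no keys and no collision chains.
table = [None] * m
for name, age in ages.items():
    table[seeded_hash(name, seed, m)] = age


def lookup(name: str) -> int:
    """Constant-time lookup; only meaningful for keys in the original set."""
    return table[seeded_hash(name, seed, m)]
```

Because the table holds no keys, a lookup for a name outside the original set silently returns some other entry's value, which is why the queried keys must be known to be valid.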
| last1 = Lu | first1 = Yi | author1-link = Yi Lu (computer scientist) | last2 = Prabhakar | first2 = Balaji | author2-link = Balaji Prabhakar | last3 = Bonomi | first3 = Flavio | author3-link = Flavio Bonomi | title = 2006 IEEE International Symposium on Information Theory | chapter = Perfect Hashing for Network Applications | doi = 10.1109/ISIT.2006.261567 | pages = 2774–2778 | year = 2006 | isbn = 1-4244-0505-X | s2cid = 1494710 }}</ref>

==Performance of perfect hash functions==
The important performance parameters for perfect hashing are the representation size, the evaluation time, the construction time, and additionally the range requirement <math>\frac{m}{n}</math> (average number of buckets per key in the hash table).<ref name="CHD"/> The evaluation time can be as fast as {{math|''O''(1)}}, which is optimal.<ref name="inventor"/><ref name="CHD"/> The construction time is at least {{math|Ω(''n'')}}, because each of the {{mvar|n}} elements in {{mvar|S}} needs to be considered. This lower bound can be achieved in practice.<ref name="CHD"/>

The lower bound for the representation size depends on {{mvar|m}} and {{mvar|n}}. Let {{math|''m'' {{=}} (1+ε) ''n''}} and let {{mvar|h}} be a perfect hash function. A good approximation for the lower bound is <math>\log_2 e - \varepsilon \log_2 \frac{1+\varepsilon}{\varepsilon}</math> bits per element.
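The approximation can be evaluated numerically. The sketch below assumes base-2 logarithms, which is consistent with the 1.44 bits figure quoted for minimal perfect hashing; the function name is hypothetical. For {{math|ε {{=}} 0}} the second term is taken as its limit, 0.

```python
import math


def space_lower_bound_bits_per_key(eps: float) -> float:
    """Approximate lower bound log2(e) - eps * log2((1 + eps) / eps).

    `eps` is the slack in the range size, m = (1 + eps) * n.
    """
    if eps < 0:
        raise ValueError("eps must be non-negative")
    if eps == 0:
        # eps * log2((1 + eps) / eps) tends to 0 as eps tends to 0.
        return math.log2(math.e)
    return math.log2(math.e) - eps * math.log2((1 + eps) / eps)


for eps in (0.0, 0.23, 1.0):
    print(f"eps = {eps:.2f}: {space_lower_bound_bits_per_key(eps):.3f} bits/key")
```

With {{math|ε {{=}} 0.23}} (that is, {{math|''m'' {{=}} 1.23''n''}}) this evaluates to roughly 0.89 bits per key, in line with the 0.88 bits/key bound cited for that range in this article.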
For minimal perfect hashing, {{math|ε {{=}} 0}}, the lower bound is {{math|log<sub>2</sub> ''e'' ≈ 1.44}} bits per element.<ref name="CHD"/>

| title = Storing a Sparse Table with {{math|''O''(1)}} Worst Case Access Time
| volume = 31
| year = 1984
| s2cid = 5399743
| doi-access = free
}}</ref>

As {{harvtxt|Fredman|Komlós|Szemerédi|1984}} show, there exists a choice of the parameter {{mvar|k}} such that the sum of the lengths of the ranges for the {{mvar|n}} different values of {{math|''g''(''x'')}} is {{math|''O''(''n'')}}. Additionally, for each value of {{math|''g''(''x'')}}, there exists a linear modular function that maps the corresponding subset of {{mvar|S}} into the range associated with that value. Both {{mvar|k}} and the second-level functions for each value of {{math|''g''(''x'')}} can be found in [[polynomial time]] by choosing values randomly until finding one that works.<ref name="inventor"/>

The hash function itself requires storage space {{math|''O''(''n'')}} to store {{mvar|k}}, {{mvar|p}}, and all of the second-level linear modular functions. Computing the hash value of a given key {{mvar|x}} may be performed in constant time by computing {{math|''g''(''x'')}}, looking up the second-level function associated with {{math|''g''(''x'')}}, and applying this function to {{mvar|x}}. A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps {{mvar|S}} into a smaller range of length {{math|''n'' + ''o''(''n'')}}.<ref name="inventor"/>

A more recent method for constructing a perfect hash function is described by {{harvtxt|Belazzougui|Botelho|Dietzfelbinger|2009}} as "hash, displace, and compress". Here a first-level hash function {{mvar|g}} is also used to map elements onto a range of {{mvar|r}} integers.
An element {{math|''x'' ∈ ''S''}} is stored in the bucket {{math|''B''<sub>''g''(''x'')</sub>}}.<ref name="CHD" />

Then, in descending order of size, each bucket's elements are hashed by a hash function from a sequence of independent fully random hash functions {{math|(Φ<sub>1</sub>, Φ<sub>2</sub>, Φ<sub>3</sub>, ...)}}, starting with {{math|Φ<sub>1</sub>}}. If the hash function does not produce any collisions for the bucket, and the resulting values are not yet occupied by elements from other buckets, the function is chosen for that bucket. If not, the next hash function in the sequence is tested.<ref name="CHD" />

To evaluate the perfect hash function {{math|''h''(''x'')}} one only has to save the mapping σ of the bucket index {{math|''g''(''x'')}} onto the correct hash function in the sequence, resulting in {{math|''h''(''x'') {{=}} Φ<sub>σ(''g''(''x''))</sub>(''x'')}}.<ref name="CHD" />

Finally, to reduce the representation size, the {{math|(σ(''i''))<sub>0 ≤ ''i'' < ''r''</sub>}} are compressed into a form that still allows the evaluation in {{math|''O''(1)}}.<ref name="CHD" />

This approach needs linear time in {{mvar|n}} for construction, and constant evaluation time. The representation size is in {{math|''O''(''n'')}}, and depends on the achieved range. For example, with {{math|''m'' {{=}} 1.23''n''}}, {{harvtxt|Belazzougui|Botelho|Dietzfelbinger|2009}} achieved a representation size between 3.03 bits/key and 1.40 bits/key for their given example set of 10 million entries, with lower values needing a higher computation time.
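The bucket-by-bucket construction and the evaluation rule {{math|Φ<sub>σ(''g''(''x''))</sub>(''x'')}} described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: salted BLAKE2b stands in for the "independent fully random hash functions", seed 0 plays the role of {{mvar|g}}, seeds 1, 2, 3, ... play the role of {{math|Φ<sub>1</sub>, Φ<sub>2</sub>, Φ<sub>3</sub>, ...}}, and the final compression of σ is omitted. The names <code>build_sigma</code> and <code>chd_hash</code> are hypothetical.

```python
import hashlib


def _hash(key: str, seed: int, modulus: int) -> int:
    """Seeded hash into range(modulus); stand-in for g (seed 0) and Φ_l (seed l)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % modulus


def build_sigma(keys, m, r, max_index=1_000_000):
    """Return sigma, where sigma[i] is the index l chosen for bucket i."""
    # Split the (distinct) keys into r buckets using the first-level hash g.
    buckets = [[] for _ in range(r)]
    for key in keys:
        buckets[_hash(key, 0, r)].append(key)

    occupied = [False] * m  # which of the m output slots are already taken
    sigma = [1] * r         # empty buckets keep an arbitrary index

    # Process buckets in descending order of size.
    for i in sorted(range(r), key=lambda b: len(buckets[b]), reverse=True):
        bucket = buckets[i]
        if not bucket:
            continue
        for l in range(1, max_index + 1):
            slots = {_hash(x, l, m) for x in bucket}
            # Accept Φ_l if it is collision-free inside the bucket and
            # hits no slot claimed by a previously placed bucket.
            if len(slots) == len(bucket) and not any(occupied[j] for j in slots):
                sigma[i] = l
                for j in slots:
                    occupied[j] = True
                break
        else:
            raise RuntimeError("no suitable hash function found for a bucket")
    return sigma


def chd_hash(key: str, sigma, m: int) -> int:
    """Evaluate h(x) = Φ_{σ(g(x))}(x)."""
    return _hash(key, sigma[_hash(key, 0, len(sigma))], m)


# Hypothetical example: 200 keys, range m ≈ 1.23 n, r = n / 4 buckets.
keys = [f"key{i}" for i in range(200)]
m, r = 246, 50
sigma = build_sigma(keys, m, r)
values = [chd_hash(k, sigma, m) for k in keys]
```

Here σ is stored as a plain list of integers, so the sketch does not reach the bits-per-key figures reported for the compressed representation.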
For {{math|''m'' {{=}} 1.23''n''}}, the space lower bound is 0.88 bits/key.<ref name="CHD" />

{{missing information|section|RecSplit & "fingerprinting" [https://epubs.siam.org/doi/pdf/10.1137/1.9781611976007.14 recsplit paper]|date=March 2023}}

===Pseudocode===
 '''algorithm''' ''hash, displace, and compress'' '''is'''
 (1) Split S into buckets {{math|B<sub>i</sub> :{{=}} g<sup>−1</sup>({i}) ∩ S, 0 ≤ i < r}}
 (2) Sort buckets B<sub>i</sub> in descending order according to size |B<sub>i</sub>|
 (3) Initialize array T[0...m-1] with 0's
 (6) '''repeat''' forming K<sub>i</sub> ← { {{math|Φ}}<sub>l</sub>(x) | x ∈ B<sub>i</sub> }
 (6) '''until''' |K<sub>i</sub>| = |B<sub>i</sub>| '''and''' K<sub>i</sub> ∩ {j | T[j] = 1} = ∅
 (7) '''let''' σ(i) := the successful l
 (8) '''for all''' j ∈ K<sub>i</sub> '''let''' T[j] := 1
 (9) Transform (σ<sub>i</sub>)<sub>0≤i<r</sub> into compressed form, retaining {{math|''O''(1)}} access.

==Space lower bounds==
For minimal perfect hash functions the information theoretic space lower bound is
:<math>\log_2 e \approx 1.44</math>
bits/key.<ref name="CHD" />

For perfect hash functions, it is first assumed that the range of {{mvar|h}} is bounded by {{mvar|n}} as {{math|''m'' {{=}} (1+ε) ''n''}}.
With the formula given by {{harvtxt|Belazzougui|Botelho|Dietzfelbinger|2009}} and for a [[Universe (mathematics)|universe]] <math>U\supseteq S</math> whose size {{math|{{!}}''U''{{!}} {{=}} ''u''}} tends towards infinity, the space lower bound is
:<math>\log_2 e-\varepsilon \log_2\frac{1+\varepsilon}{\varepsilon}</math>
bits/key, minus {{math|log(''n'')}} bits overall.<ref name="CHD" />

==Extensions==
===Dynamic perfect hashing===
{{main article|Dynamic perfect hashing}}

===Minimal perfect hash function===
A minimal perfect hash function is a perfect hash function that maps {{mvar|n}} keys to {{mvar|n}} consecutive integers, usually the numbers from {{math|0}} to {{math|''n'' − 1}} or from {{math|1}} to {{mvar|n}}. A more formal way of expressing this is: Let {{mvar|j}} and {{mvar|k}} be elements of some finite set {{mvar|S}}. Then {{mvar|h}} is a minimal perfect hash function if and only if {{math|1=''h''(''j'') = ''h''(''k'')}} implies {{math|1=''j'' = ''k''}} ([[injectivity]]) and there exists an integer {{mvar|a}} such that the range of {{mvar|h}} is {{math|1=''a''..''a'' + {{!}}''S''{{!}} − 1}}. It has been proven that a general purpose minimal perfect hash scheme requires at least <math>\log_2 e \approx 1.44</math> bits/key.<ref name="CHD">{{citation
| last1 = Belazzougui | first1 = Djamal
| last2 = Botelho | first2 = Fabiano C.
| last3 = Dietzfelbinger | first3 = Martin
| contribution = Hash, displace, and compress
| contribution-url = https://cmph.sourceforge.net/papers/esa09.pdf
| doi = 10.1007/978-3-642-04128-0_61
| location = Berlin
| publisher = Springer
| series = [[Lecture Notes in Computer Science]]
| title = Algorithms - ESA 2009
| volume = 5757
| isbn = 978-3-642-04127-3
| year = 2009
| citeseerx = 10.1.1.568.130
| url = https://cmph.sourceforge.net/papers/esa09.pdf
}}.</ref> Assuming that <math>S</math> is a set of size <math>n</math> containing integers in the range <math>[1, 2^{o(n)}]</math>, it is known how to efficiently construct an explicit minimal perfect hash function from <math>S</math> to <math>\{1, 2, \ldots, n\}</math> that uses space <math>n \log_2 e + o(n)</math> bits and that supports constant evaluation time.<ref>{{Citation |last1=Hagerup |first1=Torben |title=Efficient Minimal Perfect Hashing in Nearly Minimal Space |date=2001 |url=http://dx.doi.org/10.1007/3-540-44693-1_28 |work=STACS 2001 |pages=317–326 |access-date=2023-11-12 |place=Berlin, Heidelberg |publisher=Springer Berlin Heidelberg |isbn=978-3-540-41695-1 |last2=Tholey |first2=Torsten |doi=10.1007/3-540-44693-1_28 |url-access=subscription }}</ref> In practice, there are minimal perfect hashing schemes that use roughly 1.56 bits/key if given enough time.<ref name="RecSplit">{{citation
| last1 = Esposito | first1 = Emmanuel
| last2 = Mueller Graf | first2 = Thomas
| arxiv = 1910.06416
| doi-access = free
}}.</ref><ref>[https://github.com/iwiwi/minimal-perfect-hash minimal-perfect-hash (GitHub)]</ref>

===k-perfect hashing===
 (6) '''repeat''' forming K<sub>i</sub> ← { {{math|Φ}}<sub>l</sub>(x) | x ∈ B<sub>i</sub> }
 (6) '''until''' |K<sub>i</sub>| = |B<sub>i</sub>| '''and''' K<sub>i</sub> ∩ {j | <u>T[j] = k</u>} = ∅
 (7) '''let''' σ(i) := the successful l
 (8) '''for all''' j ∈ K<sub>i</sub> '''set''' <u>T[j] ← T[j] + 1</u>

===Order preservation===
A minimal perfect hash function {{mvar|F}} is ''order preserving'' if keys are given in some order {{math|''a''<sub>1</sub>, ''a''<sub>2</sub>, ..., ''a''<sub>''n''</sub>}} and for any keys {{math|''a''<sub>''j''</sub>}} and {{math|''a''<sub>''k''</sub>}}, {{math|''j'' < ''k''}} implies {{math|''F''(''a''<sub>''j''</sub>) < ''F''(''a''<sub>''k''</sub>)}}.<ref>{{Citation |first=Bob |last=Jenkins |contribution=order-preserving minimal perfect hashing |title=Dictionary of Algorithms and Data Structures |editor-first=Paul E. |editor-last=Black |publisher=U.S. National Institute of Standards and Technology |date=14 April 2009 |accessdate=2013-03-05 |url=https://xlinux.nist.gov/dads/HTML/orderPreservMinPerfectHash.html}}</ref> In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function to store a lookup table of the positions of each key. This solution uses <math>O(n \log n)</math> bits, which is optimal in the setting where the comparison function for the keys may be arbitrary.<ref>{{citation |last1=Fox |first1=Edward A.
|title=Order-preserving minimal perfect hash functions and information retrieval |date=July 1991 |url=http://eprints.cs.vt.edu/archive/00000248/01/TR-91-01.pdf |journal=ACM Transactions on Information Systems |volume=9 |issue=3 |pages=281–308 |location=New York, NY, USA |publisher=ACM |doi=10.1145/125187.125200 |s2cid=53239140 |last2=Chen |first2=Qi Fan |last3=Daoud |first3=Amjad M. |last4=Heath |first4=Lenwood S.}}.</ref> However, if the keys {{math|''a''<sub>1</sub>, ''a''<sub>2</sub>, ..., ''a''<sub>''n''</sub>}} are integers drawn from a universe <math>\{1, 2, \ldots, U\}</math>, then it is possible to construct an order-preserving hash function using only <math>O(n \log \log \log U)</math> bits of space.<ref>{{citation |last1=Belazzougui |first1=Djamal |title=Theory and practice of monotone minimal perfect hashing |date=November 2008 |journal=Journal of Experimental Algorithmics |volume=16 |at=Art. no.
3.2, 26pp |doi=10.1145/1963190.2025378 |s2cid=2367401 |last2=Boldi |first2=Paolo |last3=Pagh |first3=Rasmus |last4=Vigna |first4=Sebastiano |author3-link=Rasmus Pagh}}.</ref> Moreover, this bound is known to be optimal.<ref>{{Citation |last1=Assadi |first1=Sepehr |title=Tight Bounds for Monotone Minimal Perfect Hashing |date=January 2023 |url=http://dx.doi.org/10.1137/1.9781611977554.ch20 |work=Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) |pages=456–476 |access-date=2023-04-27 |place=Philadelphia, PA |publisher=Society for Industrial and Applied Mathematics |isbn=978-1-61197-755-4 |last2=Farach-Colton |first2=Martín |last3=Kuszmaul |first3=William |doi=10.1137/1.9781611977554.ch20 |arxiv=2207.10556 }}</ref>

==Related constructions==
While well-dimensioned hash tables have amortized average {{math|''O''(1)}} time for lookups, insertions, and deletions, most hash table algorithms suffer from possible worst-case times that take much longer. A worst-case {{math|''O''(1)}} time (constant time even in the worst case) would be better for many applications (including [[network router]]s and [[memory cache]]s).<ref name="davis">Timothy A. Davis. [https://www.cs.wm.edu/~tadavis/cs303/ch05sm.pdf "Chapter 5 Hashing"]: subsection "Hash Tables with Worst-Case O(1) Access"</ref>{{rp|41}}

Few hash table algorithms support worst-case {{math|''O''(1)}} lookup time. The few that do include: perfect hashing; [[dynamic perfect hashing]]; [[cuckoo hashing]]; [[hopscotch hashing]]; and [[extendible hashing]].<ref name="davis" />{{rp|42–69}}

A simple alternative to perfect hashing, which also allows dynamic updates, is [[cuckoo hashing]].
This scheme maps keys to two or more locations within a range (unlike perfect hashing, which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time.<ref>{{citation
| last1 = Pagh | first1 = Rasmus | author1-link = Rasmus Pagh
| year = 2004}}.</ref>

== References ==
{{reflist|30em}}

== Further reading ==
* Richard J. Cichelli. ''Minimal Perfect Hash Functions Made Simple'', Communications of the ACM, Vol. 23, Number 1, January 1980.
* [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]], and [[Clifford Stein]]. ''[[Introduction to Algorithms]]'', Third Edition. MIT Press, 2009. {{ISBN|978-0262033848}}. Section 11.5: Perfect hashing, pp. 267, 277–282.
* Fabiano C. Botelho, [[Rasmus Pagh]] and Nivio Ziviani. [https://arxiv.org/abs/cs/0702159 "Perfect Hashing for Data Management Applications"].
* Fabiano C. Botelho and [[Nivio Ziviani]]. [http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf "External perfect hashing for very large key sets"]. 16th ACM Conference on Information and Knowledge Management (CIKM07), Lisbon, Portugal, November 2007.
* Djamal Belazzougui, Paolo Boldi, [[Rasmus Pagh]], and Sebastiano Vigna. [https://web.archive.org/web/20140125080021/http://vigna.dsi.unimi.it/ftp/papers/MonotoneMinimalPerfectHashing.pdf "Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses"]. In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press.
* Marshall D. Brain and Alan L. Tharp. "Near-perfect Hashing of Large Word Sets". Software: Practice and Experience, vol. 19(10), 967–978, October 1989. John Wiley & Sons.
* Douglas C. Schmidt, [http://www.dre.vanderbilt.edu/~schmidt/PDF/gperf.pdf GPERF: A Perfect Hash Function Generator], C++ Report, SIGS, Vol. 10, No. 10, November/December, 1998.
* Hans-Peter Lehmann, Thomas Mueller, Rasmus Pagh, Giulio Ermanno Pibiri, Peter Sanders, Sebastiano Vigna, Stefan Walzer, "Modern Minimal Perfect Hashing: A Survey", {{arxiv|2506.06536}}, June 2025. Discusses post-1997 developments in the field.

== External links ==
* [https://www.gnu.org/software/gperf/ gperf] is an [[open source]] C and C++ perfect hash generator (very fast, but only works for small sets)
* [http://burtleburtle.net/bob/hash/perfect.html Minimal Perfect Hashing (bob algorithm)] by Bob Jenkins
* [https://cmph.sourceforge.net/index.html cmph]: C Minimal Perfect Hashing Library, open source implementations for many (minimal) perfect hashes (works for big sets)
* [http://sux.di.unimi.it/ Sux4J]: open source monotone minimal perfect hashing in Java
* [https://web.archive.org/web/20130729211948/http://www.dupuis.me/node/9 MPHSharp]: perfect hashing methods in C#