Perfect hash function: Difference between revisions

Content deleted Content added
what hash is generated for "John Doe"?
added disadvantages to perfect hashing to the summary
Line 1:
[[File:Hash table 4 1 1 0 0 0 0 LL.svg|thumb|240px|right|A perfect hash function for the four names shown]]
[[File:Hash table 4 1 0 0 0 0 0 LL.svg|thumb|240px|right|A minimal perfect hash function for the four names shown]]
In [[computer science]], a '''perfect hash function''' {{mvar|h}} for a set {{mvar|S}} is a [[hash function]] that maps distinct elements in {{mvar|S}} to a set of {{mvar|m}} integers, with no [[hash collision|collisions]]. In mathematical terms, it is an [[injective function]].
 
Perfect hash functions may be used to implement a [[lookup table]] with constant worst-case access time. A perfect hash function hascan, manyas of the sameany [[Hashhash function#Applications|applications]], asbe otherused to implement [[hash functions,table|hash buttables]], with the advantage that no [[Hash table#Collision resolution|collision resolution]] has to be implemented. In addition, if the keys are not the data and if it is known that queried keys will be valid, then the keys do not need to be stored in the lookup table, saving space.
 
Disadvantages of perfect hash functions are that {{mvar|S}} needs to be known for the construction of the perfect hash function, and re-construction is required if {{mvar|S}} changes. The space requirement to store the perfect hash function is in {{math|''O''(''n'')}}.
 
The important performance parameters for perfect hash functions are the evaluation time, which should be constant, the construction time, and the representation size.
 
==Application==
A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from {{mvar|S}} (or other associated values) in a [[lookup table]] indexed by the output of the function. One can then test whether a key is present in {{mvar|S}}, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes [[constant time]] in the [[Worst-case complexity|worst case]].<ref name="inventor"/> With perfect hashing, the associated data can be read or written with a single access to the table.<ref>{{citation
| last1 = Lu | first1 = Yi | author1-link = Yi Lu
| last2 = Prabhakar | first2 = Balaji | author2-link = Balaji Prabhakar
| last3 = Bonomi | first3 = Flavio | author3-link = Flavio Bonomi
| doi = 10.1109/ISIT.2006.261567
| journal = 2006 [[IEEE International Symposium on Information Theory]]
| pages = 2774-2778
| title = Perfect Hashing for Network Applications
| year = 2006}}</ref>
 
==Performance of perfect hash functions==
The important performance parameters for perfect hashing are the representation size, the evaluation time, the construction time, and additionally the range requirement <math>\frac{m}{n}</math>.<ref name="CHD"/> The evaluation time can be as fast as {{math|''O''(''1'')}}, which is optimal.<ref name="inventor"/><ref name="CHD"/> The construction time needs to be at least {{math|''O''(''n'')}}, because each element in {{mvar|S}} needs to be considered, and {{mvar|S}} contains {{mvar|n}} elements. This lower bound can be achieved in practice.<ref name="CHD"/>
 
The lower bound for the representation size depends on {{mvar|m}} and {{mvar|n}}. Let {{math|''m'' {{=}} (1+&epsilon;) ''n''}} and {{mvar|h}} a perfect hash function. A good approximation for the lower bound is <math>\log e - \varepsilon \log \frac{1+\varepsilon}{\varepsilon}</math> Bits per element. For minimal perfect hashing, {{math|&epsilon; {{=}} 0}}, the lower bound is {{math|log e &approx; 1.44}} bits per element.<ref name="CHD"/>
 
==Construction==
Line 30 ⟶ 47:
The hash function itself requires storage space {{math|''O''(''n'')}} to store {{mvar|k}}, {{mvar|p}}, and all of the second-level linear modular functions. Computing the hash value of a given key {{mvar|x}} may be performed in constant time by computing {{math|''g''(''x'')}}, looking up the second-level function associated with {{math|''g''(''x'')}}, and applying this function to {{mvar|x}}.
A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps {{mvar|S}} into a smaller range of length {{math|''n'' + ''o''(''n'')}}.<ref name="inventor"/>
 
 
A more recent method for constructing a perfect hash function is described by {{harvtxt|Belazzougui|Botelho|Dietzfelbinger|2009}} as "hash, displace, and compress". Here a first-level hash function {{mvar|g}} is also used to map elements onto a range of {{mvar|r}} integers. An element {{math|''x'' &in; ''S''}} is stored in the Bucket {{mvar|B<sub>g(x)</sub>}}.<ref name="CHD" />
 
Then, in descending order of size, each bucket's elements are hashed by a hash function of a sequence of independent fully random hash functions {{math|(&Phi;<sub>1</sub>, &Phi;<sub>2</sub>, &Phi;<sub>3</sub>, ...)}}, starting with {{math|&Phi;<sub>1</sub>}}. If the hash function does not produce any collisions for the bucket, and the resulting values are not yet occupied by other elements from other buckets, the function is chosen for that bucket. If not, the next hash function in the sequence is tested.<ref name="CHD" />
 
To evaluate the perfect hash function {{math|''h''(''x'')}} one only has to save the mapping &sigma; of the bucket index {{math|''g''(''x'')}} onto the correct hash function in the sequence, resulting in {{math|h(x) {{=}} &Phi;<sub>&sigma;(g(x))</sub>}}<ref name="CHD" />
 
Finally, to reduce the representation size, the (&sigma;(i))<sub>0&le;i<r</sub> are compressed into a form that still allows the evaluation in {{math|''O''(''1'')}}.<ref name="CHD" />
 
This approach needs linear time in {{mvar|n}} for construction, constant evaluation time, and has a representation size in {{math|''O''(''n'')}}.
 
===Pseudocode===
'''algorithm''' ''hash, displace, and compress'' '''is'''
(1) Split S into buckets B<sub>i</sub>:={{thin space}}g<sup>-1</sup>({i})&cap;{{thin space}}S,0&le;i<r
(2) Sort buckets B<sub>i</sub> in falling order according to size |B<sub>i</sub>|
(3) Initialize array T[0...m-1] with 0's
(4) '''for all''' i{{thin space}}&in;[r], in the order from (2), '''do'''
(5) '''for''' l{{thin space}}&larr;{{thin space}}1,2,...
(6) '''repeat''' forming K<sub>i</sub>{{thin space}}&larr;{{thin space}}{{{math|&Phi;}}<sub>l</sub>(x)|x{{thin space}}&in;{{thin space}}B<sub>i</sub>}
(6) '''until''' |K<sub>i</sub>|=|B<sub>i</sub>| '''and''' K<sub>i</sub>&cap;{j|T[j]=1}={{thin space}}&emptyset;
(7) '''let''' &sigma;(i):= the successful l
(8) '''for all''' j{{thin space}}&in;{{thin space}}K<sub>i</sub> '''let''' T[j]:={{thin space}}1
(9) Transform (&sigma;<sub>i</sub>)<sub>0&le;i<r</sub> into compressed form, retaining {{math|''O''(''1'')}} access.
 
==Space lower bounds==
Line 44 ⟶ 85:
| volume = 5
| year = 1984}}.</ref>
 
For minimal perfect hash functions the information theoretic space lower bound is
:<math>\log_2e\approx1.44</math>
Bits per element.<ref name="CHD" />
 
For perfect hash functions, it is first assumed that the range of {{mvar|h}} is bounded by {{mvar|n}} as {{math|''m'' {{=}} (1+&epsilon;) ''n''}}. With the formula given by {{harvtxt|Belazzougui|Botelho|Dietzfelbinger|2009}} and for a [[Universe (mathematics)|universe]] <math>U\supseteq S</math> whose size {{math|{{!}}''U''{{!}} {{=}} ''u''}} tends towards infinity, the space lower bounds is
:<math>\log_2e-\varepsilon \log\frac{1+\varepsilon}{\varepsilon}</math>
Bits per element, minus {{math|log(''n'')}} bits overall.<ref name="CHD" />
 
==Extensions==
Line 66 ⟶ 115:
 
===Minimal perfect hash function===
A minimal perfect hash function is a perfect hash function that maps {{mvar|n}} keys to {{mvar|n}} consecutive integers – usually the numbers from {{math|0}} to {{math|''n'' &minus; 1}} or from {{math|1}} to {{mvar|n}}. A more formal way of expressing this is: Let {{mvar|j}} and {{mvar|k}} be elements of some finite set {{mvar|S}}. Then {{mvar|Fh}} is a minimal perfect hash function if and only if {{math|1=''Fh''(''j'') = ''Fh''(''k'')}} implies {{math|1=''j'' = ''k''}} ([[injectivity]]) and there exists an integer {{mvar|a}} such that the range of {{mvar|Fh}} is {{math|1=''a''..''a'' + {{!}}''S''{{!}} &minus; 1}}. It has been proven that a general purpose minimal perfect hash scheme requires at least 1.44 bits/key.<ref name="CHD">{{citation
| last1 = Belazzougui | first1 = Djamal
| last2 = Botelho | first2 = Fabiano C.
Line 96 ⟶ 145:
| doi-access = free
}}.</ref>
 
===k-perfect hashing===
A hash function is {{mvar|k}}-perfect if at most {{mvar|k}} elements from {{mvar|S}} are mapped onto the same value in the range. The "hash, displace, and compress" algorithm can be used to construct {{mvar|k}}-perfect hash functions by allowing up to {{mvar|k}} collisions. The changes necessary to accomplish this are minimal, and are underlined in the adapted pseudocode below:
(4) '''for all''' i{{thin space}}&in;[r], in the order from (2), '''do'''
(5) '''for''' l{{thin space}}&larr;{{thin space}}1,2,...
(6) '''repeat''' forming K<sub>i</sub>{{thin space}}&larr;{{thin space}}{{{math|&Phi;}}<sub>l</sub>(x)|x{{thin space}}&in;{{thin space}}B<sub>i</sub>}
(6) '''until''' |K<sub>i</sub>|=|B<sub>i</sub>| '''and''' K<sub>i</sub>&cap;{j|<u>T[j]=k</u>}={{thin space}}&emptyset;
(7) '''let''' &sigma;(i):= the successful l
(8) '''for all''' j{{thin space}}&in;{{thin space}}K<sub>i</sub> '''set''' <u>T[j]&larr;T[j]+1</u>
 
===Order preservation===