Cuckoo hashing: Difference between revisions

Content deleted Content added
m History: Adding web.archive.org links for citations with url-status=live Category:CS1_maint:_url-status
Bender the Bot (talk | contribs)
m External links: HTTP to HTTPS for Blogspot
 
(46 intermediate revisions by 27 users not shown)
Line 2:
[[Image:Cuckoo hashing example.svg|thumb|Cuckoo hashing example. The arrows show the alternative ___location of each key. A new item would be inserted in the ___location of A by moving A to its alternative ___location, currently occupied by B, and moving B to its alternative ___location which is currently vacant. Insertion of a new item in the ___location of H would not succeed: Since H is part of a cycle (together with W), the new item would get kicked out again.]]
 
'''Cuckoo hashing''' is a scheme in [[computer programming]] for resolving [[hash collision]]s of values of [[hash function]]s in a [[hash table|table]], with [[Best, worst and average case|worst-case]] [[constant time|constant]] lookup time. The name derives from the behavior of some species of [[cuckoo]], where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as [[brood parasitism]]; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different ___location in the table.
 
==History==
Cuckoo hashing was first described by [[Rasmus Pagh]] and [[Flemming Friche Rodler]] in a 2001 conference paper.<ref name="Cuckoo">{{Cite book | last1 = Pagh | first1 = Rasmus | author1-link = Rasmus Pagh | last2 = Rodler | first2 = Flemming Friche| chapter = Cuckoo Hashing | doi = 10.1007/3-540-44676-1_10 | title = Algorithms — ESA 2001 | series = Lecture Notes in Computer Science | volume = 2161| year = 2001 | isbn = 978-3-540-42493-2 | citeseerx = 10.1.1.25.4189}}</ref> The paper was awarded the [[European Symposium on Algorithms]] Test-of-Time award in 2020.<ref>{{Cite web |authorsothers=Award committee: [[Uri Zwick]], [[Samir Khuller]], [[Edith Cohen]]|title=ESA - European Symposium on Algorithms: ESA Test-of-Time Award 2020|url=http://esa-symposium.org/test-of-time-award.php |url-status=livedeviated |access-date=2021-05-22 |website=esa-symposium.org|archive-url=https://web.archive.org/web/2017012110385420210522135306/http://esa-symposium.org:80/test-of-time-award.php |archive-date=20172021-0105-21 22}}</ref>{{rp|p=122}}
 
==Operations==
Cuckoo hashing is a form of [[open addressing]] in which each non-empty cell of a [[hash table]] contains a [[unique key|key]] or [[Attribute–valuename–value pair|key–value pair]]. A [[hash function]] is used to determine the ___location for each key, and its presence in the table (or the value associated with it) can be found by examining that cell of the table. However, open addressing suffers from [[hash collision|collisions]], which happens when more than one key is mapped to the same cell.
The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.{{r|Cuckoo|p=121-122}}
 
===Lookup===
Cuckoo hashing uses two hash tables, <math>T_1</math> and <math>T_2</math>. and assumingAssuming <math>r</math> beis the length of each tablestable, the hash functions for the two tables is defined as, <math>h_1,\ h_2\ :\ \cupS \rightarrow \{0, ..., r-1\}</math> and <math>\forall x \in S</math> where <math>x</math> beis the key and <math>S</math> beis the set whose keys are stored in <math>h_1(x)</math> of <math>T_1</math> or <math>h_2(x)</math> of <math>T_2</math>. The lookup operation is as follows:{{r|Cuckoo|p=124}}
{|
|- style="vertical-align:top"
Line 23:
 
===Deletion===
Deletion is performed in <math>O(1)</math> time since thereprobing isn'tis involvementnot ofinvolved. probing—notThis considerationignores the cost of the shrinking operation if the table is too sparse.{{r|Cuckoo|p=124-125}}
 
===Insertion===
InsertionWhen ofinserting a new [[key-valuename–value pair|item]] with key <math>x</math>, the first step involves examining if the slot <math>h_1(x)</math> of the table <math>T_1</math> is occupied;. ifIf it is not, the item is inserted atin that cellslot. However, if the slot is occupied, the preoccupiedexisting item gets removed—let it be <math>x'</math>—and is removed and <math>x</math> is inserted at <math>T_1[h_1(x)]</math>. The removed itemThen, <math>x'</math> is inserted into the table <math>T_2</math> by following the same procedure;. theThe process continues until an empty position is found to insert the key.{{r|Cuckoo|p=124-125}} To avoid the possiblean [[infinite iterationloop]] in the process loop, a threshold <math>\text{Max-Loop}</math> is specified. suchIf thatthe ifnumber theof [[iterations]] exceeds thethis fixed [[Critical_value|threshold]], the hash tables—bothboth <math>T_1</math> and <math>T_2</math>—are are [[Hash_table#Resizing_by_moving_all_entries|rehashed]] with newernew hash functions and the insertion procedure repeats. FollowingThe isfollowing ais [[pseudocode]] for insertion:{{r|Cuckoo|p=125}}
{|
|- style="vertical-align:top"
Line 34:
3 '''return'''
4 '''end if'''
5 '''whileloop''' Max-Loop &le; '''thentimes'''
6 '''if''' <math>T_1[h_1(x)]</math> = <math>\bot</math> '''then'''
7 <math>T_1[h_1(x)]</math> := x
Line 42:
11 '''if''' <math>T_2[h_2(x)]</math> = <math>\bot</math> '''then'''
12 <math>T_2[h_2(x)]</math> := x
13 '''return'''
14 '''end if'''
15 x <math>\leftrightarrow T_2[h_2(x)]</math>
16 '''repeatend loop'''
17 rehash()
18 insert(x)
19 '''end function'''
|}
On lines 10 and 15, the "cuckoo approach" of kicking other keys—whichkeys waswhich preoccupied atoccupy <math>T_{1,2}[h_{1,2}(x)]</math>—takes placerepeats until every key has its own "nest", i.e. the item <math>x</math> is inserted into aan spotempty onslot in either one of the two tables;. theThe notation <math>x\leftrightarrow y</math> expresses the process of [[wiktionary:swap|swapping]] <math>x</math> and <math>y</math>.{{r|Cuckoo|p=124-125}}
 
==Theory==
Line 57:
One method of proving this uses the theory of [[random graph]]s: one may form an [[undirected graph]] called the "cuckoo graph" that has a vertex for each hash table ___location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a [[pseudoforest]], a graph with at most one cycle in each of its [[connected component (graph theory)|connected component]]s. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a [[random graph]] in the [[Erdős–Rényi model]]. With high probability, for load factor less than 1/2 (corresponding to a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2), the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. The same theory also proves that the expected size of a [[Connected component (graph theory)|connected component]] of the cuckoo graph is small, ensuring that each insertion takes constant expected time. However, also with high probability, a load factor greater than 1/2 will lead to a [[giant component]] with two or more cycles, causing the data structure to fail and need to be resized.<ref>{{Cite conference|first=Reinhard|last=Kutzelnigg|url=https://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewFile/dmAG0133/1710.pdf|conference=Fourth Colloquium on Mathematics and Computer Science|title=Bipartite random graphs and cuckoo hashing|series=Discrete Mathematics and Theoretical Computer Science|year=2006|volume=AG|pages=403–406}}</ref>
 
Since a theoretical random hash function requires too much space for practical usage, an important theoretical question is which practical hash functions suffice for Cuckoo hashing. One approach is to use [[k-independent hashing]]. In 2009 it was shown<ref>Cohen, Jeffrey S., and Daniel M. Kane. "Bounds on the independence required for cuckoo hashing." ACM Transactions on Algorithms (2009).</ref> that <math>O(\log n)</math>-independence suffices, and at least 6-independence is needed. Another approach is to use [[Tabulationtabulation hashing]], which is not 6-independent, but was shown in 2012<ref>Pǎtraşcu, Mihai, and Mikkel Thorup. "The power of simple tabulation hashing." Journal of the ACM (JACM) 59.3 (2012): 1-50.</ref> to have other properties sufficient for Cuckoo hashing.
A third approach from 2014<ref>Aumüller, Martin, Martin Dietzfelbinger, and Philipp Woelfel. "Explicit and efficient hash families suffice for cuckoo hashing with a stash." Algorithmica 70.3 (2014): 428-456.</ref> is to slightly modify the cuckoo hashtable with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.
 
Line 63:
In practice, cuckoo hashing is about 20–30% slower than [[linear probing]], which is the fastest of the common approaches.<ref name="Cuckoo"/>
The reason is that cuckoo hashing often causes two cache misses per search, to check the two locations where a key might be stored, while linear probing usually causes only one cache miss per search.
However, because of its worst case guarantees on search time, cuckoo hashing can still be valuable when [[Real-time computing|real-time response rates]] are required. One advantage of cuckoo hashing is its link-list free property, which fits GPU processing well.
 
==Example==
The following hash functions are given (the two least significant digits of k in base 11):
 
<math>h\left(k\right)=k\bmod 11</math><br/>
<math>h'\left(k\right)=\left\lfloor\frac{k}{11}\right\rfloor\bmod 11</math>
 
The following two tables show the insertion of some example elements. Each column corresponds to the state of the two hash tables over time. The possible insertion locations for each new value are highlighted. The last column illustrates a failed insertion due to a cycle, details below.
 
<div style="float:left; margin-right:1em">
{| class="wikitable"
|+ 1.Table table1: foruses h(k)
|-
|colspan=2|
!colspan=10| Key insertedSteps
|-
!colspan=2| kStep number
| 201 || 502 || 533 || 754 || 1005 || 676 || 1057 || 38 || 369 || 3910
|-
!colspan=2| Key inserted
| 53 || 50 || 20 || 75 || 100 || 67 || 105 || 3 || 36 || 45
|-
!colspan=2| h(k)
| 9 || 6 || 9 || 9 || 1 || 1 || 6 || 3 || 3 || 61
|-
! rowspan=11 {{vert header|va=middle|Hash table entries}}
Line 91 ⟶ 94:
|-
! 1
| || || || ||bgcolor=#ffffcc| 100 ||bgcolor=#ffffcc| 67 || 67 || 67 || 67 ||bgcolor=#ffffcc| 10045
|-
! 2
Line 106 ⟶ 109:
|-
! 6
| ||bgcolor=#ffffcc| 50 || 50 || 50 || 50 || 50 ||bgcolor=#ffffcc| 105 || 105 || 105 ||bgcolor=#ffffcc| 39105
|-
! 7
Line 115 ⟶ 118:
|-
! 9
|bgcolor=#ffffcc| 2053 || 2053 || bgcolor="#ffffcc" | 20 ||bgcolor=#ffffcc| 75 || 75 || 75 || 53 || 53 || 53 || 7553
|-
! 10
Line 124 ⟶ 127:
<div style="float:left">
{| class="wikitable"
|+ 2.Table table2: foruses h′(k)
|-
|colspan=2|
!colspan=10| Key insertedSteps
|-
!colspan=2| kStep number
| 201 || 502 || 533 || 754 || 1005 || 676 || 1057 || 38 || 369 || 3910
|-
!colspan=2| Key inserted
| 53 || 50 || 20 || 75 || 100 || 67 || 105 || 3 || 36 || 45
|-
!colspan=2| h′(k)
| 14 || 4 || 41 || 6 || 9 || 6 || 9 || 0 || 3 || 34
|-
! rowspan=11 {{vert header|va=middle|Hash table entries}}
Line 140 ⟶ 146:
|-
! 1
|bgcolor=#ffffcc| || ||bgcolor=#ffffcc| || 20 || 20 || 20 || 20 || 20 || 20 || 20
|-
! 2
Line 146 ⟶ 152:
|-
! 3
| || || || || || || || ||bgcolor=#ffffcc| ||bgcolor=#ffffcc|
|-
! 4
| ||bgcolor=#ffffcc| ||bgcolor=#ffffcc| || 53 || 53 || 53 || 53 || 50 || 50 || 50 ||bgcolor=#ffffcc| 5350
|-
! 5
Line 155 ⟶ 161:
|-
! 6
| || || ||bgcolor=#ffffcc| || ||bgcolor=#ffffcc| || 75 || 75 || 75 || 6775
|-
! 7
Line 164 ⟶ 170:
|-
! 9
| || || || ||bgcolor=#ffffcc| || 100 ||bgcolor=#ffffcc| 100 || 100 || 100 || 105100
|-
! 10
Line 173 ⟶ 179:
 
===Cycle===
If you now attempt to insert the element 645, then you get into a cycle, and fail. In the last row of the table we find the same initial situation as at the beginning again.
 
<math>h\left(645\right)=645\bmod 11=61</math><br/>
<math>h'\left(645\right)=\left\lfloor\frac{645}{11}\right\rfloor\bmod 11=04</math><br/>
{| class="wikitable"
|-
! tableTable 1
! tableTable 2
|-
| 645 replaces 5067 in cell 61 || 5067 replaces 5375 in cell 46
|-
| 5375 replaces 7553 in cell 9 || 7553 replaces 6750 in cell 64
|-
| 6750 replaces 100105 in cell 16 || 100105 replaces 105100 in cell 9
|-
| 105100 replaces 645 in cell 61 || 645 replaces 353 in cell 04
|-
| 353 replaces 3675 in cell 39 || 3675 replaces 3967 in cell 36
|-
| 3967 replaces 105100 in cell 61 || 105100 replaces 100105 in cell 9
|-
| 100 replaces 67 in cell 1 || 67 replaces 75 in cell 6
|-
| 75 replaces 53 in cell 9 || 53 replaces 50 in cell 4
|-
| 50 replaces 39 in cell 6 || 39 replaces 36 in cell 3
|-
| 36105 replaces 350 in cell 36 || 350 replaces 645 in cell 04
|-
| 645 replaces 5067 in cell 61 || 5067 replaces 5375 in cell 46
|}
 
Line 208:
Several variations of cuckoo hashing have been studied, primarily with the aim of improving its space usage by increasing the [[Load factor (computer science)|load factor]] that it can tolerate to a number greater than the 50% threshold of the basic algorithm. Some of these methods can also be used to reduce the failure rate of cuckoo hashing, causing rebuilds of the data structure to be much less frequent.
 
Generalizations of cuckoo hashing that use more than two alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%.<ref name="mitzenmacher2009survey">{{cite journal | last=Mitzenmacher | first=Michael | author-link=Michael Mitzenmacher |title=Some Open Questions Related to Cuckoo Hashing &#124; |journal=Proceedings of ESA 2009 | date=2009-09-09 |url=http://www.eecs.harvard.edu/~michaelm/postscripts/esa2009.pdf | access-date=2010-11-10 }}</ref>
 
Another generalization of cuckoo hashing, called ''blocked cuckoo hashing'' consists in usinguses more than one key per bucket and a balanced allocation scheme. Using just 2 keys per bucket permits a load factor above 80%.<ref>{{citationcite journal
| last1 = Dietzfelbinger | first1 = Martin
| last2 = Weidling | first2 = Christoph
Line 220 ⟶ 221:
| volume = 380
| year = 2007| doi-access = free
}}.</ref>
 
Another variation of cuckoo hashing that has been studied is ''cuckoo hashing with a stash''. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash. A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates.<ref>{{citationcite journal
| last1 = Kirsch | first1 = Adam
| last2 = Mitzenmacher | first2 = Michael D.
Line 233 ⟶ 234:
| title = More robust hashing: cuckoo hashing with a stash
| volume = 39
| year = 2010}}.</ref> The analysis of cuckoo hashing with a stash extends to practical hash functions, not just to the random hash function model commonly used in theoretical analysis of hashing.<ref>{{citationcite journal
| last1 = Aumüller | first1 = Martin
| last2 = Dietzfelbinger | first2 = Martin
Line 244 ⟶ 245:
| title = Explicit and efficient hash families suffice for cuckoo hashing with a stash
| volume = 70
| year = 2014| arxiv = 1204.4431| s2cid = 1888828
}}.</ref>
 
Some people recommend a simplified generalization of cuckoo hashing called [[CPU cache#Two-way skewed associative cache|skewed-associative cache]] in some [[CPU cache]]s.<ref>
Line 263 ⟶ 265:
 
==Comparison with related structures==
A study by Zukowski et al.<ref>{{cite journal | last=Zukowski | first=Marcin |author2=Heman, Sandor |author3=Boncz, Peter | title=Architecture-Conscious Hashing | publisherjournal=Proceedings of the International Workshop on Data Management on New Hardware (DaMoN) | date=June 2006 | url=https://www.cs.cmu.edu/~damon2006/pdf/zukowski06archconscioushashing.pdf | access-date=2008-10-16}}</ref> has shown that cuckoo hashing is much faster than [[Separate chaining|chained hashing]] for small, [[CPU cache|cache]]-resident hash tables on modern processors. Kenneth Ross<ref>{{cite journalreport | last=Ross | first=Kenneth | title=Efficient Hash Probes on Modern Processors | publisher=IBM |type=Research Report |id=RC24100 | date=2006-11-08 |url=http://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf | id=RC24100 | access-date=2008-10-16 }}</ref> has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis,<ref>{{Cite book | titlechapter=Fast and Compact Hash Tables for Integer Keys | first1=Nikolas | last1=Askitis | year=2009 | isbn=978-1-920682-72-9 | url=http://crpit.com/confpapers/CRPITV91Askitis.pdf | pages=113–122 | journaltitle=Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009) |url=http://crpit.com/confpapers/CRPITV91Askitis.pdf |volume=91 | access-date=2010-06-13 | archive-url=https://web.archive.org/web/20110216180225/http://crpit.com/confpapers/CRPITV91Askitis.pdf | archive-date=2011-02-16 | url-status=dead }}</ref>
with its performance compared against alternative hashing schemes.
 
A survey by [[Michael Mitzenmacher|Mitzenmacher]]<ref name="mitzenmacher2009survey" /> presents open problems related to cuckoo hashing as of 2009.
 
==Known users==
Cuckoo hashing is used in [[TikTok]]'s recommendation system to solve the problem of "embedding table collisions", which can result in reduced model quality.
The TikTok recommendation system "Monolith" takes advantage cuckoo hashing's collision resolution to prevent different concepts from being mapped to the same vectors.<ref>{{cite arXiv |vauthors=Liu Z, Zou L, Zou X, Wang C, Zhang B, Tang D, Zhu B, Zhu Y, Wu P, Wang K, Cheng Y|date=27 Sep 2022|title=Monolith: Real Time Recommendation System With Collisionless Embedding Table|arxiv=2209.07663|class=cs.IR}}</ref>
 
==See also==
Line 278 ⟶ 284:
 
==External links==
* [http://www.ru.is/faculty/ulfar/CuckooHash.pdf A cool and practical alternative to traditional hash tables] {{Webarchive|url=https://web.archive.org/web/20190407085017/http://www.ru.is/faculty/ulfar/CuckooHash.pdf |date=2019-04-07 }}, U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
* [http://www.itu.dk/people/pagh/papers/cuckoo-undergrad.pdf Cuckoo Hashing for Undergraduates, 2006], R. Pagh, 2006.
* [httphttps://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part.html Cuckoo Hashing, Theory and Practice] (Part 1, [httphttps://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part_15.html Part 2] and [httphttps://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part_19.html Part 3]), Michael Mitzenmacher, 2007.
* {{cite conference |last=Naor |first=Moni |author2=Segev, Gil |author3=Wieder, Udi |title=History-Independent Cuckoo Hashing |book-title=International Colloquium on Automata, Languages and Programming (ICALP) |place=Reykjavik, Iceland |year=2008 |url=http://www.wisdom.weizmann.ac.il/~naor/PAPERS/cuckoo_hi_abs.html |access-date=2008-07-21}}
* [http://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf Algorithmic Improvements for Fast Concurrent Cuckoo Hashing], X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.
Line 286 ⟶ 292:
===Examples===
*[https://github.com/efficient/libcuckoo Concurrent high-performance Cuckoo hashtable written in C++]
* [httphttps://sourceforge.net/projects/cuckoo-cpp/ Cuckoo hash map written in C++]
* [http://www.theiling.de/projects/lookuptable.html Static cuckoo hashtable generator for C/C++]
* [http://hackage.haskell.org/packages/archive/hashtables/latest/doc/html/Data-HashTable-ST-Cuckoo.html Cuckoo hash table written in Haskell]