Cuckoo hashing: Difference between revisions

Content deleted Content added
Tags: Reverted possible vandalism
Bender the Bot (talk | contribs)
m External links: HTTP to HTTPS for Blogspot
 
(30 intermediate revisions by 15 users not shown)
Line 8:
 
==Operations==
Sælir Reiknirit áfangi 2023!!
Cuckoo hashing is a form of [[open addressing]] in which each non-empty cell of a [[hash table]] contains a [[unique key|key]] or [[name–value pair|key–value pair]]. A [[hash function]] is used to determine the ___location for each key, and its presence in the table (or the value associated with it) can be found by examining that cell of the table. However, open addressing suffers from [[hash collision|collisions]], which happens when more than one key is mapped to the same cell.
The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.{{r|Cuckoo|p=121-122}}
 
===Lookup===
Cuckoo hashing uses two hash tables, <math>T_1</math> and <math>T_2</math>. Assuming <math>r</math> is the length of each table, the hash functions for the two tables is defined as, <math>h_1,\ h_2\ :\ \cupS \rightarrow \{0, ..., r-1\}</math> and <math>\forall x \in S</math> where <math>x</math> beis the key and <math>S</math> beis the set whose keys are stored in <math>h_1(x)</math> of <math>T_1</math> or <math>h_2(x)</math> of <math>T_2</math>. The lookup operation is as follows:{{r|Cuckoo|p=124}}
{|
|- style="vertical-align:top"
Line 24 ⟶ 23:
 
===Deletion===
Deletion is performed in <math>O(1)</math> time since thereprobing isn'tis involvementnot ofinvolved. probing—notThis consideringignores the cost of the shrinking operation if the table is too sparse.{{r|Cuckoo|p=124-125}}
 
===Insertion===
InsertionWhen ofinserting a new [[name–value pair|item]] with key <math>x</math>, the first step involves examining if the slot <math>h_1(x)</math> of the table <math>T_1</math> is occupied;. ifIf it is not, the item is inserted atin that cellslot. However, if the slot is occupied, the preoccupiedexisting item gets removed—let it be <math>x'</math>—and is removed and <math>x</math> is inserted at <math>T_1[h_1(x)]</math>. The removed itemThen, <math>x'</math> is inserted into the table <math>T_2</math> by following the same procedure;. theThe process continues until an empty position is found to insert the key.{{r|Cuckoo|p=124-125}} To avoid the possiblean [[infinite iterationloop]] in the process loop, a threshold <math>\text{Max-Loop}</math> is specified. suchIf thatthe ifnumber theof [[iterations]] exceeds thethis fixed [[Critical_value|threshold]], the hash tables—bothboth <math>T_1</math> and <math>T_2</math>—are are [[Hash_table#Resizing_by_moving_all_entries|rehashed]] with newernew hash functions and the insertion procedure repeats. FollowingThe isfollowing ais [[pseudocode]] for insertion:{{r|Cuckoo|p=125}}
{|
|- style="vertical-align:top"
Line 51 ⟶ 50:
19 '''end function'''
|}
On lines 10 and 15, the "cuckoo approach" of kicking other keys—whichkeys waswhich preoccupied atoccupy <math>T_{1,2}[h_{1,2}(x)]</math>—takes placerepeats until every key has its own "nest", i.e. the item <math>x</math> is inserted into aan spotempty onslot in either one of the two tables;. theThe notation <math>x\leftrightarrow y</math> expresses the process of [[wiktionary:swap|swapping]] <math>x</math> and <math>y</math>.{{r|Cuckoo|p=124-125}}
 
==Theory==
Line 58 ⟶ 57:
One method of proving this uses the theory of [[random graph]]s: one may form an [[undirected graph]] called the "cuckoo graph" that has a vertex for each hash table ___location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a [[pseudoforest]], a graph with at most one cycle in each of its [[connected component (graph theory)|connected component]]s. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a [[random graph]] in the [[Erdős–Rényi model]]. With high probability, for load factor less than 1/2 (corresponding to a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2), the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. The same theory also proves that the expected size of a [[Connected component (graph theory)|connected component]] of the cuckoo graph is small, ensuring that each insertion takes constant expected time. However, also with high probability, a load factor greater than 1/2 will lead to a [[giant component]] with two or more cycles, causing the data structure to fail and need to be resized.<ref>{{Cite conference|first=Reinhard|last=Kutzelnigg|url=https://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewFile/dmAG0133/1710.pdf|conference=Fourth Colloquium on Mathematics and Computer Science|title=Bipartite random graphs and cuckoo hashing|series=Discrete Mathematics and Theoretical Computer Science|year=2006|volume=AG|pages=403–406}}</ref>
 
Since a theoretical random hash function requires too much space for practical usage, an important theoretical question is which practical hash functions suffice for Cuckoo hashing. One approach is to use [[k-independent hashing]]. In 2009 it was shown<ref>Cohen, Jeffrey S., and Daniel M. Kane. "Bounds on the independence required for cuckoo hashing." ACM Transactions on Algorithms (2009).</ref> that <math>O(\log n)</math>-independence suffices, and at least 6-independence is needed. Another approach is to use [[Tabulationtabulation hashing]], which is not 6-independent, but was shown in 2012<ref>Pǎtraşcu, Mihai, and Mikkel Thorup. "The power of simple tabulation hashing." Journal of the ACM (JACM) 59.3 (2012): 1-50.</ref> to have other properties sufficient for Cuckoo hashing.
A third approach from 2014<ref>Aumüller, Martin, Martin Dietzfelbinger, and Philipp Woelfel. "Explicit and efficient hash families suffice for cuckoo hashing with a stash." Algorithmica 70.3 (2014): 428-456.</ref> is to slightly modify the cuckoo hashtable with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.
 
Line 64 ⟶ 63:
In practice, cuckoo hashing is about 20–30% slower than [[linear probing]], which is the fastest of the common approaches.<ref name="Cuckoo"/>
The reason is that cuckoo hashing often causes two cache misses per search, to check the two locations where a key might be stored, while linear probing usually causes only one cache miss per search.
However, because of its worst case guarantees on search time, cuckoo hashing can still be valuable when [[Real-time computing|real-time response rates]] are required. One advantage of cuckoo hashing is its link-list free property, which fits GPU processing well.
 
==Example==
The following hash functions are given (the two least significant digits of k in base 11):
 
<math>h\left(k\right)=k\bmod 11</math><br/>
<math>h'\left(k\right)=\left\lfloor\frac{k}{11}\right\rfloor\bmod 11</math>
 
The following two tables show the insertion of some example elements. Each column corresponds to the state of the two hash tables over time. The possible insertion locations for each new value are highlighted. The last column illustrates a failed insertion due to a cycle, details below.
 
<div style="float:left; margin-right:1em">
{| class="wikitable"
|+ 1.Table table1: foruses h(k)
|-
|colspan=2|
!colspan=10| Key insertedSteps
|-
!colspan=2| kStep number
| 201 || 502 || 533 || 754 || 1005 || 676 || 1057 || 38 || 369 || 3910
|-
!colspan=2| Key inserted
| 53 || 50 || 20 || 75 || 100 || 67 || 105 || 3 || 36 || 45
|-
!colspan=2| h(k)
| 9 || 6 || 9 || 9 || 1 || 1 || 6 || 3 || 3 || 61
|-
! rowspan=11 {{vert header|va=middle|Hash table entries}}
Line 92 ⟶ 94:
|-
! 1
| || || || ||bgcolor=#ffffcc| 100 ||bgcolor=#ffffcc| 67 || 67 || 67 || 67 ||bgcolor=#ffffcc| 10045
|-
! 2
Line 107 ⟶ 109:
|-
! 6
| ||bgcolor=#ffffcc| 50 || 50 || 50 || 50 || 50 ||bgcolor=#ffffcc| 105 || 105 || 105 ||bgcolor=#ffffcc| 39105
|-
! 7
Line 116 ⟶ 118:
|-
! 9
|bgcolor=#ffffcc| 2053 || 2053 || bgcolor="#ffffcc" | 20 ||bgcolor=#ffffcc| 75 || 75 || 75 || 53 || 53 || 53 || 7553
|-
! 10
Line 125 ⟶ 127:
<div style="float:left">
{| class="wikitable"
|+ 2.Table table2: foruses h′(k)
|-
|colspan=2|
!colspan=10| Key insertedSteps
|-
!colspan=2| kStep number
| 201 || 502 || 533 || 754 || 1005 || 676 || 1057 || 38 || 369 || 3910
|-
!colspan=2| Key inserted
| 53 || 50 || 20 || 75 || 100 || 67 || 105 || 3 || 36 || 45
|-
!colspan=2| h′(k)
| 14 || 4 || 41 || 6 || 9 || 6 || 9 || 0 || 3 || 34
|-
! rowspan=11 {{vert header|va=middle|Hash table entries}}
Line 141 ⟶ 146:
|-
! 1
|bgcolor=#ffffcc| || ||bgcolor=#ffffcc| || 20 || 20 || 20 || 20 || 20 || 20 || 20
|-
! 2
Line 147 ⟶ 152:
|-
! 3
| || || || || || || || ||bgcolor=#ffffcc| ||bgcolor=#ffffcc|
|-
! 4
| ||bgcolor=#ffffcc| ||bgcolor=#ffffcc| || 53 || 53 || 53 || 53 || 50 || 50 || 50 ||bgcolor=#ffffcc| 5350
|-
! 5
Line 156 ⟶ 161:
|-
! 6
| || || ||bgcolor=#ffffcc| || ||bgcolor=#ffffcc| || 75 || 75 || 75 || 6775
|-
! 7
Line 165 ⟶ 170:
|-
! 9
| || || || ||bgcolor=#ffffcc| || 100 ||bgcolor=#ffffcc| 100 || 100 || 100 || 105100
|-
! 10
Line 174 ⟶ 179:
 
===Cycle===
If you now attempt to insert the element 645, then you get into a cycle, and fail. In the last row of the table we find the same initial situation as at the beginning again.
 
<math>h\left(645\right)=645\bmod 11=61</math><br/>
<math>h'\left(645\right)=\left\lfloor\frac{645}{11}\right\rfloor\bmod 11=04</math><br/>
{| class="wikitable"
|-
! tableTable 1
! tableTable 2
|-
| 645 replaces 5067 in cell 61 || 5067 replaces 5375 in cell 46
|-
| 5375 replaces 7553 in cell 9 || 7553 replaces 6750 in cell 64
|-
| 6750 replaces 100105 in cell 16 || 100105 replaces 105100 in cell 9
|-
| 105100 replaces 645 in cell 61 || 645 replaces 353 in cell 04
|-
| 353 replaces 3675 in cell 39 || 3675 replaces 3967 in cell 36
|-
| 3967 replaces 105100 in cell 61 || 105100 replaces 100105 in cell 9
|-
| 100105 replaces 6750 in cell 16 || 6750 replaces 7545 in cell 64
|-
| 7545 replaces 5367 in cell 91 || 5367 replaces 5075 in cell 46
|-
| 50 replaces 39 in cell 6 || 39 replaces 36 in cell 3
|-
| 36 replaces 3 in cell 3 || 3 replaces 6 in cell 0
|-
| 6 replaces 50 in cell 6 || 50 replaces 53 in cell 4
|}
 
Line 210 ⟶ 209:
 
Generalizations of cuckoo hashing that use more than two alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%.<ref name="mitzenmacher2009survey">{{cite journal | last=Mitzenmacher | first=Michael | author-link=Michael Mitzenmacher |title=Some Open Questions Related to Cuckoo Hashing |journal=Proceedings of ESA 2009 | date=2009-09-09 |url=http://www.eecs.harvard.edu/~michaelm/postscripts/esa2009.pdf | access-date=2010-11-10 }}</ref>
 
Another generalization of cuckoo hashing, called ''blocked cuckoo hashing'' consists in usinguses more than one key per bucket and a balanced allocation scheme. Using just 2 keys per bucket permits a load factor above 80%.<ref>{{cite journal
| last1 = Dietzfelbinger | first1 = Martin
| last2 = Weidling | first2 = Christoph
Line 270:
A survey by [[Michael Mitzenmacher|Mitzenmacher]]<ref name="mitzenmacher2009survey" /> presents open problems related to cuckoo hashing as of 2009.
 
==ApplicationsKnown users==
Cuckoo Hashinghashing is used in [[TikTok]]'s recommendation system to solve the problem of "embedding table collisions", which can result in reduced model quality.
The TikTok recommendation system "Monolith" takes advantage cuckoo hashing's collision resolution to prevent different concepts from being mapped to the same vectors.<ref>{{Citecite webarXiv |vauthors=Liu Z, Zou L, Zou X, Wang C, Zhang B, Tang D, Zhu B, Zhu Y, Wu P, Wang K, Cheng Y|date=27 Sep 2022|title=Monolith: TheReal Time Recommendation System BehindWith TikTokCollisionless |url=https://gantry.io/blog/papers-to-know-20230110/Embedding Table|access-datearxiv=2023-05-30 |website=gantry2209.io 07663|languageclass=encs.IR}}</ref>
 
==See also==
Line 284:
 
==External links==
* [http://www.ru.is/faculty/ulfar/CuckooHash.pdf A cool and practical alternative to traditional hash tables] {{Webarchive|url=https://web.archive.org/web/20190407085017/http://www.ru.is/faculty/ulfar/CuckooHash.pdf |date=2019-04-07 }}, U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
* [http://www.itu.dk/people/pagh/papers/cuckoo-undergrad.pdf Cuckoo Hashing for Undergraduates, 2006], R. Pagh, 2006.
* [httphttps://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part.html Cuckoo Hashing, Theory and Practice] (Part 1, [httphttps://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part_15.html Part 2] and [httphttps://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part_19.html Part 3]), Michael Mitzenmacher, 2007.
* {{cite conference |last=Naor |first=Moni |author2=Segev, Gil |author3=Wieder, Udi |title=History-Independent Cuckoo Hashing |book-title=International Colloquium on Automata, Languages and Programming (ICALP) |place=Reykjavik, Iceland |year=2008 |url=http://www.wisdom.weizmann.ac.il/~naor/PAPERS/cuckoo_hi_abs.html |access-date=2008-07-21}}
* [http://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf Algorithmic Improvements for Fast Concurrent Cuckoo Hashing], X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.
Line 292:
===Examples===
*[https://github.com/efficient/libcuckoo Concurrent high-performance Cuckoo hashtable written in C++]
* [httphttps://sourceforge.net/projects/cuckoo-cpp/ Cuckoo hash map written in C++]
* [http://www.theiling.de/projects/lookuptable.html Static cuckoo hashtable generator for C/C++]
* [http://hackage.haskell.org/packages/archive/hashtables/latest/doc/html/Data-HashTable-ST-Cuckoo.html Cuckoo hash table written in Haskell]