Content deleted Content added
Reverted 1 edit by Turkdevops (talk) |
m →External links: HTTP to HTTPS for Blogspot |
||
(47 intermediate revisions by 28 users not shown) | |||
Line 2:
[[Image:Cuckoo hashing example.svg|thumb|Cuckoo hashing example. The arrows show the alternative ___location of each key. A new item would be inserted in the ___location of A by moving A to its alternative ___location, currently occupied by B, and moving B to its alternative ___location which is currently vacant. Insertion of a new item in the ___location of H would not succeed: Since H is part of a cycle (together with W), the new item would get kicked out again.]]
'''Cuckoo hashing''' is a scheme in [[computer programming]] for resolving [[hash collision]]s of values of [[hash function]]s in a [[hash table|table]], with [[Best, worst and average case|worst-case]] [[constant time|constant]] lookup time. The name derives from the behavior of some species of [[cuckoo]], where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as [[brood parasitism]]; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different ___location in the table.
==History==
Cuckoo hashing was first described by [[Rasmus Pagh]] and [[Flemming Friche Rodler]] in a 2001 conference paper.<ref name="Cuckoo">{{Cite book |
==Operations==
Cuckoo hashing is a form of [[open addressing]] in which each non-empty cell of a [[hash table]] contains a [[unique key|key]] or [[
The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.{{r|Cuckoo|p=121-122}}
===Lookup===
Cuckoo hashing uses two hash tables, <math>T_1</math> and <math>T_2</math>.
{|
|- style="vertical-align:top"
Line 23:
===Deletion===
Deletion is performed in <math>O(1)</math> time since
===Insertion===
{|
|- style="vertical-align:top"
Line 34:
3 '''return'''
4 '''end if'''
5 '''
6 '''if''' <math>T_1[h_1(x)]</math> = <math>\bot</math> '''then'''
7 <math>T_1[h_1(x)]</math> := x
Line 42:
11 '''if''' <math>T_2[h_2(x)]</math> = <math>\bot</math> '''then'''
12 <math>T_2[h_2(x)]</math> := x
13 '''return'''
14 '''end if'''
15 x <math>\leftrightarrow T_2[h_2(x)]</math>
16 '''
17 rehash()
18 insert(x)
19 '''end function'''
|}
On lines 10 and 15, the "cuckoo approach" of kicking other
==Theory==
Line 57:
One method of proving this uses the theory of [[random graph]]s: one may form an [[undirected graph]] called the "cuckoo graph" that has a vertex for each hash table ___location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a [[pseudoforest]], a graph with at most one cycle in each of its [[connected component (graph theory)|connected component]]s. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a [[random graph]] in the [[Erdős–Rényi model]]. With high probability, for load factor less than 1/2 (corresponding to a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2), the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. The same theory also proves that the expected size of a [[Connected component (graph theory)|connected component]] of the cuckoo graph is small, ensuring that each insertion takes constant expected time. However, also with high probability, a load factor greater than 1/2 will lead to a [[giant component]] with two or more cycles, causing the data structure to fail and need to be resized.<ref>{{Cite conference|first=Reinhard|last=Kutzelnigg|url=https://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewFile/dmAG0133/1710.pdf|conference=Fourth Colloquium on Mathematics and Computer Science|title=Bipartite random graphs and cuckoo hashing|series=Discrete Mathematics and Theoretical Computer Science|year=2006|volume=AG|pages=403–406}}</ref>
Since a theoretical random hash function requires too much space for practical usage, an important theoretical question is which practical hash functions suffice for Cuckoo hashing. One approach is to use [[k-independent hashing]]. In 2009 it was shown<ref>Cohen, Jeffrey S., and Daniel M. Kane. "Bounds on the independence required for cuckoo hashing." ACM Transactions on Algorithms (2009).</ref> that <math>O(\log n)</math>-independence suffices, and at least 6-independence is needed. Another approach is to use [[
A third approach from 2014<ref>Aumüller, Martin, Martin Dietzfelbinger, and Philipp Woelfel. "Explicit and efficient hash families suffice for cuckoo hashing with a stash." Algorithmica 70.3 (2014): 428-456.</ref> is to slightly modify the cuckoo hashtable with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.
Line 63:
In practice, cuckoo hashing is about 20–30% slower than [[linear probing]], which is the fastest of the common approaches.<ref name="Cuckoo"/>
The reason is that cuckoo hashing often causes two cache misses per search, to check the two locations where a key might be stored, while linear probing usually causes only one cache miss per search.
However, because of its worst case guarantees on search time, cuckoo hashing can still be valuable when [[Real-time computing|real-time response rates]] are required
==Example==
The following hash functions are given (the two least significant digits of k in base 11):
<math>h\left(k\right)=k\bmod 11</math><br/>
<math>h'\left(k\right)=\left\lfloor\frac{k}{11}\right\rfloor\bmod 11</math>
The following two tables show the insertion of some example elements. Each column corresponds to the state of the two hash tables over time. The possible insertion locations for each new value are highlighted. The last column illustrates a failed insertion due to a cycle, details below.
<div style="float:left; margin-right:1em">
{| class="wikitable"
|+
|-
|colspan=2|
!colspan=10|
|-
!colspan=2|
|
|-▼
!colspan=2| Key inserted
| 53 || 50 || 20 || 75 || 100 || 67 || 105 || 3 || 36 || 45
|-
!colspan=2| h(k)
| 9 || 6 || 9 || 9 || 1 || 1 || 6 || 3 || 3 ||
|-
! rowspan=11 {{vert header|va=middle|Hash table entries}}
Line 91 ⟶ 94:
|-
! 1
| || || || ||bgcolor=#ffffcc| 100 ||bgcolor=#ffffcc| 67 || 67 || 67 || 67 ||bgcolor=#ffffcc|
|-
! 2
Line 106 ⟶ 109:
|-
! 6
| ||bgcolor=#ffffcc| 50 || 50 || 50 || 50 || 50 ||bgcolor=#ffffcc| 105 || 105 || 105 |
|-
! 7
Line 115 ⟶ 118:
|-
! 9
|bgcolor=#ffffcc|
|-
! 10
Line 124 ⟶ 127:
<div style="float:left">
{| class="wikitable"
|+
|-
|colspan=2|
!colspan=10|
|-
!colspan=2|
|
|-▼
!colspan=2| Key inserted
| 53 || 50 || 20 || 75 || 100 || 67 || 105 || 3 || 36 || 45
|-
!colspan=2| h′(k)
|
|-
! rowspan=11 {{vert header|va=middle|Hash table entries}}
Line 140 ⟶ 146:
|-
! 1
|
|-
! 2
Line 146 ⟶ 152:
|-
! 3
| || || || || || || || ||bgcolor=#ffffcc| |
|-
! 4
|-
! 5
Line 155 ⟶ 161:
|-
! 6
| || || ||bgcolor=#ffffcc| || ||bgcolor=#ffffcc| || 75 || 75 || 75 ||
|-
! 7
Line 164 ⟶ 170:
|-
! 9
| || || || ||bgcolor=#ffffcc| || 100 ||bgcolor=#ffffcc| 100 || 100 || 100 ||
|-
! 10
Line 173 ⟶ 179:
===Cycle===
If you
<math>h\left(
<math>h'\left(
{| class="wikitable"
|-
!
!
|-
|
|-
|
|-
|
|-
|
|-
|
|-
|
▲|-
▲|-
|-
|
|-
|
|}
Line 208:
Several variations of cuckoo hashing have been studied, primarily with the aim of improving its space usage by increasing the [[Load factor (computer science)|load factor]] that it can tolerate to a number greater than the 50% threshold of the basic algorithm. Some of these methods can also be used to reduce the failure rate of cuckoo hashing, causing rebuilds of the data structure to be much less frequent.
Generalizations of cuckoo hashing that use more than two alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%.<ref name="mitzenmacher2009survey">{{cite journal | last=Mitzenmacher | first=Michael | author-link=Michael Mitzenmacher |title=Some Open Questions Related to Cuckoo Hashing
Another generalization of cuckoo hashing
| last1 = Dietzfelbinger | first1 = Martin
| last2 = Weidling | first2 = Christoph
Line 220 ⟶ 221:
| volume = 380
| year = 2007| doi-access = free
}}
Another variation of cuckoo hashing that has been studied is ''cuckoo hashing with a stash''. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash. A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates.<ref>{{
| last1 = Kirsch | first1 = Adam
| last2 = Mitzenmacher | first2 = Michael D.
Line 233 ⟶ 234:
| title = More robust hashing: cuckoo hashing with a stash
| volume = 39
| year = 2010}}
| last1 = Aumüller | first1 = Martin
| last2 = Dietzfelbinger | first2 = Martin
Line 244 ⟶ 245:
| title = Explicit and efficient hash families suffice for cuckoo hashing with a stash
| volume = 70
| year = 2014| arxiv = 1204.4431| s2cid = 1888828
}} Some people recommend a simplified generalization of cuckoo hashing called [[CPU cache#Two-way skewed associative cache|skewed-associative cache]] in some [[CPU cache]]s.<ref>
Line 263 ⟶ 265:
==Comparison with related structures==
A study by Zukowski et al.<ref>{{cite journal |
with its performance compared against alternative hashing schemes.
A survey by [[Michael Mitzenmacher|Mitzenmacher]]<ref name="mitzenmacher2009survey" /> presents open problems related to cuckoo hashing as of 2009.
==Known users==
Cuckoo hashing is used in [[TikTok]]'s recommendation system to solve the problem of "embedding table collisions", which can result in reduced model quality.
The TikTok recommendation system "Monolith" takes advantage cuckoo hashing's collision resolution to prevent different concepts from being mapped to the same vectors.<ref>{{cite arXiv |vauthors=Liu Z, Zou L, Zou X, Wang C, Zhang B, Tang D, Zhu B, Zhu Y, Wu P, Wang K, Cheng Y|date=27 Sep 2022|title=Monolith: Real Time Recommendation System With Collisionless Embedding Table|arxiv=2209.07663|class=cs.IR}}</ref>
==See also==
Line 278 ⟶ 284:
==External links==
* [http://www.ru.is/faculty/ulfar/CuckooHash.pdf A cool and practical alternative to traditional hash tables] {{Webarchive|url=https://web.archive.org/web/20190407085017/http://www.ru.is/faculty/ulfar/CuckooHash.pdf |date=2019-04-07 }}, U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
* [http://www.itu.dk/people/pagh/papers/cuckoo-undergrad.pdf Cuckoo Hashing for Undergraduates, 2006], R. Pagh, 2006.
* [
* {{cite conference |last=Naor |first=Moni |author2=Segev, Gil |author3=Wieder, Udi |title=History-Independent Cuckoo Hashing |book-title=International Colloquium on Automata, Languages and Programming (ICALP) |place=Reykjavik, Iceland |year=2008 |url=http://www.wisdom.weizmann.ac.il/~naor/PAPERS/cuckoo_hi_abs.html |access-date=2008-07-21}}
* [http://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf Algorithmic Improvements for Fast Concurrent Cuckoo Hashing], X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.
Line 286 ⟶ 292:
===Examples===
*[https://github.com/efficient/libcuckoo Concurrent high-performance Cuckoo hashtable written in C++]
* [
* [http://www.theiling.de/projects/lookuptable.html Static cuckoo hashtable generator for C/C++]
* [http://hackage.haskell.org/packages/archive/hashtables/latest/doc/html/Data-HashTable-ST-Cuckoo.html Cuckoo hash table written in Haskell]
|