Content deleted Content added
Mastergreg82 (talk | contribs) |
Citation bot (talk | contribs) Removed URL that duplicated identifier. Removed access-date with no URL. Removed parameters. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 128/990 |
||
(42 intermediate revisions by 30 users not shown) | |||
Line 4:
{{about|a computer programming construct|other meanings of "hash" and "hashing"|Hash (disambiguation)}}
{{More citations needed|date=July 2010}}
[[File:Hash table 4 1 1 0 0 1 0 LL.svg|thumb|240px|right|A hash function that maps names to integers from 0 to 15. There is a
A '''hash function''' is any [[Function (mathematics)|function]] that can be used to map [[data (computing)|data]] of arbitrary size to fixed-size values, though there are some hash functions that support variable
*{{cite web|url=https://csrc.nist.gov/glossary/term/hash_digest|title=hash digest|publisher=[[NIST]] |website=Computer Security Resource Center - Glossary}} *{{cite Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval. They require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally- and storage
Use of hash functions relies on statistical properties of key and function interaction: worst-case
Hash functions are related to (and often confused with) [[checksums]], [[check digit]]s, [[Fingerprint (computing)|fingerprints]], [[lossy compression]], [[randomization function]]s, [[Error correction code|error-correcting codes]], and [[cipher]]s. Although the concepts overlap to some extent, each one has its own uses and requirements and is designed and optimized differently. The hash function differs from these concepts mainly in terms of [[data integrity]]. Hash tables may use [[non-cryptographic hash function]]s, while [[cryptographic hash function]]s are used in cybersecurity to secure sensitive data such as passwords.
== Overview ==
A hash function may be considered to perform three functions:
*Convert variable-length keys into fixed
*Scramble the bits of the key so that the resulting values are uniformly distributed over the [[Key space (cryptography)|keyspace
*Map the key values into ones less than or equal to the size of the table.
A good hash function satisfies two basic properties:
== Hash tables ==
{{main|Hash table}}
Hash functions are used in conjunction with [[hash tables]] to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code, which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, then some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or
=== Specialized uses ===
Hash functions are also used to build [[cache (computing)|caches]] for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items.<ref>{{Cite web|last=Stokes|first=Jon|date=2002-07-08|title=Understanding CPU caching and performance|url=https://arstechnica.com/gadgets/reviews/2002/07/caching.ars|access-date=2022-02-06|website=Ars Technica|language=en-us}}</ref>
Hash functions are an essential ingredient of the [[Bloom filter]], a space-efficient [[Probability|probabilistic]] [[data structure]] that is used to test whether an [[element (mathematics)|element]] is a member of a [[set (computer science)|set]].
A special case of hashing is known as
Hash tables are also used to implement [[associative array]]s and [[set (abstract data type)|dynamic sets]].<ref name="handbook_of_applied_cryptography">{{cite book|title=Handbook of Applied Cryptography|first1=Alfred J.|last1=Menezes|first2=Paul C.|last2=van Oorschot|first3=Scott A|last3=Vanstone|year=1996|publisher=CRC Press|isbn=978-0849385230|url-access=registration|url=https://archive.org/details/handbookofapplie0000mene}}</ref>
Line 42 ⟶ 43:
=== Uniformity ===
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same [[probability]]. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of ''collisions''—pairs of inputs that are mapped to the same hash value—increases. If some hash values are more likely to occur than others, then a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
This criterion only requires the value to be ''uniformly distributed'', not ''random'' in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.
Line 48 ⟶ 49:
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
In other words, if a typical set of {{math|''m''}} records is hashed to {{math|''n''}} table slots, then the probability of a bucket receiving many more than {{math|''m''/''n''}} records should be vanishingly small. In particular, if {{
In special cases when the keys are known in advance and the key set is static, a hash function can be found that achieves absolute (or collisionless) uniformity. Such a hash function is said to be ''[[Perfect hash function|perfect]]''. There is no algorithmic way of constructing such a
=== Testing and measurement ===
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the [[chi-squared test]]. This test is a goodness-of-fit measure: it
<math>\frac{\sum_{j=0}^{m-1}(b_j)(b_j+1)/2}{(n/2m)(n+2m-1)},</math>
where
A ratio within one confidence interval (such as 0.95
Hash functions can have some technical properties that make it more likely that they
=== Efficiency ===
In data storage and retrieval applications, the use of a hash function is a trade-off between search time and data storage space. If search time were unbounded, then a very compact unordered linear list would be the best medium; if storage space were unbounded, then a randomly accessible structure indexable by the key-value would be very large
Computational complexity varies with the number of instructions required and latency of individual instructions, with the simplest being the bitwise methods (folding), followed by the multiplicative methods, and the most complex (slowest) are the division-based methods.
Because collisions should be infrequent, and cause a marginal delay but are otherwise harmless, it
Division-based implementations can be of particular concern because
We can allow the table size {{math|''n''}} to not be a power of
If keys are being hashed repeatedly, and the hash function is costly, then computing time can be saved by precomputing the hash codes and storing them with the keys. Matching hash codes almost certainly means that the keys are identical. This technique is used for the transposition table in game-playing programs, which stores a 64-bit hashed representation of the board position.
=== Universality ===
{{main|Universal hashing}}
A ''universal hashing'' scheme is a [[randomized algorithm]] that selects a
===Applicability===
A hash function that allows only certain table sizes
A hash function is applicable in a variety of situations. Particularly within cryptography, notable applications include:<ref>{{Citation |last1=Wagner |first1=Urs |title=Hash Functions |date=2023 |work=Trends in Data Protection and Encryption Technologies |pages=21–24 |editor-last=Mulder |editor-first=Valentin |place=Cham |publisher=Springer Nature Switzerland |language=en |doi=10.1007/978-3-031-33386-6_5 |isbn=978-3-031-33386-6 |last2=Lugrin |first2=Thomas |editor2-last=Mermoud |editor2-first=Alain |editor3-last=Lenders |editor3-first=Vincent |editor4-last=Tellenbach |editor4-first=Bernhard|doi-access=free }}</ref>
* [[File verification|Integrity
* [[Key derivation function|Key derivation]]: Minor input changes result in a random-looking output alteration, known as the diffusion property. Thus, hash functions are valuable for key derivation functions.
* [[Message
* Password storage: The password's hash value
* [[Digital signature|Signatures]]: Message hashes are signed rather than the whole message.
=== Deterministic ===
A hash procedure must be [[deterministic algorithm|deterministic]]
The determinism is in the context of the reuse of the function. For example, [[Python (programming language)|Python]] adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts in addition to the input to be hashed.<ref>{{Cite web|url=https://docs.python.org/3/reference/datamodel.html#object.__hash__|title=3. Data model — Python 3.6.1 documentation|website=docs.python.org|access-date=2017-03-24}}</ref> The Python hash ([[SipHash]]) is still a valid hash function when used within a single run
=== Defined range ===
It is often desirable that the output of a hash function have fixed size (but see below). If, for example, the output is constrained to 32-bit integer values, then the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches.<ref name="algorithms_in_java">{{cite book|title=Algorithms in Java|first=Robert|last=Sedgewick|year=2002|publisher=Addison Wesley|edition=3|isbn=978-0201361209|chapter=14. Hashing}}</ref> Producing fixed-length output from variable-length input can be accomplished by breaking the input data into chunks of specific size. Hash functions used for data searches use some arithmetic expression that iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.<ref name="algorithms_in_java"/>
=== Variable range ===
Line 107 ⟶ 108:
When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table.
A hash function that will relocate the minimum number of records when the table is resized is desirable. What is needed is a hash function {{math|''H''(''z'',''n'')}} (where {{math|''z''}} is the key being hashed and {{math|''n''}} is the number of allowed hash values) such that {{math|1=''H''(''z'',''n'' + 1) = ''H''(''z'',''n'')}} with probability close to {{math|''n''/(''n'' + 1)}}.
[[Linear hashing]] and [[spiral hashing]] are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property. [[Extendible hashing]] uses a dynamic hash function that requires space proportional to {{math|''n''}} to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to {{math|''n''}} to compute the value of {{math|''H''(''z'',''n'')}} have been invented.{{clarify|date=September 2019}}
Line 121:
=== Identity hash function ===
If the data to be hashed is small enough, then one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this ''[[identity function|identity]]'' hash function is effectively zero. This hash function is [[Perfect hash function|perfect]], as it maps each input to a distinct hash value.
The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in [[Java (programming language)|Java]], the hash code is a 32-bit integer. Thus the 32-bit integer <code>Integer</code> and 32-bit floating-point <code>Float</code> objects can simply use the value directly
Other types of data can also use this hashing scheme. For example, when mapping [[character string]]s between [[letter case|upper and lower case]], one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in [[extended ASCII]]<ref group=Notes>Plain [[ASCII]] is a 7-bit character encoding, although it is often stored in 8-bit bytes with the highest-order bit always clear (zero). Therefore, for plain ASCII, the bytes have only 2<sup>7</sup> = 128 valid values, and the character translation table has only this many entries.</ref> or [[ISO Latin 1]]), the table has only 2<sup>8</sup> = 256 entries; in the case of [[Unicode]] characters, the table would have
The same technique can be used to map [[ISO 3166-1 alpha-2|two-letter country codes]] like "us" or "za" to country names (26<sup>2</sup> = 676 table entries), 5-digit
=== Trivial hash function ===
If the keys are uniformly or sufficiently uniformly distributed over the key space, so that the key values are essentially random, then they may be considered to be already
=== Mid-squares ===
A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. For example, if the input is
=== Division hashing ===
A standard technique is to use a modulo function on the key, by selecting a divisor
=== Algebraic coding ===
Algebraic coding is a variant of the division method of hashing which uses division by a polynomial modulo 2 instead of an integer to map {{Mvar|n}} bits to {{Mvar|m}} bits.<ref name="knuth-1973" />{{rp|pages=512–513}} In this approach,
Let
Define <math>P(x)=\prod_{j\in S}(x-\alpha^j)</math> where
The usual outcome is that either
=== Unique permutation hashing ===
=== Multiplicative hashing ===
Standard multiplicative hashing uses the formula {{Math|1=''h''<
<syntaxhighlight lang="c">
unsigned hash(unsigned K)▼
unsigned hash(unsigned K) {
return (a * K) >> (w - m);
</syntaxhighlight>
and for fixed
Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion—higher-value input bits do not affect lower-value output bits.<ref>
{{cite web |url=
<syntaxhighlight lang="c">
▲ unsigned hash(unsigned K)
K ^= K >> (w - m);
return (a * K) >> (w - m);
</syntaxhighlight>
=== Fibonacci hashing ===
[[Fibonacci number|Fibonacci]] hashing is a form of multiplicative hashing in which the multiplier is {{Mvar|2<
*16: ''a'' = <span style="display: inline-block; width: 11em">9E37<sub>16</sub></span> = {{val|40503}}<sub>10</sub>
*32: ''a'' = <span style="display: inline-block; width: 11em">{{gaps|9E37|79B9}}<sub>16</sub></span> = {{val|2654435769}}<sub>10</sub>
Line 179 ⟶ 177:
*64: ''a'' = <span style="display: inline-block; width: 11em">{{gaps|9E37|79B9|7F4A|7C15}}<sub>16</sub></span> = {{val|11400714819323198485}}<sub>10</sub>
{{^|256+ bits is 0x9E3779B97F4A7C15F39CC0605CEDC8341082276BF3A27251F86C6A11D0C18E95.2767F..., if anyone needs it}}
The multiplier should be odd, so the least significant bit of the output is invertible modulo {{Math|2<
=== Zobrist hashing ===
{{main | Tabulation hashing|Zobrist hashing}}
[[Tabulation hashing]], more generally known as ''Zobrist hashing'' after [[Albert Lindsey Zobrist|Albert Zobrist]]
Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64×12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation)
Later, the method was extended to hashing integers by representing each byte in each of 4 possible positions in the word by a unique 32-bit random number. Thus, a table of 2<sup>8</sup>×4
This kind of function has some nice theoretical properties, one of which is called ''3-tuple independence'', meaning that every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values.
===
A hash function can be designed to exploit existing entropy in the keys. If the keys have leading or trailing zeros, or particular fields that are unused, always zero or some other constant, or generally vary little, then masking out only the volatile bits and hashing on those will provide a better and possibly faster hash function. Selected divisors or multipliers in the division and multiplicative schemes may make more uniform hash functions if the keys are cyclic or have other redundancies.
Line 198 ⟶ 196:
=== Middle and ends ===
Simplistic hash functions may add the first and last {{math|''n''}} characters of a string along with the length, or form a word-size hash from the middle 4 characters of a string. This saves iterating over the (potentially long) string, but hash functions that do not hash on all characters of a string can readily become linear due to redundancies, clustering, or other pathologies in the key set. Such strategies may be effective as a custom hash function if the structure of the keys is such that either the middle, ends, or other fields are zero or some other invariant constant that does not differentiate the keys; then the invariant parts of the keys can be ignored.
=== Character folding ===
The paradigmatic example of folding by characters is to add up the integer values of all the characters in the string. A better idea is to multiply the hash total by a constant, typically a sizable prime number, before adding in the next character, ignoring overflow. Using exclusive
The classic approach, dubbed the [[PJW hash function|PJW hash]] based on the work of [[Peter
Today, especially with the advent of 64-bit word sizes, much more efficient variable-length string hashing by word chunks is available.
Line 210 ⟶ 207:
=== Word length folding ===
{{See also|Universal hashing#Hashing strings}}
Modern microprocessors will allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32
=== Radix conversion hashing ===
Analogous to the way an ASCII or [[EBCDIC]] character string representing a decimal number is converted to a numeric quantity for computing, a variable
=== Rolling hash ===
{{Main|Rolling hash}}
{{See also|Linear congruential generator}}
In some applications, such as [[string searching algorithm|substring search]], one can compute a hash function {{math|''h''}} for every {{math|''k''}}-character [[substring]] of a given {{math|''n''}}-character string by advancing a window of width {{math|''k''}} characters along the string
The most familiar algorithm of this type is [[Rabin-Karp]] with best and average case performance {{math|''O''(''n''+''mk'')}} and worst case {{math|''O''(''n''·''k'')}} (in all fairness, the worst case here is gravely pathological: both the text string and substring are composed of a repeated single character, such as {{math|''t''}}="AAAAAAAAAAA", and {{math|''s''}}="AAA"). The hash function used for the algorithm is usually the [[Rabin fingerprint]], designed to avoid collisions in 8-bit character strings, but other suitable hash functions are also used.
Line 229 ⟶ 226:
== Analysis ==
Worst case
While [[Donald Knuth|Knuth]] worries about adversarial attack on real time systems,<ref>{{cite book |last=Knuth |first=Donald E. |author-link=Donald Knuth |date=1975 |title=The Art of Computer Programming, Vol. 3, Sorting and Searching |page=540 |publisher=[[Addison-Wesley]] |___location=Reading, MA}}</ref> Gonnet has shown that the probability of such a case is "ridiculously small". His representation was that the probability of {{math|''k''}} of {{math|''n''}} keys mapping to a single slot is
== History ==
The term ''hash'' offers a natural analogy with its non-technical meaning (to chop up or make a mess out of something), given how hash functions scramble their input data to derive their output.<ref name="knuth-2000">{{cite book|last1=Knuth|first1=Donald E.|author-link=Donald Knuth|title=The Art of Computer Programming, Vol. 3, Sorting and Searching|date=2000|publisher=Addison-Wesley|___location=Boston [u.a.]|isbn=978-0-201-89685-5|edition=2. ed., 6. printing, newly updated and rev.}}</ref>{{rp|page=514}} In his research for the precise origin of the term, [[Donald Knuth]] notes that, while [[Hans Peter Luhn]] of [[IBM]] appears to have been the first to use the concept of a hash function in a memo dated January 1953, the term itself
== See also ==
Line 241 ⟶ 238:
*[[List of hash functions]]
*[[Nearest neighbor search]]
*[[Distributed hash table]]
*[[Identicon]]
Line 256 ⟶ 252:
== External links ==
{{Wiktionary|hash}}
*[http://www.sinfocol.org/archivos/2009/11/Goulburn06.pdf The Goulburn Hashing Function] ([[Portable Document Format|PDF]]) by Mayur Patel
*[https://dspace5.zcu.cz/bitstream/11025/11784/1/Skala_2010_Corfu-NAUN-Hash.pdf Hash Function Construction for Textual and Geometrical Data Retrieval] ([[Portable Document Format|PDF]]) Latest Trends on Computers, Vol.2, pp. 483–489, CSCC Conference, Corfu, 2010
|