Hash function: Difference between revisions

Content deleted Content added
m ce
Citation bot (talk | contribs)
Removed URL that duplicated identifier. Removed access-date with no URL. Removed parameters. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 128/990
 
(28 intermediate revisions by 24 users not shown)
Line 4:
{{about|a computer programming construct|other meanings of "hash" and "hashing"|Hash (disambiguation)}}
{{More citations needed|date=July 2010}}
[[File:Hash table 4 1 1 0 0 1 0 LL.svg|thumb|240px|right|A hash function that maps names to integers from 0 to 15. There is a [[hash collision|collision]] between keys "John Smith" and "Sandra Dee".]]
 
A '''hash function''' is any [[Function (mathematics)|function]] that can be used to map [[data (computing)|data]] of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output.<ref>{{cite conference |last1=Aggarwal |first1=Kirti |last2=Verma |first2=Harsh K. |date=March 19, 2015 |url=https://ieeexplore.ieee.org/document/7164747 |title=Hash_RC6 — Variable length Hash algorithm using RC6 |doi=10.1109/ICACEA.2015.7164747 |conference=2015 International Conference on Advances in Computer Engineering and Applications (ICACEA) |access-date=January 24, 2023}}</ref> The values returned by a hash function are called ''hash values'', ''hash codes'', (''hash digests/message'',) ''digests'', or simply ''hashes''.<ref>
*{{cite web|url=https://csrc.nist.gov/glossary/term/hash_digest|title=hash digest|publisher=[[NIST]] |website=Computer Security Resource Center - Glossary}}
*{{cite web hash|url=https://csrc.nist.gov/glossary/term/message_digest |title=message digest |access-datepublisher=January[[NIST]] 1,|website=Computer 2024Security Resource Center - Glossary}}</ref> or simply ''hashes''. The values are usually used to index a fixed-size table called a ''[[hash table]]''. Use of a hash function to index a hash table is called ''hashing'' or ''scatter-storage addressing''.
 
Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval. They require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally- and storage-space-efficient form of data access that avoids the non-constant access time of ordered and unordered lists and structured trees, and the often-exponential storage requirements of direct access of state spaces of large or variable-length keys.
Line 21 ⟶ 23:
*Scramble the bits of the key so that the resulting values are uniformly distributed over the [[Key space (cryptography)|keyspace]], and
*Map the key values into ones less than or equal to the size of the table.
A good hash function satisfies two basic properties: it should be very fast to compute, and it should minimize duplication of output values ([[Hash collision|collisions]]). Hash functions rely on generating favorable [[Probabilityprobability distribution|probability distributions]]s for their effectiveness, reducing access time to nearly constant. High table loading factors, [[Pathological (mathematics)|pathological]] key sets, and poorly designed hash functions can result in access times approaching linear in the number of items in the table. Hash functions can be designed to give the best worst-case performance,<ref group=Notes>This is useful in cases where keys are devised by a malicious agent, for example in pursuit of a DOS attack.</ref> good performance under high table loading factors, and in special cases, perfect (collisionless) mapping of keys into hash codes. Implementation is based on parity-preserving bit operations (XOR and ADD), multiply, or divide. A necessary adjunct to the hash function is a collision-resolution method that employs an auxiliary data structure like [[linked list]]s, or systematic probing of the table to find an empty slot.
 
== Hash tables ==
Line 69 ⟶ 71:
Because collisions should be infrequent, and cause a marginal delay but are otherwise harmless, it is usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions.
 
Division-based implementations can be of particular concern because thea division isrequires [[Microprogramming|microprogrammed]]multiple cycles on nearly all chipprocessor architectures[[microarchitecture]]s. Division ([[Modulo operation|modulo]]) by a constant can be inverted to become a multiplication by the word-size multiplicative-inverse of that constant. This can be done by the programmer, or by the compiler. Division can also be reduced directly into a series of shift-subtracts and shift-adds, though minimizing the number of such operations required is a daunting problem; the number of assemblymachine-language instructions resulting may be more than a dozen and swamp the pipeline. If the architecturemicroarchitecture has [[hardware multiply]] [[Functionalfunctional unit|functional units]]s, then the multiply-by-inverse is likely a better approach.
 
We can allow the table size {{math|''n''}} to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let {{math|''n''}} be significantly less than {{math|2<sup>''b''</sup>}}. Consider a [[pseudorandom number generator]] function {{math|''P''(key)}} that is uniform on the interval {{math|[0, 2<sup>''b''</sup>&nbsp;−&nbsp;1]}}. A hash function uniform on the interval {{math|[0, ''n'' &minus; 1]}} is {{math|''n'' ''P''(key) / 2<sup>''b''</sup>}}. We can replace the division by a (possibly faster) right [[bit shifting|bit shift]]: {{math|''n P''(key) >> ''b''}}.
Line 77 ⟶ 79:
=== Universality ===
{{main|Universal hashing}}
A ''universal hashing'' scheme is a [[randomized algorithm]] that selects a hash function {{math|''h''}} among a family of such functions, in such a way that the probability of a collision of any two distinct keys is {{math|1/''m''}}, where {{math|''m''}} is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash [[function application]] will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function.
 
===Applicability===
Line 86 ⟶ 88:
* [[File verification|Integrity checking]]: Identical hash values for different files imply equality, providing a reliable means to detect file modifications.
* [[Key derivation function|Key derivation]]: Minor input changes result in a random-looking output alteration, known as the diffusion property. Thus, hash functions are valuable for key derivation functions.
* [[Message authentication code|Message Authentication Codes]]s (MACs): Through the integration of a confidential key with the input data, hash functions can generate MACs ensuring the genuineness of the data, such as in [[HMAC]]s.
* Password storage: The password's hash value does not expose any password details, emphasizing the importance of securely storing hashed passwords on the server.
* [[Digital signature|Signatures]]: Message hashes are signed rather than the whole message.
 
=== Deterministic ===
A hash procedure must be [[deterministic algorithm|deterministic]]—for a given input value, it must always generate the same hash value. In other words, it must be a [[function (mathematics)|function]] of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as [[pseudo-random number generator]]s or the time of day. It also excludes functions that depend on the [[memory address]] of the object being hashed, because the address may change during execution (as may happen on systems that use certain methods of [[garbage collection (computer science)|garbage collection]]), although sometimes rehashing of the item is possible.
 
The determinism is in the context of the reuse of the function. For example, [[Python (programming language)|Python]] adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts in addition to the input to be hashed.<ref>{{Cite web|url=https://docs.python.org/3/reference/datamodel.html#object.__hash__|title=3. Data model — Python 3.6.1 documentation|website=docs.python.org|access-date=2017-03-24}}</ref> The Python hash ([[SipHash]]) is still a valid hash function when used within a single run, but if the values are persisted (for example, written to disk), they can no longer be treated as valid hash values, since in the next run the random value might differ.
Line 129 ⟶ 131:
=== Trivial hash function ===
If the keys are uniformly or sufficiently uniformly distributed over the key space, so that the key values are essentially random, then they may be considered to be already "hashed". In this case, any number of any bits in the key may be extracted and collated as an index into the hash table. For example, a simple hash function might mask off the {{Mvar|m}} least significant bits and use the result as an index into a hash table of size {{Math|2<sup>''m''</sup>}}.
 
=== Folding ===
A folding hash code is produced by dividing the input into sections of {{Mvar|m}} bits, where {{Math|2<sup>''m''</sup>}} is the table size, and using a parity-preserving bitwise operation such as ADD or XOR to combine the sections, followed by a mask or shifts to trim off any excess bits at the high or low end. For example, for a table size of 15 bits and a 64-bit key value of 0x0123456789ABCDEF, there are five sections consisting of 0x4DEF, 0x1357, 0x159E, 0x091A, and 0x0. Adding yields 0x7FFE, a 15-bit value.
 
=== Mid-squares ===
Line 137 ⟶ 136:
 
=== Division hashing ===
A standard technique is to use a modulo function on the key, by selecting a divisor {{Mvar|M}} which is a prime number close to the table size, so {{Math|''h''(''K'') &equiv; ''K'' (mod ''M'')}}. The table size is usually a power of 2. This gives a distribution from {{Math|{{mset|0, ''M'' &minus; 1}}}}. This gives good results over a large number of key sets. A significant drawback of division hashing is that division isrequires microprogrammedmultiple cycles on most modern architectures (including [[x86]]) and can be 10 times slower than multiplication. A second drawback is that it will not break up clustered keys. For example, the keys 123000, 456000, 789000, etc. modulo 1000 all map to the same address. This technique works well in practice because many key sets are sufficiently random already, and the probability that a key set will be cyclical by a large prime number is small.
 
=== Algebraic coding ===
Line 155 ⟶ 154:
=== Multiplicative hashing ===
Standard multiplicative hashing uses the formula {{Math|1=''h''<sub>''a''</sub>(''K'') = {{floor|(''aK'' mod ''W'') / (''W''/''M'')}}}}, which produces a hash value in {{Math|{{mset|0, &mldr;, ''M'' &minus; 1}}}}. The value {{Mvar|a}} is an appropriately chosen value that should be [[Coprime integers|relatively prime]] to {{Mvar|W}}; it should be large,{{Clarify|reason=How large?|date=January 2021}} and its binary representation a random mix{{Clarify|reason="a" is a constant, so it cannot be random|date=January 2021}} of 1s and 0s. An important practical special case occurs when {{Math|1=''W'' = 2<sup>''w''</sup>}} and {{Math|1=''M'' = 2<sup>''m''</sup>}} are powers of 2 and {{Mvar|w}} is the machine [[word size]]. In this case, this formula becomes {{Math|1=''h''<sub>''a''</sub>(''K'') = {{floor|(''aK'' mod 2<sup>''w''</sup>) / 2<sup>''w''&minus;''m''</sup>}}}}. This is special because arithmetic modulo {{Math|2<sup>''w''</sup>}} is done by default in low-level programming languages and integer division by a power of 2 is simply a right-shift, so, in [[C (programming language)|C]], for example, this function becomes
<syntaxhighlight lang="c">
unsigned hash(unsigned K) {
return (a * K) >> (w - m);
}
</syntaxhighlight>
and for fixed {{Mvar|m}} and {{Mvar|w}} this translates into a single integer multiplication and right-shift, making it one of the fastest hash functions to compute.
 
Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion&mdash;higher-value input bits do not affect lower-value output bits.<ref>
{{cite web |url=httphttps://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html |title=CS 3110 Lecture 21: Hash functions |at=Section "Multiplicative hashing"}}</ref> A transmutation on the input which shifts the span of retained top bits down and XORs or ADDs them to the key before the multiplication step corrects for this. The resulting function looks like:<ref name="fibonacci-hashing"/>
<syntaxhighlight lang="c">
unsigned hash(unsigned K) {
K ^= K >> (w-m);
returnK ^= (a*K) >> (w - m);
Kreturn ^=(a * K) >> (w - m);
}
</syntaxhighlight>
 
=== Fibonacci hashing ===
Line 178 ⟶ 181:
=== Zobrist hashing ===
{{main | Tabulation hashing|Zobrist hashing}}
[[Tabulation hashing]], more generally known as ''Zobrist hashing'' after [[Albert Lindsey Zobrist|Albert Zobrist]], is a method for constructing universal families of hash functions by combining table lookup with XOR operations. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).<ref>{{citation|first=Albert L.|last=Zobrist|author-link= Albert Lindsey Zobrist|url=https://www.cs.wisc.edu/techreports/1970/TR88.pdf|title=A New Hashing Method with Application for Game Playing|series=Tech. Rep. 88|publisher=Computer Sciences Department, University of Wisconsin|___location=Madison, Wisconsin|date=April 1970}}.</ref>
 
Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64×12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation) and XORing them together (the starting value could be 0 (the identity value for XOR) or a random seed). The resulting value was reduced by modulo, folding, or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position.
Line 196 ⟶ 199:
 
=== Character folding ===
The paradigmatic example of folding by characters is to add up the integer values of all the characters in the string. A better idea is to multiply the hash total by a constant, typically a sizable prime number, before adding in the next character, ignoring overflow. Using exclusive-or instead of addition is also a plausible alternative. The final operation would be a modulo, mask, or other function to reduce the word value to an index the size of the table. The weakness of this procedure is that information may cluster in the upper or lower bits of the bytes; this clustering will remain in the hashed result and cause more collisions than a proper randomizing hash. ASCII byte codes, for example, have an upper bit of 0, and printable strings do not use the last byte code or most of the first 32 byte codes, so the information, (95which bytecodes)uses the remaining byte codes, is clustered in the remaining bits in an unobvious manner.
 
The classic approach, dubbed the [[PJW hash function|PJW hash]] based on the work of [[Peter J. Weinberger]] at [[Bell Labs]] in the 1970s, was originally designed for hashing identifiers into compiler symbol tables as given in the [[Compilers: Principles, Techniques, and Tools|"Dragon Book"]].<ref>{{cite book |first1=A. |last1=Aho |author-link1=Alfred Aho |first2=R. |last2=Sethi |author-link2=Ravi Sethi |first3=J. D. |last3=Ullman |author-link3=Jeffrey Ullman |date=1986 |title=Compilers: Principles, Techniques and Tools |page=435 |publisher=[[Addison-Wesley]] |___location=Reading, MA |isbn=0-201-10088-6}}</ref> This hash function offsets the bytes 4 bits before adding them together. When the quantity wraps, the high 4 bits are shifted out and if non-zero, [[Exclusive or|xored]] back into the low byte of the cumulative quantity. The result is a word-size hash code to which a modulo or other reducing operation can be applied to produce the final hash index.
Line 212 ⟶ 215:
{{Main|Rolling hash}}
{{See also|Linear congruential generator}}
In some applications, such as [[string searching algorithm|substring search]], one can compute a hash function {{math|''h''}} for every {{math|''k''}}-character [[substring]] of a given {{math|''n''}}-character string by advancing a window of width {{math|''k''}} characters along the string, where {{math|''k''}} is a fixed integer, and {{Math|''n'' > ''k''}}. The straightforward solution, which is to extract such a substring at every character position in the text and compute {{math|''h''}} separately, requires a number of operations proportional to {{math|''k''·''n''}}. However, with the proper choice of {{math|''h''}}, one can use the technique of rolling hash to compute all those hashes with an effort proportional to {{math|''mk''&nbsp;+&nbsp;''n''}} where {{math|''m''}} is the number of occurrences of the substring.<ref>{{citationCite book needed|datelast=NovemberSingh 2022|first=N. B. |url=https://books.google.com/books?id=ALIMEQAAQBAJ&dq=rolling+hash&pg=PT102 |title=A Handbook of Algorithms |publisher=N.B. Singh |language=en}}</ref>{{fix|text=what is the choice of h?}}
 
The most familiar algorithm of this type is [[Rabin-Karp]] with best and average case performance {{math|''O''(''n''+''mk'')}} and worst case {{math|''O''(''n''·''k'')}} (in all fairness, the worst case here is gravely pathological: both the text string and substring are composed of a repeated single character, such as {{math|''t''}}="AAAAAAAAAAA", and {{math|''s''}}="AAA"). The hash function used for the algorithm is usually the [[Rabin fingerprint]], designed to avoid collisions in 8-bit character strings, but other suitable hash functions are also used.
Line 249 ⟶ 252:
== External links ==
{{Wiktionary|hash}}
*[http://tools.timodenk.com/?p=hash-function Calculate hash of a given value] by Timo Denk
*[http://www.sinfocol.org/archivos/2009/11/Goulburn06.pdf The Goulburn Hashing Function] ([[Portable Document Format|PDF]]) by Mayur Patel
*[https://dspace5.zcu.cz/bitstream/11025/11784/1/Skala_2010_Corfu-NAUN-Hash.pdf Hash Function Construction for Textual and Geometrical Data Retrieval] ([[Portable Document Format|PDF]]) Latest Trends on Computers, Vol.2, pp.&nbsp;483–489, CSCC Conference, Corfu, 2010