Linear hashing

This is an old revision of this page, as edited by 218.215.137.243 (talk) at 02:12, 4 February 2010 (Algorithm Details). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Linear hashing is a dynamic hash table algorithm invented by Witold Litwin (1980) [1], and later popularized by Paul Larson. Linear hashing allows for the expansion of the hash table one slot at a time. The frequent single slot expansion can very effectively control the length of the collision chain. The cost of hash table expansion is spread out across each hash table insertion operation, as opposed to be incurred all at once. [2] Therefore linear hashing is well suited for interactive applications.

Algorithm Details

As usual, a hash function controls the address calculation of linear hashing. In linear hashing, the address calculation is always bounded by a size that is a [[power of two] * N, where N is the chosen original number of buckets. The number of buckets is given by N * 2 ^ Level e.g. Level 0, N; Level 1, 2N; Level 2 , 4N.

 address(level,key) = hash(key) mod  N * (2level)

The 'split' variable controls the read operation, and the expansion operation.

A read operation would use address(level,key) if address(level,key) is greater than or equal to the 'split' variable. Otherwise, address(level+1,key) is used.

A linear hashing table expansion operation would consist of rehashing the entries at slot ___location indicated by the 'split' variable to the target slot ___location of address(level+1,key). The 'split' variable is incremented by 1 at the end of the expansion operation. If the 'split' variable reaches N * 2level, then the 'level' variable is incremented by 1, and the 'split' variable is reset to 0.

Thus the hash buckets are expanded round robin, and seem unrelated to where buckets overflow at the time of expansion. Overflow buckets are used at the sites of bucket overflow ( the normal bucket has a pointer to the overflow bucket), but these are eventually reabsorbed when the round robin comes to the bucket with the overflow bucket, and the contents of that bucket and the overflow bucket are redistributed by the future hash function ( hash(key) mod N * (2 level+1 ).

The degenerate case, which is unlikely with a randomized hash function, is that enough entries are hashed to the same bucket so that there is enough entries to overflow more than one overflow bucket ( assuming overflow bucket size = normal bucket size), before being absorbed when that bucket's turn to split comes in the round robin.

The point of the algorithm seems to be that overflow is preempted by gradually increasing the number of available buckets, and overflow buckets are eventually reabsorbed during a later split, which must eventually happen because splitting occurs round robin.

One problem is that after a bucket is split, its old hash function has been replaced by the newer hash function ( hash(key) mod N * (2 level+1 ). To determine whether the old mod 2 level should be used, the old hash is applied first, and then if it is < split variable , then it is recalculated with mod N * 2 level +1 . Note this depends on the property that for any y = x mod M , y' = y or y + M if calculated as y' = x mod M * 2 . Therefore the new hash function will only hash to buckets created by previous splits.

There is some flexibility in choosing how often the expansion operations are performed. One obvious choice is to perform the expansion operation each time no more slots are available at the target slot ___location. Another choice is to control the expansion with a programmer defined load factor.

The hash table array for linear hashing is usually implemented with a dynamic array algorithm.

Adoption in language systems

Griswold and Townsend [3] discussed the adoption of linear hashing in the Icon language. They discussed the implementation alternatives of dynamic array algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.

References

  1. ^ Litwin, Witold (1980), "Linear hashing: A new tool for file and table addressing" (PDF), Proc. 6th Conference on Very Large Databases: 212–223
  2. ^ Larson, Per-Åke (April 1988), "Dynamic Hash Tables", Communications of the ACM, 31: 446–457, doi:10.1145/42404.42410 {{citation}}: Unknown parameter |Number= ignored (|number= suggested) (help)
  3. ^ Griswold, William G.; Townsend, Gregg M. (April 1993), "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon", Software - Practice and Experience, 23 (4): 351–367

See also