Linear hashing: Difference between revisions

Content deleted Content added
m Addressing: reformatted
Tags: Mobile edit Mobile app edit Android app edit
Bender the Bot (talk | contribs)
m External links: HTTP to HTTPS for SourceForge
 
(22 intermediate revisions by 7 users not shown)
Line 15:
{{Citation | first1 = K. |last1 = Ramamohanarao | first2 = R. | last2 = Sacks-Davis |
title = Recursive linear hashing |
journal = ACM Transactions on DatabasesDatabase Systems|
volume=9 | number=3 |date=Sep 1984 | pages=369–391|doi = 10.1145/1270.1285 |s2cid = 18577730 |doi-access = free }}
</ref>
 
The file structure of a dynamic hashing data structure adapts itself to changes in the size of the file, so expensive periodic file reorganization is avoided.<ref name=RD/> A Linear Hashing file expands by splitting a pre-determinedpredetermined bucket into two and contractsshrinks by merging two predetermined buckets into one. The trigger for a reconstruction depends on the flavor of the scheme; it could be an overflow at a bucket or [[load factor (computer science)|load factor]] (i.e., the number of records divided by the number of buckets) moving outside of a predetermined range.<ref name=WL80 /> In Linear Hashing there are two types of buckets, those that are to be split and those already split. While extendible hashing splits only overflowing buckets, [[spiral Hashing|spiral hashing]] (a.k.a. spiral storage) distributes records unevenly over the buckets such that buckets with high costs of insertion, deletion, or retrieval are earliest in line for a split.<ref name=AL/>
 
Linear Hashing has also been made into a scalable distributed data structure, '''LH*'''. In LH*, each bucket resides at a different server.<ref name=WL93>
Line 39:
 
===Hash functions===
InThe orderhash tofunction access<math>h_i(c)</math> areturns the 0-based index of the bucket that contains the record with key <math>c</math>,. When a familybucket ofwhich uses the hash functions,function called<math>h_i</math> collectivelyis asplit dynamicinto two new buckets, the hash function <math>h_i</math> is appliedreplaced to the keywith <math>ch_{i+1}</math> for both of those new buckets. At any time, at most two hash functions <math>h_ih_l</math> and <math>h_{il+1}</math> are used.; Asuch typicalthat example uses the division modulo x operation in which a record<math>l</math> corresponds to a bucket in which all the keys have the samecurrent rightmost binary digits'''level'''. IfThe the original numberfamily of bucketshash isfunctions <math>N</math> h_i(Usually <math>N=1c)</math>), thenis also referred to as the family ofdynamic hash functions isfunction.
 
Typically, the value of <math>i</math> in <math>h_i</math> corresponds to the number of rightmost binary digits of the key <math>c</math> that are used to segregate the buckets. This dynamic hash function can be expressed arithmetically as <math display="inline"> h_i(c) \mapsto (c \bmod 2^i) </math>. Note that when the total number of buckets is equal to one, <math>i=0</math>.
<math> h_i(c) \mapsto c \pmod{N \cdot 2^i}</math>.
 
Complete the calculations below to determine the correct hashing function for the given hashing key <math>c</math>.<ref name="LMS" /><syntaxhighlight lang="python">
When a bucket which uses the hash function <math>h_i</math> is split into two new buckets, the hash function <math>h_i</math> is replaced with <math>h_{i+1}</math> for both of those new buckets. When the total number of buckets is equal to one, <math>i=0</math>.<ref name="LMS" />
# l represents the current level
 
# s represents the split pointer index
==== Addressing ====
<math>a := h_l(c)</math>
Addressing is based on the file state, consisting of the split pointer <math>s</math>
<math>\text{if } (a < s): \quad a := h_{l+1}(c)</math>.
and the level <math>l</math>. If the level is <math>l</math>, then the hash functions
</syntaxhighlight>
used are <math>h_l</math> and <math>h_{l+1}</math>.
 
The LH algorithm for hashing key <math>c</math> is<ref name="LMS" />
 
<math>a := h_l(c)</math>
 
<math>\text{if } (a < s): \quad a := h_{l+1}(c)</math>.
 
===Split control===
Linear hashing algorithms may use only controlled splits or both controlled and uncontrolled splits.
 
'''Controlled splitting''' occurs if a split is performed whenever the [[load factor (computer science)|load factor]], which is monitored by the file, exceeds a predetermined threshold.<ref name="LMS" /> If the hash index uses controlled splitting, the buckets are allowed to overflow by using linked overflow blocks. When the ''load factor'' surpasses a set threshold, the ''split pointer's'' designated bucket is split. Instead of using the load factor, this threshold can also be expressed as an occupancy percentage, in which case, the maximum number of records in the hash index equals (occupancy percentage)*(max records per non-overflowed bucket)*(number of buckets).<ref name=":0">{{Cite book |lastlast1=Silberschatz |firstfirst1=Abraham |title=Database system concepts |last2=Korth |first2=Henry F. |last3=Sudarshan |first3=S. |date=2020 |publisher=McGraw-Hill Education |isbn=978-1-260-08450-4 |edition=Seventh |___location=New York, NY}}</ref>
 
An '''uncontrolled split''' occurs when a split is performed whenever a bucket overflows, in which case that bucket would be split into two separate buckets.
 
'''Controlled splitting''' occurs if a split is performed whenever the load factor, which is monitored by the file, exceeds a predetermined threshold.<ref name="LMS" /> If the hash index uses controlled splitting, the buckets are allowed to overflow by using linked overflow blocks. When the ''load factor'' surpasses a set threshold, the designated bucket is split. Instead of using the load factor, this threshold can also be expressed as an occupancy percentage, in which case, the maximum number of records in the hash index equals (occupancy percentage)*(max records per non-overflowed bucket)*(number of buckets).<ref name=":0">{{Cite book |last=Silberschatz |first=Abraham |title=Database system concepts |last2=Korth |first2=Henry F. |last3=Sudarshan |first3=S. |date=2020 |publisher=McGraw-Hill Education |isbn=978-1-260-08450-4 |edition=Seventh |___location=New York, NY}}</ref>
 
'''File contraction''' occurs in some LH algorithm implementations if a controlled split causes the load factor to sink below a threshold. In this case, a merge operation would be triggered which would undo the last split, and reset the file state.<ref name="LMS" />
 
===Split pointer===
The numberindex of the next bucket to be split is part of the file state and called the '''split pointer''' <math>s</math>. WhenThe asplit bucketpointer iscorresponds split,to splitthe pointerfirst andbucket possiblythat uses the levelhash arefunction updated<math>h_l</math> accordinginstead toof <math>h_{l+1}</math>.<ref name="LMS" />
 
For example, if numerical records are inserted into the hash index according to their farthest right binary digits, the bucket corresponding with the appended bucket will be split. Thus, if we have the buckets labelled as 000, 001, 10, 11, 100, 101, we would split the bucket 10 because we are appending and creating the next sequential bucket 110. This would give us the buckets 000, 001, 010, 11, 100, 101, 110.<ref name=":0" />
<math>s := s+1</math>
 
When a bucket is split, split pointer and possibly the level are updated according to the following, such that the level is 0 when the linear hashing index only has 1 bucket.<ref name="LMS" /><syntaxhighlight lang="python">
<math>\text{if } (s \ge N\times 2^l): \quad l:=l+1, \ s:=0</math>.<ref name="LMS" />
# l represents the current level
 
# s represents the split pointer index
For example, if numerical records are inserted into the hash index according to their farthest right binary digits, the bucket corresponding with the appended bucket will be split. Thus, if we have the buckets labelled as 000, 001, 10, 11, 100, 101, we would split the bucket 10 because we are appending and creating the next sequential bucket 110. This would give us the buckets 000, 001, 010, 11, 100, 101, 110.<ref name=":0" />
s = s + 1
if (s >= 2^l):
l = l + 1
s = 0
</syntaxhighlight>
 
===LH*===
Line 106 ⟶ 105:
 
==External links==
* [httphttps://tommyds.sourceforge.net/ TommyDS, C implementation of a Linear Hashtable]
* [http://hackthology.com/an-in-memory-go-implementation-of-linear-hashing.html An in Memory Go Implementation with Explanation]
* [https://github.com/KevinStern/index-cpp/blob/master/src/linear_hashing_table.h A C++ Implementation of Linear Hashtable which Supports Both Filesystem and In-Memory storage]