'''Linear hashing''' ('''LH''') is a dynamic data structure which implements a [[hash table]] and grows or shrinks one bucket at a time. It was invented by Witold Litwin in 1980.<ref name=WL80>
{{Citation | first1=Witold | last1=Litwin | title=Linear hashing: A new tool for file and table addressing | journal=Proc. 6th Conference on Very Large Databases | pages=212–223 | year=1980 }}
</ref>
<ref name=Ellis>
{{Citation | first1 = Carla Schlatter | last1=Ellis | title=Concurrency in Linear Hashing | journal=ACM Transactions on Database Systems | volume=12 | number=2 | pages=195–217 | date=June 1987| doi=10.1145/22952.22954 | s2cid=14260177 | doi-access=free }}
</ref> It has been analyzed by Baeza-Yates and Soza-Pollman.<ref name=BS>{{Citation | first1=Ricardo | last1=Baeza-Yates | first2=Hector | last2=Soza-Pollman | title=Analysis of Linear Hashing Revised | journal=Nordic Journal of Computing | pages=70–85 | year=1998 | s2cid=7497598 | url=http://pdfs.semanticscholar.org/e6cd/667fef7cd377ed8d417cc648d3d578675ad5.pdf | archive-url=https://web.archive.org/web/20190307204217/http://pdfs.semanticscholar.org/e6cd/667fef7cd377ed8d417cc648d3d578675ad5.pdf | url-status=dead | archive-date=2019-03-07 }}</ref> It is the first in a number of schemes known as dynamic hashing<ref name=BS/>
<ref name=RD>{{Citation | first1=Richard | last1=Enbody | first2 = HC | last2 = Du | title=Dynamic hashing schemes | journal=ACM Computing Surveys | pages=85–113 | date=June 1988 | volume=20 | number=2 | doi=10.1145/46157.330532 | s2cid=1437123 | doi-access=free }}</ref> such as Larson's Linear Hashing with Partial Extensions, <ref name=AL>{{Citation | first1=Per-Åke | last1=Larson | title=Dynamic Hash Tables | journal=Communications of the ACM | pages=446–457 | date=April 1988 | volume=31 | number=4 | doi=10.1145/42404.42410| s2cid=207548097 | doi-access=free }}</ref> Linear Hashing with Priority Splitting,<ref name=ruchte>
{{Citation | first1 = Willard | last1 = Ruchte | first2 = Alan | last2 = Tharp |
title = Linear hashing with Priority Splitting: A method for improving the retrieval performance of linear hashing |
journal = IEEE Third International Conference on Data Engineering |
date = Feb 1987 | pages=2–9}}
</ref> Linear Hashing with Partial Expansions and Priority Splitting,<ref>
{{Citation | first1 = Yannis | last1 = Manolopoulos | first2 = N. | last2 = Lorentzos |
title = Performance of linear hashing schemes for primary key retrieval |
journal = Information Systems | volume = 19 | number = 5 |date = 1994 | pages =433–446| doi = 10.1016/0306-4379(94)90005-1 }}
</ref> or Recursive Linear Hashing.<ref name=RS>
{{Citation | first1 = K. |last1 = Ramamohanarao | first2 = R. | last2 = Sacks-Davis |
title = Recursive linear hashing |
journal = ACM Transactions on Database Systems|
volume=9 | number=3 |date=Sep 1984 | pages=369–391|doi = 10.1145/1270.1285 |s2cid = 18577730 |doi-access = free }}
</ref>
 
The file structure of a dynamic hashing data structure adapts itself to changes in the size of the file, so expensive periodic file reorganization is avoided.<ref name=RD/> A Linear Hashing file expands by splitting a predetermined bucket into two and shrinks by merging two predetermined buckets into one. The trigger for a reconstruction depends on the flavor of the scheme; it could be an overflow at a bucket or the [[load factor (computer science)|load factor]] (i.e., the number of records divided by the number of buckets) moving outside of a predetermined range.<ref name=WL80 /> In Linear Hashing there are two types of buckets: those that are to be split and those already split. While extendible hashing splits only overflowing buckets, [[spiral Hashing|spiral hashing]] (a.k.a. spiral storage) distributes records unevenly over the buckets such that buckets with high costs of insertion, deletion, or retrieval are earliest in line for a split.<ref name=AL/>
 
Linear Hashing has also been made into a scalable distributed data structure, '''LH*'''. In LH*, each bucket resides at a different server.<ref name=WL93>
{{Citation | first1=Witold | last1=Litwin | first2=Marie-Anne |last2 = Neimat |
first3 = Donavan A. | last3 = Schneider | title=LH: Linear Hashing for distributed files | journal=ACM SIGMOD Record |
pages = 327–336 | date = 1993| volume=22 | issue=2 | doi=10.1145/170036.170084 | s2cid=259938726 }}
</ref> LH* itself has been expanded to provide data availability in the presence of failed buckets.<ref name=LMS>
{{Citation | first1=Witold | last1=Litwin | first2 = Rim | last2 = Moussa |
first3 = Thomas | last3 = Schwarz | title=LH*RS - a highly-available scalable distributed data structure |journal=ACM Transactions on Database Systems | volume=30 | number=3 | pages=769–811 | date=Sep 2005| doi=10.1145/1093382.1093386 | s2cid=1802386 | url=https://basepub.dauphine.fr/handle/123456789/15124 }}
</ref> Key-based operations (inserts, deletes, updates, reads) in LH and LH* take at most constant time, independent of the number of buckets and hence of the number of records.<ref name = WL80/><ref name=LMS/>
 
==Algorithm details==
 
Records in LH or LH* consist of a key and a content, the latter being essentially all the other attributes of the record.<ref name=WL80/><ref name=LMS/> They are stored in buckets. For example, in Ellis' implementation, a bucket is a linked list of records.<ref name=Ellis/> The file allows the key-based CRUD operations create (insert), read, update, and delete, as well as a scan operation that scans all records, for example to perform a database select operation on a non-key attribute.<ref name=LMS/> Records are stored in buckets whose numbering starts with 0.<ref name=LMS/>
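
As an illustration of this layout only (no particular implementation is implied), a record can be modelled as a key together with its content, and a bucket as a container of such records; the names <code>Record</code> and <code>Bucket</code> below are purely illustrative stand-ins, with a Python list taking the place of Ellis' linked list.<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Record:
    key: int        # the key the hash function is applied to
    content: dict   # all other attributes of the record

# A bucket holds the records hashed to it; a plain list stands in
# for the linked list of records used in Ellis' implementation.
Bucket = list

# Buckets are numbered starting with 0.
buckets: list[Bucket] = [[] for _ in range(4)]
buckets[0].append(Record(key=8, content={"name": "example"}))
</syntaxhighlight>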
 
The key distinction from schemes such as Fagin's extendible hashing is that as the file expands due to insertions, only one bucket is split at a time, and the order in which buckets are split is already predetermined.<ref name="Fagin">
{{Citation |last1=Fagin |first1=Ronald |title=Extendible Hashing - A Fast Access Method for Dynamic Files |date=Sep 1979 |url=http://dl.acm.org/citation.cfm?doid=320083.320092 |journal=ACM Transactions on Database Systems |volume=4 |number=2 |pages=315–344 |doi=10.1145/320083.320092 |s2cid=2723596 |last2=Nievergelt |first2=Jurg |last3=Pippenger |first3=Nicholas |last4=Strong |first4=Raymond}}
</ref>
 
===Hash functions===
The hash function <math>h_i(c)</math> returns the 0-based index of the bucket that contains the record with key <math>c</math>. When a bucket which uses the hash function <math>h_i</math> is split into two new buckets, the hash function <math>h_i</math> is replaced with <math>h_{i+1}</math> for both of those new buckets. At any time, at most two hash functions <math>h_l</math> and <math>h_{l+1}</math> are used, where <math>l</math> corresponds to the current '''level'''. The family of hash functions <math>h_i(c)</math> is also referred to as the dynamic hash function.
 
Typically, the value of <math>i</math> in <math>h_i</math> corresponds to the number of rightmost binary digits of the key <math>c</math> that are used to segregate the buckets. This dynamic hash function can be expressed arithmetically as <math display="inline"> h_i(c) \mapsto (c \bmod 2^i) </math>. Note that when the total number of buckets is equal to one, <math>i=0</math>.
 
The calculation below determines which hash function to apply to a given hashing key <math>c</math>.<ref name="LMS" /><syntaxhighlight lang="python">
# l represents the current level
# s represents the split pointer index
# h(i, c) denotes the hash function h_i(c) = c mod 2**i
a = h(l, c)
if a < s:  # bucket a has already been split in this round
    a = h(l + 1, c)
</syntaxhighlight>
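
A self-contained sketch of this address calculation, assuming the file started with a single bucket so that <math>h_i(c) = c \bmod 2^i</math>; the function names <code>h</code> and <code>address</code> are illustrative only.<syntaxhighlight lang="python">
def h(i, c):
    """Hash function at level i: keep the i rightmost binary digits of the key."""
    return c % (2 ** i)

def address(c, l, s):
    """Return the bucket index for key c, given level l and split pointer s."""
    a = h(l, c)
    if a < s:          # bucket a has already been split in this round
        a = h(l + 1, c)
    return a

# Example with level l = 2 and split pointer s = 2 (buckets 0 and 1 already split):
# key 6 = 0b110 stays in bucket h_2(6) = 2, while key 5 = 0b101 first maps to
# bucket 1, which has already been split, so h_3(5) = 5 is used instead.
assert address(6, 2, 2) == 2
assert address(5, 2, 2) == 5
</syntaxhighlight>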
 
===Split control===
Linear hashing algorithms may use only controlled splits, only uncontrolled splits, or both controlled and uncontrolled splits.
 
Under '''controlled splitting''', a split is performed whenever the [[load factor (computer science)|load factor]], which is monitored by the file, exceeds a predetermined threshold.<ref name="LMS" /> If the hash index uses controlled splitting, the buckets are allowed to overflow by using linked overflow blocks. When the ''load factor'' surpasses the threshold, the bucket designated by the ''split pointer'' is split. Instead of using the load factor, this threshold can also be expressed as an occupancy percentage, in which case the maximum number of records in the hash index equals (occupancy percentage) × (maximum number of records per non-overflowed bucket) × (number of buckets).<ref name=":0">{{Cite book |last1=Silberschatz |first1=Abraham |title=Database system concepts |last2=Korth |first2=Henry F. |last3=Sudarshan |first3=S. |date=2020 |publisher=McGraw-Hill Education |isbn=978-1-260-08450-4 |edition=Seventh |___location=New York, NY}}</ref>
 
An '''uncontrolled split''' is performed whenever a bucket overflows, in which case that bucket is split into two separate buckets.
 
'''File contraction''' occurs in some LH algorithm implementations when, under controlled splitting, the load factor sinks below a threshold. In that case, a merge operation is triggered that undoes the last split and resets the file state.<ref name="LMS" />
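
A minimal sketch of the controlled-split trigger described above; the threshold value and the function name <code>needs_split</code> are illustrative assumptions rather than part of the scheme.<syntaxhighlight lang="python">
MAX_LOAD_FACTOR = 0.75  # illustrative threshold: average number of records per bucket

def needs_split(num_records, num_buckets):
    """Controlled splitting: the bucket designated by the split pointer
    is split whenever the load factor exceeds the threshold."""
    return num_records / num_buckets > MAX_LOAD_FACTOR

# With 16 buckets the file can hold 0.75 * 16 = 12 records before the
# next insertion pushes the load factor over the threshold.
print(needs_split(12, 16))  # False: a load factor of 0.75 does not exceed the threshold
print(needs_split(13, 16))  # True: a load factor of about 0.81 triggers a split
</syntaxhighlight>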
 
===Split pointer===
The index of the next bucket to be split is part of the file state and is called the '''split pointer''' <math>s</math>. The split pointer corresponds to the first bucket that uses the hash function <math>h_l</math> instead of <math>h_{l+1}</math>.<ref name="LMS" />
 
===Splitting===
 
For example, if numerical records are inserted into the hash index according to their rightmost binary digits, the bucket corresponding to the newly appended bucket is the one that is split. Thus, if the buckets are labelled 000, 001, 10, 11, 100, 101, bucket 10 is split when the next sequential bucket 110 is appended, giving the buckets 000, 001, 010, 11, 100, 101, 110.<ref name=":0" />
 
When a bucket is split, the split pointer and possibly the level are updated as follows, such that the level is 0 when the linear hashing index only has 1 bucket.<ref name="LMS" /><syntaxhighlight lang="python">
# l represents the current level
# s represents the split pointer index
s = s + 1
if s >= 2 ** l:  # a full round of splitting is complete
    l = l + 1
    s = 0
</syntaxhighlight>
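
A minimal sketch of a single split step under these rules, assuming a single initial bucket, buckets holding bare integer keys, and hash functions <math>h_i(c) = c \bmod 2^i</math>; the helper name <code>split</code> is illustrative only.<syntaxhighlight lang="python">
def split(buckets, l, s):
    """Split the bucket at the split pointer s; return the updated (l, s)."""
    old = buckets[s]
    buckets.append([])  # new bucket with index s + 2**l
    buckets[s] = []
    for key in old:     # redistribute with the refined hash function h_{l+1}
        buckets[key % (2 ** (l + 1))].append(key)
    s = s + 1           # advance the split pointer
    if s >= 2 ** l:     # a full round of splitting is complete
        l = l + 1
        s = 0
    return l, s

# Example: 6 buckets (l = 2, s = 2); splitting bucket 2 creates bucket 6.
buckets = [[8], [9], [2, 6], [3, 7], [4], [5]]
l, s = split(buckets, 2, 2)
print(buckets)  # [[8], [9], [2], [3, 7], [4], [5], [6]]
print(l, s)     # 2 3
</syntaxhighlight>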
 
===LH*===
if there is only one split or merge going on while the request is processed, it can
be shown that there are at most two forwards. After a forward, the final bucket sends an
Image Adjustment Message to the client, whose image of the file state is then closer to the state of the distributed file.<ref name="LMS" /> While forwards are reasonably rare for active clients, their number can be reduced even further by additional information exchange between servers and clients.<ref name="CS">
{{Citation |last1=Chabkinian |first1=Juan |title=Fast LH* |date=2016 |journal=International Journal of Parallel Programming |volume=44 |number=4 |pages=709–734 |doi=10.1007/s10766-015-0371-8 |s2cid=7448240 |last2=Schwarz |first2=Thomas}}
</ref>
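
The client side of this addressing can be sketched as follows: the client applies the ordinary LH address calculation to its own, possibly outdated, image <math>(l', s')</math> of the file state, which is why a request may reach a wrong bucket and be forwarded. The sketch below assumes hash functions <math>h_i(c) = c \bmod 2^i</math> and omits the server-side forwarding and Image Adjustment Message logic; the names are illustrative only.<syntaxhighlight lang="python">
def h(i, c):
    return c % (2 ** i)

def client_address(c, client_level, client_split):
    """Address a key using the client's possibly outdated image of the file
    state.  If the image lags behind the actual state, the contacted bucket
    forwards the request (at most twice) and the final bucket replies with
    an Image Adjustment Message so that the image converges."""
    a = h(client_level, c)
    if a < client_split:
        a = h(client_level + 1, c)
    return a

# A client whose image is (l' = 2, s' = 0) sends key 6 to bucket 2, even if
# the file has meanwhile grown to (l = 2, s = 3), where key 6 belongs to
# bucket 6; bucket 2 would then forward the request.
print(client_address(6, 2, 0))  # 2
</syntaxhighlight>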
 
== Other properties ==
 
===File state calculation===
The file state consists of split pointer <math>s</math> and level <math>l</math>. If the original file started with <math>N=1</math> buckets, then the number of buckets <math>n</math> and the file state are related via
<ref name="CS" />
 
<math>n = 2^l+s </math>.
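
Conversely, the file state can be recovered from the number of buckets alone; a minimal sketch, again assuming the file started with a single bucket (the function name <code>file_state</code> is illustrative).<syntaxhighlight lang="python">
def file_state(n):
    """Recover the level l and split pointer s from the number of buckets n,
    assuming the file started with a single bucket (N = 1)."""
    l = n.bit_length() - 1  # the largest l with 2**l <= n
    s = n - 2 ** l
    return l, s

print(file_state(1))   # (0, 0)
print(file_state(6))   # (2, 2)  since 6 = 2**2 + 2
print(file_state(12))  # (3, 4)  since 12 = 2**3 + 4
</syntaxhighlight>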
 
==Adoption in language systems==
 
==External links==
* [https://tommyds.sourceforge.net/ TommyDS, C implementation of a Linear Hashtable]
* [http://hackthology.com/an-in-memory-go-implementation-of-linear-hashing.html An in Memory Go Implementation with Explanation]
* [https://github.com/KevinStern/index-cpp/blob/master/src/linear_hashing_table.h A C++ Implementation of Linear Hashtable which Supports Both Filesystem and In-Memory storage]