{{Short description|Algorithmic technique using hashing}}
In [[computer science]], '''locality-sensitive hashing''' ('''LSH''') is a [[fuzzy hashing]] technique that hashes similar input items into the same "buckets" with high probability.<ref name="MOMD">{{cite web|url=http://infolab.stanford.edu/~ullman/mmds.html|title=Mining of Massive Datasets, Ch. 3.|last1=Rajaraman|first1=A.|last2=Ullman|first2=J.|author2-link=Jeffrey Ullman|year=2010}}</ref> (The number of buckets is much smaller than the universe of possible input items.)<ref name="MOMD" /> Since similar items end up in the same buckets, this technique can be used for [[Cluster analysis|data clustering]] and [[nearest neighbor search]]. It differs from [[Hash function|conventional hashing techniques]] in that [[hash collision]]s are maximized, not minimized. Alternatively, the technique can be seen as a way to [[dimension reduction|reduce the dimensionality]] of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
 
Hashing-based approximate [[nearest-neighbor search]] algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as [[locality-preserving hashing]] (LPH).<ref>{{cite conference |last1=Zhao |first1=Kang |last2=Lu |first2=Hongtao |last3=Mei |first3=Jincheng |title=Locality Preserving Hashing |conference=AAAI Conference on Artificial Intelligence | volume=28 | year=2014 |url=https://ojs.aaai.org/index.php/AAAI/article/view/9133/8992 |pages=2874–2880}}</ref><ref>{{cite book |last1=Tsai |first1=Yi-Hsuan |last2=Yang |first2=Ming-Hsuan |title=2014 IEEE International Conference on Image Processing (ICIP) |chapter=Locality preserving hashing |date=October 2014 |pages=2988–2992 |doi=10.1109/ICIP.2014.7025604 |isbn=978-1-4799-5751-4 |s2cid=8024458 |issn=1522-4880}}</ref>
 
Locality-preserving hashing was initially devised as a way to facilitate [[Pipeline (computing)|data pipelining]] in implementations of [[Parallel RAM|massively parallel]] algorithms that use [[Routing#Path_selection|randomized routing]] and [[universal hashing]] to reduce memory [[Resource contention|contention]] and [[network congestion]].<ref name=Chin1991>{{cite thesis |last=Chin |first=Andrew |date=1991 |title=Complexity Issues in General Purpose Parallel Computing |pages=87–95 |type=DPhil |publisher=University of Oxford |url=https://perma.cc/E47H-WCVP}}</ref><ref name=Chin1994>{{cite journal |last1=Chin |first1=Andrew |date=1994 |title=Locality-Preserving Hash Functions for General Purpose Parallel Computation |url=http://unclaw.com/chin/scholarship/hashfunctions.pdf |journal=Algorithmica |volume=12 |issue=2–3 |pages=170–181 |doi=10.1007/BF01185209|s2cid=18108051 }}</ref>
 
==Definitions==
A finite family <math>\mathcal F</math> of functions <math>h\colon M \to S</math> is defined to be an ''LSH family''<ref name=MOMD /><ref name=GIM1999>{{cite journal
| author1 = Gionis, A.
| author2-link = Piotr Indyk | author2 = Indyk, P. | author3-link = Rajeev Motwani | author3 = Motwani, R.
| title-link = Symposium on Theory of Computing
}}</ref>
for
* a [[metric space]] <math>\mathcal M =(M, d)</math>,
* a threshold <math>r>0</math>,
* an approximation factor <math>c>1</math>,
* and probabilities <math>p_1 > p_2</math>.
 
if it satisfies the following condition. For any two points <math>a, b \in M</math> and a hash function <math>h</math> chosen uniformly at random from <math>\mathcal F</math>:
* If <math>d(a,b) \le r</math>, then <math>h(a)=h(b)</math> (i.e., {{mvar|a}} and {{mvar|b}} collide) with probability at least <math>p_1</math>,
* If <math>d(a,b) \ge cr</math>, then <math>h(a)=h(b)</math> with probability at most <math>p_2</math>.
 
Such a family <math>\mathcal F</math> is called ''<math>(r,cr,p_1,p_2)</math>-sensitive''.
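
For example, bit sampling, which hashes a binary vector to the value of a single uniformly random coordinate, is an LSH family for [[Hamming distance]]: two vectors at Hamming distance <math>k</math> in dimension <math>d</math> collide with probability <math>1-k/d</math>, so the family is <math>(r,cr,1-r/d,1-cr/d)</math>-sensitive. A minimal illustrative Python sketch (function names are not from the cited sources):

<syntaxhighlight lang="python">
import random

def sample_bit_hash(dim):
    """Return h(x) = x[i] for a uniformly random coordinate i.

    For binary vectors under Hamming distance, Pr[h(a) = h(b)] = 1 - d(a, b)/dim,
    so the family is (r, cr, 1 - r/dim, 1 - cr/dim)-sensitive.
    """
    i = random.randrange(dim)
    return lambda x: x[i]

# Two 8-bit vectors at Hamming distance 1 collide with probability 7/8
# over the random choice of h.
a = [0, 1, 1, 0, 1, 0, 0, 1]
b = [0, 1, 1, 0, 1, 0, 1, 1]
h = sample_bit_hash(len(a))
print(h(a) == h(b))
</syntaxhighlight>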
 
=== LSH with respect to a similarity measure ===
Alternatively<ref name=Charikar2002>{{cite conference
| last = Charikar|first= Moses S.|author-link=Moses Charikar
| pages = 380–388
| doi = 10.1145/509907.509965
|isbn= 1-58113-495-9| citeseerx = 10.1.1.147.4064}}</ref> it is possible to define an LSH family on a universe of items {{mvar|U}} endowed with a [[String metric|similarity]] function <math>\phi\colon U \times U \to [0,1]</math>. In this setting, an LSH scheme is a family of [[hash function]]s {{mvar|H}} coupled with a [[probability distribution]] {{mvar|D}} over {{mvar|H}} such that a function <math>h \in H</math> chosen according to {{mvar|D}} satisfies <math>Pr[h(a) = h(b)] = \phi(a,b)</math> for each <math>a,b \in U</math>.
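
[[MinHash]] is a well-known example of such a scheme: under an idealized random permutation of the element universe, the probability that two sets receive the same minimum-ranked element equals their [[Jaccard index|Jaccard similarity]]. A minimal illustrative Python sketch, in which a salted built-in hash stands in for a random permutation (names are not from the cited source):

<syntaxhighlight lang="python">
import random

def make_minhash(seed):
    """Return h(S) = the element of S that minimizes a salted hash.

    Under an idealized random ordering of the universe, two sets a and b
    satisfy Pr[h(a) = h(b)] = |a ∩ b| / |a ∪ b| (their Jaccard similarity);
    the salted built-in hash only approximates such an ordering.
    """
    salt = random.Random(seed).random()
    return lambda s: min(s, key=lambda x: hash((salt, x)))

a = {"cat", "dog", "fish"}
b = {"cat", "dog", "bird"}
h = make_minhash(seed=42)
print(h(a) == h(b))  # True with probability about |a ∩ b| / |a ∪ b| = 1/2 over the choice of h
</syntaxhighlight>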
 
===Locality-preserving hashing===
A '''locality-preserving hash''' is a [[hash function]] {{mvar|f}} that maps points in a [[metric space]] <math>\mathcal M =(M, d)</math> to a scalar value such that
:<math>d(p,q) < d(q,r) \Rightarrow |f(p) - f(q)| < |f(q) - f(r)|</math>
 
for any three points <math>p,q,r \in M</math>.{{citation needed|date=November 2022|reason=unusual definition}}
 
In other words, these are hash functions where the relative distance between the input values is preserved in the relative distance between the output hash values; input values that are closer to each other will produce output hash values that are closer to each other.
 
This is in contrast to [[cryptography|cryptographic]] hash functions and [[checksum]]s, which are designed to have [[Avalanche effect|random output difference between adjacent inputs]].
 
The first family of locality-preserving hash functions was devised as a way to facilitate [[Pipeline (computing)|data pipelining]] in implementations of [[Parallel RAM|parallel random-access machine (PRAM)]] algorithms that use [[universal hashing]] to reduce memory [[Resource contention|contention]] and [[network congestion]].<ref name=Chin1991 /><ref name=Chin1994 />
 
Locality-preserving hashes are related to [[space-filling curve]]s.{{how|date=March 2020}}
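
A construction in this spirit is the Z-order ([[Z-order curve|Morton]]) code, which interleaves the bits of the coordinates so that points that are close in the plane usually receive close scalar codes, although the three-point inequality above does not hold for every triple. A minimal illustrative Python sketch (function name not from a cited source):

<syntaxhighlight lang="python">
def morton_code(x, y, bits=16):
    """Interleave the bits of two non-negative integer coordinates (Z-order).

    Nearby points in the plane tend to receive nearby codes, which is the
    sense in which space-filling curves approximate locality preservation.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # even bit positions take x
        code |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions take y
    return code

print(morton_code(3, 5))  # 39
print(morton_code(4, 5))  # 50
</syntaxhighlight>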
 
===Amplification===
| author = Das, Abhinandan S.
| title = Proceedings of the 16th international conference on World Wide Web
| chapter = Google news personalization: scalable online collaborative filtering
| date = 2007
| pages = 271–280
|doi=10.1145/1242572.1242610|display-authors=etal| isbn = 9781595936547
|title=Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
|chapter=Twister Tries
|date=2015
| pages=505–517
| doi=10.1145/2723372.2751521| isbn = 9781450327589
**[[VisualRank]]
*[[Gene expression]] similarity identification{{citation needed|date=October 2013}}
*[[Audio similarity]] identification
*[[Nearest neighbor search]]
*[[Audio fingerprint]]<ref>
| url = https://github.com/worldveil/dejavu| date = 2018-12-19}}
</ref>
*[[Digital video fingerprinting]]<ref>
{{citation
| title = A Simple Introduction to Locality Sensitive Hashing (LSH)
| url =https://www.iunera.com/kraken/fabric/locality-sensitive-hashing-lsh/#7-applications-of-lsh| date = 2025-03-27}}
</ref>
*[[Shared memory]] organization in [[parallel computing]]<ref name=Chin1991 /><ref name=Chin1994 />
*Physical data organization in database management systems<ref>
{{citation
{{cite conference
| author1 = Oliver, Jonathan| author2 = Cheng, Chun | author3 = Chen, Yanggui
| title = 2013 Fourth Cybercrime and Trustworthy Computing Workshop | chapter = TLSH -- A Locality Sensitive Hash | year = 2013
| pages = 7–13 | url = https://www.academia.edu/7833902
| doi = 10.1109/CTC.2013.9
| isbn = 978-1-4799-3076-0
}}
</ref>
*[[Machine Learning]]<ref name="NL">
{{citation
| author1 = Fanaee-T, Hadi
| title = Natural Learning
| year = 2024
| arxiv = 2404.05903
}}
</ref>
 
===Random projection===
{{main|Random projection}}
[[File:Cosine-distance.png| thumb | <math>\frac{\theta(u,v)}{\pi}</math> is approximately proportional to <math>1-\cos(\theta(u,v))</math> on the interval [0, <math>\pi</math>]]]
 
The random projection method of LSH due to [[Moses Charikar]]<ref name=Charikar2002 /> called [[SimHash]] (also sometimes called arccos<ref name=Andoni2008>{{cite journal
The basic idea of this technique is to choose a random [[hyperplane]] (defined by a normal unit vector {{mvar|r}}) at the outset and use the hyperplane to hash input vectors.
 
Given an input vector {{mvar|v}} and a hyperplane defined by {{mvar|r}}, we let <math>h(v) = \sgn(v \cdot r)</math>. That is, <math>h(v) = \pm 1</math> depending on which side of the hyperplane {{mvar|v}} lies. This way, each possible choice of a random hyperplane {{mvar|r}} can be interpreted as a hash function <math>h(v)</math>.
 
For two vectors {{mvar|u,v}} with angle <math>\theta(u,v)</math> between them, it can be shown that
 
:<math>Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}.</math>
 
Since the ratio between <math>\frac{\theta(u,v)}{\pi}</math> and <math>1-\cos(\theta(u,v))</math> is at least 0.439 when <math>\theta(u, v) \in [0, \pi]</math>,<ref name=Charikar2002 /><ref name="Goemans Williamson 1995 pp. 1115–1145">{{cite journal | last1=Goemans | first1=Michel X. | last2=Williamson | first2=David P. | title=Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming | journal=Journal of the ACM | publisher=Association for Computing Machinery (ACM) | volume=42 | issue=6 | year=1995 | issn=0004-5411 | doi=10.1145/227683.227684 | pages=1115–1145| s2cid=15794408 | doi-access=free }}</ref> the probability of the two vectors being on different sides of the random hyperplane is approximately proportional to the [[Cosine similarity#Cosine Distance|cosine distance]] between them.
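
A minimal illustrative Python sketch of this hash family, with an empirical check of the collision probability <math>1 - \theta(u,v)/\pi</math> (function and variable names are not from the cited sources):

<syntaxhighlight lang="python">
import math
import random

def random_hyperplane_hash(dim):
    """Return h(v) = sign(v · r) for a random Gaussian normal vector r.

    For vectors u, v with angle theta between them,
    Pr[h(u) = h(v)] = 1 - theta / pi over the choice of r.
    """
    r = [random.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda v: 1 if sum(vi * ri for vi, ri in zip(v, r)) >= 0 else -1

u, v = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]   # angle(u, v) = pi/4
trials = 10_000
hits = 0
for _ in range(trials):
    h = random_hyperplane_hash(3)
    hits += h(u) == h(v)
print(hits / trials, 1 - (math.pi / 4) / math.pi)  # both approximately 0.75
</syntaxhighlight>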
 
===Stable distributions===
| url = http://hal.inria.fr/inria-00567191/en/
| journal = Pattern Recognition Letters
| volume = 31 | issue = 11 | pages = 1348–1358 | doi = 10.1016/j.patrec.2010.04.004 | bibcode = 2010PaReL..31.1348P | s2cid = 2666044 }}</ref>
In particular, k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.
 
===Semantic hashing===
Semantic hashing is a technique that attempts to map input items to addresses such that closer inputs have higher [[semantic similarity]].<ref>{{Cite journal|last1=Salakhutdinov|first1=Ruslan|last2=Hinton|first2=Geoffrey|date=2008|title=Semantic hashing|journal=International Journal of Approximate Reasoning|language=en|volume=50|issue=7|pages=969–978|doi=10.1016/j.ijar.2008.11.006|doi-access=free}}</ref> The hashcodes are found via training of an [[artificial neural network]] or [[graphical model]].{{citation needed|date=September 2021}}
 
==Algorithm for nearest neighbor search==
In the preprocessing step we hash all {{mvar|n}} {{mvar|d}}-dimensional points from the data set {{mvar|S}} into each of the {{mvar|L}} hash tables. Given that the resulting hash tables have only {{mvar|n}} non-zero entries, one can reduce the amount of memory used per hash table to <math>O(n)</math> using standard [[hash functions]].
 
Given a query point {{mvar|q}}, the algorithm iterates over the {{mvar|L}} hash functions {{mvar|g}}. For each {{mvar|g}} considered, it retrieves the data points that are hashed into the same bucket as {{mvar|q}}. The process is stopped as soon as a point within distance {{mvar|cR}} from {{mvar|q}} is found.
 
Given the parameters {{mvar|k}} and {{mvar|L}}, the algorithm has the following performance guarantees:
* space: <math>O(nL)</math>, plus the space for storing data points;
* query time: <math>O(L(kt+dnP_2^k))</math>;
* the algorithm succeeds in finding a point within distance {{mvar|cR}} from {{mvar|q}} (if there exists a point within distance {{mvar|R}}) with probability at least <math>1 - ( 1 - P_1^k ) ^ L</math>;
 
For a fixed approximation ratio <math>c=1+\epsilon</math> and probabilities <math>P_1</math> and <math>P_2</math>, one can set <math>k = \left\lceil{\tfrac{\log n}{\log 1/P_2}}\right\rceil</math> and <math>L = \lceil P_1^{-k}\rceil = O(n^{\rho}P_1^{-1})</math>, where <math>\rho={\tfrac{\log P_1}{\log P_2}}</math>. Then one obtains the following performance guarantees:
* preprocessing time: <math>O(n^{1+\rho}P_1^{-1}kt)</math>;
* space: <math>O(n^{1+\rho}P_1^{-1})</math>, plus the space for storing data points;
* query time: <math>O(n^{\rho}P_1^{-1}(kt+d))</math>;
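
A minimal self-contained Python sketch of this construction, using bit sampling for Hamming distance as the underlying family <math>\mathcal F</math> (class, method, and parameter names are illustrative and not from the cited sources):

<syntaxhighlight lang="python">
import random
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by a concatenation g of k elementary hashes."""

    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        # g_j = (h_{j,1}, ..., h_{j,k}): k randomly sampled coordinates per table.
        self.projections = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, j, point):
        return tuple(point[i] for i in self.projections[j])

    def index(self, points):
        # Preprocessing: hash every point into each of the L hash tables.
        for p in points:
            for j, table in enumerate(self.tables):
                table[self._key(j, p)].append(p)

    def query(self, q, radius):
        # Scan the bucket of q in each table; stop at the first point within
        # the target distance (cR in the analysis above).
        for j, table in enumerate(self.tables):
            for p in table.get(self._key(j, q), []):
                if sum(a != b for a, b in zip(p, q)) <= radius:
                    return p
        return None

rng = random.Random(1)
points = [tuple(rng.randrange(2) for _ in range(32)) for _ in range(1000)]
index = LSHIndex(dim=32, k=8, L=10)
index.index(points)
print(index.query(points[0], radius=4))  # finds points[0] itself (distance 0)
</syntaxhighlight>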
 
===Finding the nearest neighbor without a fixed radius===
 
To generalize the above algorithm without requiring the radius {{mvar|R}} to be fixed, we can perform a binary search over {{mvar|R}}. It has been shown<ref>{{cite journal |last1=Har-Peled |first1=Sariel |last2=Indyk |first2=Piotr |last3=Motwani |first3=Rajeev |title=Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality |journal=Theory of Computing |date=2012 |volume=8 |issue=Special Issue in Honor of Rajeev Motwani |pages=321–350 |doi=10.4086/toc.2012.v008a014 |url=https://theoryofcomputing.org/articles/v008a014/v008a014.pdf |access-date=23 May 2025}}</ref> that there is a data structure for the approximate nearest neighbor problem with the following performance guarantees:
* space: <math>O(n^{1+\rho}P_1^{-1}d\log^2 n)</math>;
* query time: <math>O(n^{\rho}P_1^{-1}(kt+d)\log n)</math>;
* the algorithm succeeds in finding the nearest neighbor with probability at least <math>1 - (( 1 - P_1^k ) ^ L\log n)</math>;
 
===Improvements===
It is also sometimes the case that the factor <math>1/P_1</math> can be very large.
This happens for example with [[Jaccard similarity]] data, where even the most similar neighbor often has a quite low Jaccard similarity with the query.
It has been shown<ref>Ahle, Thomas Dybdahl. "On the Problem of <math>p_1^{-1}</math> in Locality-Sensitive Hashing". International Conference on Similarity Search and Applications. Springer, Cham, 2020.</ref> how to reduce the query time to <math>O(n^\rho/P_1^{1-\rho})</math> (not including hashing costs) and similarly the space usage.
 
==See also==
 
* {{Annotated link |Bloom filter}}
* {{Annotated link |Curse of dimensionality}}
* {{Annotated link |Feature hashing}}
* {{Annotated link |Fourier-related transforms}}
* {{Annotated link |Geohash}}
* {{Annotated link |Multilinear subspace learning}}
* {{Annotated link |Principal component analysis}}
* [[Random indexing]]<ref>Gorman, James, and James R. Curran. [https://aclanthology.org/P06-1046.pdf "Scaling distributional similarity to large corpora."] Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.</ref>
* {{Annotated link |Rolling hash}}
* {{Annotated link |Singular value decomposition}}
* {{Annotated link |Sparse distributed memory}}
* {{Annotated link |Wavelet compression}}
* {{Annotated link |Locality of reference}}
 
==References==
 
*Samet, H. (2006) ''Foundations of Multidimensional and Metric Data Structures''. Morgan Kaufmann. {{ISBN|0-12-369446-9}}
 
*{{Cite conference| last1 = Indyk | first1 = Piotr | author1-link= Piotr Indyk| last2 = Motwani | first2 = Rajeev | author2-link = Rajeev Motwani| last3 = Raghavan | first3 = Prabhakar| last4 = Vempala | first4 = Santosh | author4-link = Santosh Vempala| title = Locality-preserving hashing in multidimensional spaces | doi = 10.1145/258533.258656 | book-title = Proceedings of the twenty-ninth annual ACM symposium on Theory of computing| conference = [[Symposium on Theory of Computing|STOC '97]]| pages = 618–625| year = 1997 | isbn = 978-0-89791-888-6| citeseerx = 10.1.1.50.4927| s2cid = 15693787 }}
*{{Cite journal | last1 = Chin | first1 = Andrew| title = Locality-preserving hash functions for general purpose parallel computation | url = http://www.unclaw.com/chin/scholarship/hashfunctions.pdf| doi = 10.1007/BF01185209 | journal = [[Algorithmica]]| volume = 12 | issue = 2–3 | pages = 170–181 | year = 1994 | s2cid = 18108051}}
==External links==
* [http://web.mit.edu/andoni/www/LSH/index.html Alex Andoni's LSH homepage]
* [https://lshkit.sourceforge.net/ LSHKIT: A C++ Locality Sensitive Hashing Library]
* [https://github.com/simonemainardi/LSHash A Python Locality Sensitive Hashing library that optionally supports persistence via redis]
* [https://web.archive.org/web/20101203074412/http://www.vision.caltech.edu/malaa/software/research/image-search/ Caltech Large Scale Image Search Toolbox]: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.