Locality-sensitive hashing
{{Short description|Algorithmic technique using hashing}}
In [[computer science]], '''locality-sensitive hashing''' ('''LSH''') is a [[fuzzy hashing]] technique that hashes similar input items into the same "buckets" with high probability.<ref name="MOMD">{{cite web|url=http://infolab.stanford.edu/~ullman/mmds.html|title=Mining of Massive Datasets, Ch. 3.|last1=Rajaraman|first1=A.|last2=Ullman|first2=J.|author2-link=Jeffrey Ullman|year=2010}}</ref> (The number of buckets is much smaller than the universe of possible input items.)<ref name="MOMD" /> Since similar items end up in the same buckets, this technique can be used for [[Cluster analysis|data clustering]] and [[nearest neighbor search]]. It differs from [[Hash function|conventional hashing techniques]] in that [[hash collision]]s are maximized, not minimized. Alternatively, the technique can be seen as a way to [[dimension reduction|reduce the dimensionality]] of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
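
For illustration, the following Python sketch contrasts a locality-sensitive bucketing with a conventional hash on one-dimensional data. The quantization function and the parameters <code>w</code> and <code>b</code> are arbitrary example choices, not taken from the sources above.

<syntaxhighlight lang="python">
from collections import defaultdict
import hashlib

# Illustrative 1-D locality-sensitive function: nearby values usually
# fall into the same cell of a (shifted) grid of width w.
def lsh_bucket(x, w=1.0, b=0.25):          # w, b: arbitrary example values
    return int((x + b) // w)

# A conventional hash function scatters even near-identical inputs.
def conventional_bucket(x, n_buckets=8):
    return int(hashlib.sha256(repr(x).encode()).hexdigest(), 16) % n_buckets

points = [0.10, 0.12, 0.95, 3.50, 3.52]
lsh_table, conv_table = defaultdict(list), defaultdict(list)
for p in points:
    lsh_table[lsh_bucket(p)].append(p)
    conv_table[conventional_bucket(p)].append(p)

print(dict(lsh_table))   # nearby pairs (0.10, 0.12) and (3.50, 3.52) share buckets
print(dict(conv_table))  # the conventional hash typically separates them
</syntaxhighlight>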
 
Hashing-based approximate [[nearest-neighbor search]] algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).<ref>{{cite conference |last1=Zhao |first1=Kang |last2=Lu |first2=Hongtao |last3=Mei |first3=Jincheng |title=Locality Preserving Hashing |conference=AAAI Conference on Artificial Intelligence | volume=28 | year=2014 |url=https://ojs.aaai.org/index.php/AAAI/article/view/9133/8992 |pages=2874–2880}}</ref><ref>{{cite book |last1=Tsai |first1=Yi-Hsuan |last2=Yang |first2=Ming-Hsuan |title=2014 IEEE International Conference on Image Processing (ICIP) |chapter=Locality preserving hashing |date=October 2014 |pages=2988–2992 |doi=10.1109/ICIP.2014.7025604 |isbn=978-1-4799-5751-4 |s2cid=8024458 |issn=1522-4880}}</ref>
 
==Definitions==
A finite family <math> \mathcal F</math> of functions <math>h\colon M \to S</math> is defined to be an ''LSH family''<ref name=MOMD /><ref name=GIM1999>{{cite journal
| author1 = Gionis, A.
| author2-link = Piotr Indyk | author2 = Indyk, P. | author3-link = Rajeev Motwani | author3 = Motwani, R.
 | title = Similarity search in high dimensions via hashing
 | journal = Proceedings of the 25th Very Large Database (VLDB) Conference
 | year = 1999
}}</ref> for a [[metric space]] <math>\mathcal M = (M, d)</math>, a threshold <math>r>0</math>, an approximation factor <math>c>1</math>, and probabilities <math>p_1</math> and <math>p_2</math>,
if it satisfies the following condition. For any two points <math>a, b \in M</math> and a hash function <math>h</math> chosen uniformly at random from <math>\mathcal F</math>:
* If <math>d(a,b) \le r</math>, then <math>h(a)=h(b)</math> (i.e., {{mvar|a}} and {{mvar|b}} collide) with probability at least <math>p_1</math>,
* If <math>d(a,b) \ge cr</math>, then <math>h(a)=h(b)</math> with probability at most <math>p_2</math>.
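
As a concrete, illustrative example of such a family (not part of the definition above), consider hashing real numbers by a randomly shifted quantization <math>h_b(x) = \lfloor (x+b)/w \rfloor</math> with <math>b</math> uniform in <math>[0, w)</math>. Two points at distance <math>d < w</math> collide with probability exactly <math>1 - d/w</math>, so with <math>w = 4</math>, <math>r = 1</math> and <math>c = 3</math> the condition above holds with <math>p_1 = 0.75</math> and <math>p_2 = 0.25</math>. The following sketch checks this numerically; all parameter values are arbitrary.

<syntaxhighlight lang="python">
import random

w, r, c = 4.0, 1.0, 3.0        # illustrative parameters
TRIALS = 100_000

def collision_rate(d):
    """Estimate Pr[h_b(0) == h_b(d)] for h_b(x) = floor((x + b) / w)."""
    hits = 0
    for _ in range(TRIALS):
        b = random.uniform(0.0, w)
        hits += int((0.0 + b) // w == (d + b) // w)
    return hits / TRIALS

print(collision_rate(r))       # ~0.75: pairs at distance r collide often (p1)
print(collision_rate(c * r))   # ~0.25: pairs at distance cr collide rarely (p2)
</syntaxhighlight>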
 
*[[Acoustic fingerprint|Audio fingerprint]]ing<ref>
{{citation
 | title = Dejavu: audio fingerprinting and recognition in Python
 | url = https://github.com/worldveil/dejavu | date = 2018-12-19}}
</ref>
*[[Digital video fingerprinting]]<ref>
{{citation
| title = A Simple Introduction to Locality Sensitive Hashing (LSH)
| url =https://www.iunera.com/kraken/fabric/locality-sensitive-hashing-lsh/#7-applications-of-lsh| date = 2025-03-27}}
</ref>
*[[Shared memory]] organization in [[parallel computing]]<ref name=Chin1991 /><ref name=Chin1994 />
*Physical data organization in database management systems
*[[Computer security]]<ref>
{{cite book
 | author1 = Oliver, Jonathan | author2 = Cheng, Chun | author3 = Chen, Yanggui
 | title = 2013 Fourth Cybercrime and Trustworthy Computing Workshop
 | chapter = TLSH -- A Locality Sensitive Hash
 | year = 2013
 | pages = 7–13
 | url = https://www.academia.edu/7833902
 | doi = 10.1109/CTC.2013.9
 | isbn = 978-1-4799-3076-0
}}
</ref>
*[[Machine Learning]]<ref name="NL">
{{citation
 | author1 = Fanaee-T, Hadi
 | title = Natural Learning
 | year = 2024
 | arxiv = 2404.05903
}}
</ref>
 
===Random projection===
{{main|Random projection}}
[[File:Cosine-distance.png| thumb | <math>\frac{\theta(u,v)}{\pi}</math> is approximately proportional to <math>1-\cos(\theta(u,v))</math> on the interval [0, <math>\pi</math>]]]
 
The random projection method of LSH due to [[Moses Charikar]]<ref name=Charikar2002 />, called [[SimHash]] (also sometimes called arccos<ref name=Andoni2008>{{cite journal
 | last1 = Andoni | first1 = Alexandr
 | author2-link = Piotr Indyk | last2 = Indyk | first2 = Piotr
 | title = Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
 | journal = Communications of the ACM
 | volume = 51 | issue = 1 | year = 2008 | pages = 117–122
 | doi = 10.1145/1327452.1327494
}}</ref>), uses a random hyperplane to hash input vectors: the hyperplane's normal <math>r</math> is drawn from a rotationally invariant distribution (for example, the multivariate Gaussian), and each input vector <math>u</math> receives the one-bit hash <math>h_r(u) = \sgn(r \cdot u)</math>. For two vectors <math>u</math> and <math>v</math> with angle <math>\theta(u,v)</math> between them,

:<math>Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}.</math>
 
Since the ratio between <math>\frac{\theta(u,v)}{\pi}</math> and <math>1-\cos(\theta(u,v))</math> is at least 0.87856439 when <math>\theta(u, v) \in [0, \pi]</math>,<ref name=Charikar2002 /><ref name="Goemans Williamson 1995 pp. 1115–1145">{{cite journal | last1=Goemans | first1=Michel X. | last2=Williamson | first2=David P. | title=Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming | journal=Journal of the ACM | publisher=Association for Computing Machinery (ACM) | volume=42 | issue=6 | year=1995 | issn=0004-5411 | doi=10.1145/227683.227684 | pages=1115–1145 | s2cid=15794408 | doi-access=free }}</ref> the probability of two vectors being on different sides of the random hyperplane is approximately proportional to the [[Cosine similarity#Cosine Distance|cosine distance]] between them.
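
The following sketch implements this hyperplane hashing and uses the collision probability above to estimate the angle between two vectors. It is a minimal illustration with arbitrary dimension and bit-count parameters, not a production implementation.

<syntaxhighlight lang="python">
import math
import random

def simhash(u, planes):
    """One sign bit per random hyperplane: h_r(u) = sgn(r . u)."""
    return [int(sum(ri * ui for ri, ui in zip(r, u)) >= 0) for r in planes]

dim, k = 8, 256                # illustrative dimension and number of bits
random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

u = [random.gauss(0, 1) for _ in range(dim)]
v = [ui + 0.25 * random.gauss(0, 1) for ui in u]   # small perturbation of u

agree = sum(a == b for a, b in zip(simhash(u, planes), simhash(v, planes))) / k

# Each bit collides with probability 1 - theta/pi, so the disagreement
# fraction is an estimator of theta(u, v).
theta_est = math.pi * (1 - agree)
cos_uv = (sum(a * b for a, b in zip(u, v))
          / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v)))
print(theta_est, math.acos(cos_uv))    # the two angle values should be close
</syntaxhighlight>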
 
===Stable distributions===
* space: <math>O(n^{1+\rho}P_1^{-1})</math>, plus the space for storing data points;
* query time: <math>O(n^{\rho}P_1^{-1}(kt+d))</math>;
 
===Finding nearest neighbor without fixed dimensionality===
 
To generalize the above algorithm to the case where the radius {{mvar|R}} is not fixed, one can perform a binary search over {{mvar|R}}. It has been shown<ref>{{cite journal |last1=Har-Peled |first1=Sariel |last2=Indyk |first2=Piotr |last3=Motwani |first3=Rajeev |title=Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality |journal=Theory of Computing |date=2012 |volume=8 |issue=Special Issue in Honor of Rajeev Motwani |pages=321–350 |doi=10.4086/toc.2012.v008a014 |url=https://theoryofcomputing.org/articles/v008a014/v008a014.pdf |access-date=23 May 2025}}</ref> that there is a data structure for the approximate nearest neighbor problem with the following performance guarantees (a sketch of the underlying table construction appears after the list):
* space: <math>O(n^{1+\rho}P_1^{-1}d\log^2 n)</math>;
* query time: <math>O(n^{\rho}P_1^{-1}(kt+d)\log n)</math>;
* the algorithm succeeds in finding the nearest neighbor with probability at least <math>1 - (( 1 - P_1^k ) ^ L\log n)</math>;
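
The following minimal sketch illustrates the table construction these guarantees refer to, using the one-dimensional quantization family purely for concreteness; the values of <code>w</code>, <code>k</code> and <code>L</code> are arbitrary example choices.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

w, k, L = 2.0, 4, 8            # illustrative parameters, not tuned
random.seed(1)
offsets = [[random.uniform(0, w) for _ in range(k)] for _ in range(L)]

def g(x, t):
    # composite key: concatenation of k elementary quantization hashes
    return tuple(int((x + b) // w) for b in offsets[t])

points = [0.3, 0.4, 2.9, 7.7, 8.1]
tables = [defaultdict(list) for _ in range(L)]
for x in points:
    for t in range(L):
        tables[t][g(x, t)].append(x)

def query(q):
    # candidates come only from the L buckets that q itself hashes to;
    # a near neighbour is missed with probability about (1 - P1^k)^L
    candidates = {x for t in range(L) for x in tables[t][g(q, t)]}
    return min(candidates, key=lambda x: abs(x - q), default=None)

print(query(0.35))   # 0.3 or 0.4 (equidistant)
print(query(8.0))    # 8.1, with high probability
</syntaxhighlight>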
 
===Improvements===
* {{Annotated link |Sparse distributed memory}}
* {{Annotated link |Wavelet compression}}
* {{Annotated link |Locality of reference}}
 
==References==
==External links==
* [http://web.mit.edu/andoni/www/LSH/index.html Alex Andoni's LSH homepage]
* [https://lshkit.sourceforge.net/ LSHKIT: A C++ Locality Sensitive Hashing Library]
* [https://github.com/simonemainardi/LSHash A Python Locality Sensitive Hashing library that optionally supports persistence via redis]
* [https://web.archive.org/web/20101203074412/http://www.vision.caltech.edu/malaa/software/research/image-search/ Caltech Large Scale Image Search Toolbox]: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.