Locality-sensitive hashing: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 16:47, 12 May 2023 edit Skyerise (talk \| contribs) Extended confirmed users, Pending changes reviewers, Rollbackers 149,364 edits →Random projection: WP:ALLCAPS ← Previous edit		Latest revision as of 21:56, 9 August 2025 edit undo Bender the Bot (talk \| contribs) Bots 1,064,377 edits m →External links: HTTP to HTTPS for SourceForge Tag: AWB
(39 intermediate revisions by 24 users not shown)
Line 1: {{Short description\|Algorithmic technique using hashing}} In [[computer science]], '''locality-sensitive hashing''' ('''LSH''') is ana ~~algorithmic~~[[fuzzy hashing]] technique that hashes similar input items into the same "buckets" with high probability.<ref name="MOMD">{{cite web\|url=http://infolab.stanford.edu/~ullman/mmds.html\|title=Mining of Massive Datasets, Ch. 3.\|last1=Rajaraman\|first1=A.\|last2=Ullman\|first2=J.\|author2-link=Jeffrey Ullman\|year=2010}}</ref> (The number of buckets is much smaller than the universe of possible input items.)<ref name="MOMD" /> Since similar items end up in the same buckets, this technique can be used for [[Cluster analysis\|data clustering]] and [[nearest neighbor search]]. It differs from [[Hash function\|conventional hashing techniques]] in that [[hash collision]]s are maximized, not minimized. Alternatively, the technique can be seen as a way to [[dimension reduction\|reduce the dimensionality]] of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. Hashing-based approximate [[nearest -neighbor search]] algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as [[locality-preserving hashing]] (LPH).<ref>{{cite conference \|last1=Zhao \|first1=Kang \|last2=Lu \|first2=Hongtao \|last3=Mei \|first3=Jincheng \|title=Locality Preserving Hashing \|conference=AAAI Conference on Artificial Intelligence \| volume=28 \| year=2014 \|url=https://ojs.aaai.org/index.php/AAAI/article/view/9133/8992 \|pages=2874–2880}}</ref><ref>{{cite book \|last1=Tsai \|first1=Yi-Hsuan \|last2=Yang \|first2=Ming-Hsuan \|title=2014 IEEE International Conference on Image Processing (ICIP) \|chapter=Locality preserving hashing \|date=October 2014 \|pages=2988–2992 \|doi=10.1109/ICIP.2014.7025604 \|isbn=978-1-4799-5751-4 \|s2cid=8024458 \|issn=1522-4880}}</ref> ~~The first family of locality~~Locality-preserving ~~hash functions~~hashing was initially devised as a way to facilitate [[Pipeline (computing)\|data pipelining]] in implementations of [[Parallel RAM\|~~parallel~~massively ~~random-access machine (PRAM)~~parallel]] algorithms that use [[Routing#Path_selection\|randomized routing]] and [[universal hashing]] to reduce memory [[~~Resource_contention~~Resource contention\|contention]] and [[network congestion]].<ref name=Chin1991>{{cite thesis \|last=Chin \|first=Andrew \|date=1991 \|title=Complexity Issues in General Purpose Parallel Computing \|pages=87–95 \|type=DPhil \|publisher=University of Oxford \|url=https://perma.cc/E47H-WCVP}}</ref><ref name=Chin1994>{{cite journal \|last1=Chin \|first1=Andrew \|date=1994 \|title=Locality-Preserving Hash Functions for General Purpose Parallel Computation \|url=http://unclaw.com/chin/scholarship/hashfunctions.pdf \|journal=Algorithmica \|volume=12 \|issue=2–3 \|pages=170–181 \|doi=10.1007/BF01185209\|s2cid=18108051 }}</ref>▼ ==Definitions== AnA finite family <math> \mathcal F</math> of functions <math>h\colon M \to S</math> is defined to be an ''LSH family''<ref name=MOMD /><ref name=GIM1999>{{cite journal \| author1 = Gionis, A. \| author2-link = Piotr Indyk \| author2 = Indyk, P. \| author3-link = Rajeev Motwani \| author3 = Motwani, R. Line 21 ⟶ 23: \| title-link = Symposium on Theory of Computing }}</ref> for ~~<math>\mathcal F</math> is defined for~~ * a [[metric space]] <math>\mathcal M =(M, d)</math>, * a threshold <math>Rr>0</math>, * an approximation factor <math>c>1</math>, * and probabilities <math>~~P_1</math~~p_1 > ~~and <math>P_2~~p_2</math>. ~~This~~if ~~family~~it ~~<math>\mathcal F</math> is a set of functions <math>h\colon M \to S</math> that map elements of~~satisfies the ~~metric~~following ~~space to buckets <math>s \in S</math>~~condition. ~~An LSH family must satisfy the following conditions for~~For any two points <math>pa, qb \in M</math> and ~~any~~a hash function <math>h</math> chosen uniformly at random from <math>\mathcal F</math>: * ifIf <math>d(pa,qb) \le Rr</math>, then <math>h(pa)=h(qb)</math> (i.e., {{mvar\|pa}} and {{mvar\|qb}} collide) with probability at least <math>~~P_1~~p_1</math>, * ifIf <math>d(pa,qb) \ge cRcr</math>, then <math>h(pa)=h(qb)</math> with probability at most <math>~~P_2~~p_2</math>. ~~A family is interesting when <math>P_1>P_2</math>.~~ Such a family <math>\mathcal F</math> is called ''<math>(Rr,cRcr,~~P_1~~p_1,~~P_2~~p_2)</math>-sensitive''. === LSH with respect to a similarity measure === Alternatively<ref name=Charikar2002>{{cite conference \| last = Charikar\|first= Moses S.\|author-link=Moses Charikar Line 40 ⟶ 43: \| pages = 380–388 \| doi = 10.1145/509907.509965 \|isbn= 1-58113-495-9\| citeseerx = 10.1.1.147.4064}}</ref> it is ~~defined~~possible ~~with~~to ~~respect~~define toan LSH family on a universe of items {{mvar\|U}} ~~that~~endowed ~~have~~with a ~~[[String metric\|~~similarity]] function <math>\phi\colon U \times U \to [0,1]</math>. AnIn this setting, a LSH scheme is a family of [[hash function]]s {{mvar\|H}} coupled with a [[probability distribution]] {{mvar\|D}} over ~~the functions~~{{mvar\|H}} such that a function <math>h \in H</math> chosen according to {{mvar\|D}} satisfies ~~the property that~~ <math>~~Pr_{h \in H}~~Pr [h(a) = h(b)] = \phi(a,b)</math> for ~~any~~each <math>a,b \in U</math>. ~~===Locality-preserving hashing===~~ ~~A '''locality-preserving hash''' is a [[hash function]] {{mvar\|f}} that maps points in a [[metric space]] <math>\mathcal M =(M, d)</math> to a scalar value such that~~ ~~:<math>d(p,q) < d(q,r) \Rightarrow \|f(p) - f(q)\| < \|f(q) - f(r)\|</math>~~ ~~for any three points <math>p,q,r \in M</math>.{{citation needed\|date=November 2022\|reason=unusual definition}}~~ In other words, these are hash functions where the relative distance between the input values is preserved in the relative distance between the output hash values; input values that are closer to each other will produce output hash values that are closer to each other. ~~This is in contrast to [[cryptography\|cryptographic]] hash functions and [[checksum]]s, which are designed to have [[Avalanche effect\|random output difference between adjacent inputs]].~~ ▲The first family of locality-preserving hash functions was devised as a way to facilitate [[Pipeline (computing)\|data pipelining]] in implementations of [[Parallel RAM\|parallel random-access machine (PRAM)]] algorithms that use [[universal hashing]] to reduce memory [[Resource_contention\|contention]] and [[network congestion]].<ref name=Chin1991>{{cite thesis \|last=Chin \|first=Andrew \|date=1991 \|title=Complexity Issues in General Purpose Parallel Computing \|pages=87–95 \|type=DPhil \|publisher=University of Oxford \|url=https://perma.cc/E47H-WCVP}}</ref><ref name=Chin1994>{{cite journal \|last1=Chin \|first1=Andrew \|date=1994 \|title=Locality-Preserving Hash Functions for General Purpose Parallel Computation \|url=http://unclaw.com/chin/scholarship/hashfunctions.pdf \|journal=Algorithmica \|volume=12 \|issue=2–3 \|pages=170–181 \|doi=10.1007/BF01185209\|s2cid=18108051 }}</ref> ~~Locality preserving hashes are related to [[space-filling curve]]s.{{how\|date=March 2020}}~~ ===Amplification=== Line 71 ⟶ 60: {{citation \| author = Das, Abhinandan S. \| ~~journal~~title = Proceedings of the 16th ~~International~~international ~~Conference~~conference on World Wide Web ~~\| pages = 271~~▼ \| ~~title~~chapter = Google news personalization: scalable online collaborative filtering ▲ \| journal = Proceedings of the 16th International Conference on World Wide Web \| pages = 271 \| pages = 271–280 \| year = 2007\|doi=10.1145/1242572.1242610\|display-authors=etal\| isbn = 9781595936547 \| s2cid = 207163129 Line 84 ⟶ 74: {{citation \| author = Cochez, Michael \|author2=Mou, Hao \| ~~journal~~ title= ~~Proceeding SIGMOD '15~~ Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data ~~\| pages=505–517~~▼ ~~\| title = Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time~~ \|chapter=Twister Tries ▲ \| journal = Proceeding SIGMOD '15 Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data \| pages=505–517 \| pages=505–517 \| year = 2015 \| doi=10.1145/2723372.2751521\| isbn = 9781450327589 \|s2cid=14414777 Line 100 ⟶ 91: *[[VisualRank]] [[Gene expression]] similarity identification{{citation needed\|date=October 2013}} [[Audio similarity]] identification [[Nearest neighbor search]] [[Audio fingerprint]]<ref> Line 107 ⟶ 98: \| url = https://github.com/worldveil/dejavu\| date = 2018-12-19}} </ref> [[Digital video fingerprinting]]<ref> {{citation \| title = A Simple Introduction to Locality Sensitive Hashing (LSH) \| url =https://www.iunera.com/kraken/fabric/locality-sensitive-hashing-lsh/#7-applications-of-lsh\| date = 2025-03-27}} </ref> [[Shared memory]] organization in [[parallel computing]]<ref name=Chin1991 /><ref name=Chin1994 /> Physical data organization in database management systems<ref> {{citation Line 120 ⟶ 116: {{cite conference \| author1 = Oliver, Jonathan\| author2 = Cheng, Chun \| author3 = Chen, Yanggui ~~\| title = TLSH - a locality sensitive hash~~ \| ~~journal~~title = ~~4th~~2013 Fourth Cybercrime and Trustworthy Computing Workshop \| chapter = TLSH -- A Locality Sensitive Hash \| year = 2013 \| year = 2013▼ \| pages = 7–13 \| url = https://www.academia.edu/7833902 \| doi = 10.1109/CTC.2013.9 \| isbn = 978-1-4799-3076-0 }} </ref> [[Machine Learning]]<ref name="NL"> {{citation \| author1 = Fanaee-T, Hadi \| title = Natural Learning ▲ \| year = ~~2013~~2024 \| arxiv = 2404.05903 }} </ref> Line 201 ⟶ 204: # The encoding must be robust against intentional attacks. # The encoding should support an extremely low risk of false positives. Testing performed in the paper on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash.<ref>{{cite journal\|author=Oliver\|display-authors=etal\|title=TLSH - A Locality Sensitive Hash\|url=https://www.academia.edu/7833902\|journal=4th Cybercrime and Trustworthy Computing Workshop\|date=2013\|accessdate=2015-06-04}}</ref>▼ ====TLSH==== '''TLSH''' is locality-sensitive hashing algorithm designed for a range of security and digital forensic applications.<ref name="TLSH"/> The goal of TLSH is to generate hash digests for messages such that low distances between digests indicate that their corresponding messages are likely to be similar. ▲Testing performed in the paper on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash. An implementation of TLSH is available as [[open-source software]].<ref>{{cite web\|url=https://github.com/trendmicro/tlsh \|title=TLSH \|website=[[GitHub]] \|access-date=2014-04-10}}</ref> Line 212 ⟶ 215: ===Random projection=== {{main\|Random projection}} [[File:Cosine-distance.png\| thumb \| <math>\frac{\theta(u,v)}{\pi}</math> is approximately proportional to <math>1-\cos(\theta(u,v))</math> on the interval [0, <math>\pi</math>]]] The random projection method of LSH due to [[Moses Charikar]]<ref name=Charikar2002 /> called [[SimHash]] (also sometimes called arccos<ref name=Andoni2008>{{cite journal Line 231 ⟶ 234: The basic idea of this technique is to choose a random [[hyperplane]] (defined by a normal unit vector {{mvar\|r}}) at the outset and use the hyperplane to hash input vectors. Given an input vector {{mvar\|v}} and a hyperplane defined by {{mvar\|r}}, we let <math>h(v) = \sgn(v \cdot r)</math>. That is, <math>h(v) = \pm 1</math> depending on which side of the hyperplane {{mvar\|v}} lies. This way, each possible choice of a random hyperplane {{mvar\|r}} can be interpreted as a hash function <math>h(v)</math>. For two vectors ~~<math>~~{{mvar\|u,v~~</math>~~}} with angle <math>\theta(u,v)</math> between them, it can be shown that :<math>Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}.</math> Since the ratio between <math>\frac{\theta(u,v)}{\pi}</math> ~~is proportional to~~and <math>1-\cos(\theta(u,v))</math>. Inis ~~fact~~at ~~the~~least ~~ratio~~0.439 iswhen ~~always~~<math>\theta(u, ~~within~~v) a\in ~~factor~~[0, ~~of .87856~~\pi]</math>,<ref name=Charikar2002 /><ref name="Goemans Williamson 1995 pp. 1115–1145">{{cite journal \| ~~last~~last1=Goemans \| ~~first~~first1=Michel X. \| last2=Williamson \| first2=David P. \| title=Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming \| journal=Journal of the ACM \| publisher=Association for Computing Machinery (ACM) \| volume=42 \| issue=6 \| year=1995 \| issn=0004-5411 \| doi=10.1145/227683.227684 \| pages=1115–1145\| s2cid=15794408 \| doi-access=free }}</ref>~~. Meaning~~ the probability of ~~the~~ two vectors being on ~~the same~~different ~~side~~sides of the random hyperplane is approximately proportional to the [[~~Cosine_similarity~~Cosine similarity#~~Cosine_Distance~~Cosine Distance\|cosine distance]] between them. ===Stable distributions=== Line 270 ⟶ 273: \| url = http://hal.inria.fr/inria-00567191/en/ \| journal = Pattern Recognition Letters \| volume = 31 \| issue = 11 \| pages = 1348–1358 \| doi = 10.1016/j.patrec.2010.04.004 \| bibcode = 2010PaReL..31.1348P \| s2cid = 2666044 }}</ref> In particular k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee. ===Semantic hashing=== Semantic hashing is a technique that attempts to map input items to addresses such that closer inputs have higher [[semantic similarity]].<ref>{{Cite journal\|last1=Salakhutdinov\|first1=Ruslan\|last2=Hinton\|first2=Geoffrey\|date=2008\|title=Semantic hashing\|journal=International Journal of Approximate Reasoning\|language=en\|volume=50\|issue=7\|pages=969–978\|doi=10.1016/j.ijar.2008.11.006\|doi-access=free}}</ref> The hashcodes are found via training of an [[artificial neural network]] or [[graphical model]].{{cncitation needed\|date=September 2021}} ==Algorithm for nearest neighbor search== Line 284 ⟶ 287: In the preprocessing step we hash all {{mvar\|n}} {{mvar\|d}}-dimensional points from the data set {{mvar\|S}} into each of the {{mvar\|L}} hash tables. Given that the resulting hash tables have only {{mvar\|n}} non-zero entries, one can reduce the amount of memory used per each hash table to <math>O(n)</math> using standard [[hash functions]]. Given a query point {{mvar\|q}}, the algorithm iterates over the {{mvar\|L}} hash functions {{mvar\|g}}. For each {{mvar\|g}} considered, it retrieves the data points that are hashed into the same bucket as {{mvar\|q}}. The process is stopped as soon as a point within distance ~~<math>~~{{mvar\|cR~~</math>~~}} from {{mvar\|q}} is found. Given the parameters {{mvar\|k}} and {{mvar\|L}}, the algorithm has the following performance guarantees: Line 290 ⟶ 293: space: <math>O(nL)</math>, plus the space for storing data points; * query time: <math>O(L(kt+dnP_2^k))</math>; * the algorithm succeeds in finding a point within distance ~~<math>~~{{mvar\|cR~~</math>~~}} from {{mvar\|q}} (if there exists a point within distance {{mvar\|R}}) with probability at least <math>1 - ( 1 - P_1^k ) ^ L</math>; For a fixed approximation ratio <math>c=1+\epsilon</math> and probabilities <math>P_1</math> and <math>P_2</math>, one can set <math>k = \~~big~~left\lceil{\tfrac{\log n}{\log 1/P_2}}\~~big~~right\rceil</math> and <math>L = \lceil P_1^{-k}\rceil = O(n^{\rho}P_1^{-1})</math>, where <math>\rho={\tfrac{\log P_1}{\log P_2}}</math>. Then one obtains the following performance guarantees: * preprocessing time: <math>O(n^{1+\rho}P_1^{-1}kt)</math>; * space: <math>O(n^{1+\rho}P_1^{-1})</math>, plus the space for storing data points; * query time: <math>O(n^{\rho}P_1^{-1}(kt+d))</math>; ===Finding nearest neighbor without fixed dimensionality=== To generalize the above algorithm without radius {{mvar\|R}} being fixed, we can take the algorithm and do a sort of binary search over {{mvar\|R}}. It has been shown<ref>{{cite journal \|last1=Har-Peled \|first1=Sariel \|last2=Indyk \|first2=Piotr \|last3=Motwani \|first3=Rajeev \|title=Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality \|journal=Theory of Computing \|date=2012 \|volume=8 \|issue=Special Issue in Honor of Rajeev Motwani \|pages=321-350 \|doi=10.4086/toc.2012.v008a014 \|url=https://theoryofcomputing.org/articles/v008a014/v008a014.pdf \|access-date=23 May 2025}}</ref> that there is a data structure for the approximate nearest neighbor with the following performance guarantees: * space: <math>O(n^{1+\rho}P_1^{-1}d\log^2 n)</math>; * query time: <math>O(n^{\rho}P_1^{-1}(kt+d)\log n)</math>; * the algorithm succeeds in finding the nearest neighbor with probability at least <math>1 - (( 1 - P_1^k ) ^ L\log n)</math>; ===Improvements=== Line 301 ⟶ 311: When {{mvar\|t}} is large, it is possible to reduce the hashing time from <math>O(n^{\rho})</math>. This was shown by<ref>Dahlgaard, Søren, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. [https://arxiv.org/pdf/1704.04370 "Fast similarity sketching."] 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2017.</ref> and<ref>Christiani, Tobias. [https://arxiv.org/pdf/1708.07586 "Fast locality-sensitive hashing frameworks for approximate near neighbor search."] International Conference on Similarity Search and Applications. Springer, Cham, 2019.</ref> which gave * query time: <math>O(t\log^2(1/P_2)/P_1 + n^{\rho}(d + 1/P_1))</math>; * space: <math>O(n^{1+\rho}/P_1 + \log^2(1/P_2)/P_1)</math>; It is also sometimes the case that the factor <math>1/P_1</math> can be very large. This happens for example with [[Jaccard similarity]] data, where even the most similar neighbor often has a quite low Jaccard similarity with the query. In<ref>Ahle, Thomas Dybdahl. "On the Problem of $$ <math>p_1^{-1} $$</math> in Locality-Sensitive Hashing." International Conference on Similarity Search and Applications. Springer, Cham, 2020.</ref> it was shown how to reduce the query time to <math>O(n^\rho/P_1^{1-\rho})</math> (not including hashing costs) and similarly the space usage. ==See also== [[Bloom filter]] {{Annotated link \|Bloom filter}} [[ {{Annotated link \|Curse of dimensionality]]}} [[ {{Annotated link \|Feature hashing]]}} [[ {{Annotated link \|Fourier-related transforms]]}} [[Geohash]] * {{Annotated link \|Geohash}} [[ {{Annotated link \|Multilinear subspace learning]]}} [[ {{Annotated link \|Principal component analysis]]}} [[Random indexing]]<ref>Gorman, James, and James R. Curran. [https://aclanthology.org/P06-1046.pdf "Scaling distributional similarity to large corpora."] Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.</ref> [[ {{Annotated link \|Rolling hash]]}} [[ {{Annotated link \|Singular value decomposition]]}} [[ {{Annotated link \|Sparse distributed memory]]}} [[ {{Annotated link \|Wavelet compression]]}} {{Annotated link \|Locality of reference}} ==References== Line 329 ⟶ 341: Samet, H. (2006) ''Foundations of Multidimensional and Metric Data Structures''. Morgan Kaufmann. {{ISBN\|0-12-369446-9}} {{Cite conference\| last1 = Indyk \| first1 = Piotr \| author1-link= Piotr Indyk\| last2 = Motwani \| first2 = Rajeev \| author2-link = Rajeev Motwani\| last3 = Raghavan \| first3 = Prabhakar\| last4 = Vempala \| first4 = Santosh \| author4-link = Santosh Vempala\| title = Locality-preserving hashing in multidimensional spaces \| doi = 10.1145/258533.258656 \| book-title = Proceedings of the twenty-ninth annual ACM symposium on Theory of computing\| conference = [[Symposium on Theory of Computing\|STOC '97]]\| pages = 618–625\| year = 1997 \| isbn = 978-0-89791-888-6\| citeseerx = 10.1.1.50.4927\| s2cid = 15693787 }} {{Cite journal \| last1 = Chin \| first1 = Andrew\| title = Locality-preserving hash functions for general purpose parallel computation \| url = http://www.unclaw.com/chin/scholarship/hashfunctions.pdf\| doi = 10.1007/BF01185209 \| journal = [[Algorithmica]]\| volume = 12 \| issue = 2–3 \| pages = 170–181 \| year = 1994 \| s2cid = 18108051}} Line 335 ⟶ 346: ==External links== [http://web.mit.edu/andoni/www/LSH/index.html Alex Andoni's LSH homepage] * [~~http~~https://lshkit.sourceforge.net/ LSHKIT: A C++ Locality Sensitive Hashing Library] * [https://github.com/simonemainardi/LSHash A Python Locality Sensitive Hashing library that optionally supports persistence via redis] * [https://web.archive.org/web/20101203074412/http://www.vision.caltech.edu/malaa/software/research/image-search/ Caltech Large Scale Image Search Toolbox]: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.