Locality-sensitive hashing: Difference between revisions

Revision as of 15:26, 27 March 2025 by Astrid224442 (12 edits): m added a reference for applications
Latest revision as of 21:56, 9 August 2025 by Bender the Bot (bot, 1,064,377 edits): m →External links: HTTP to HTTPS for SourceForge (Tag: AWB)
(4 intermediate revisions by 4 users not shown)
Line 1:
{{Short description|Algorithmic technique using hashing}}
In [[computer science]], '''locality-sensitive hashing''' ('''LSH''') is a [[fuzzy hashing]] technique that hashes similar input items into the same "buckets" with high probability.<ref name="MOMD">{{cite web|url=http://infolab.stanford.edu/~ullman/mmds.html|title=Mining of Massive Datasets, Ch. 3.|last1=Rajaraman|first1=A.|last2=Ullman|first2=J.|author2-link=Jeffrey Ullman|year=2010}}</ref> (The number of buckets is much smaller than the universe of possible input items.)<ref name="MOMD" /> Since similar items end up in the same buckets, this technique can be used for [[Cluster analysis|data clustering]] and [[nearest neighbor search]]. It differs from [[Hash function|conventional hashing techniques]] in that [[hash collision]]s are maximized, not minimized. Alternatively, the technique can be seen as a way to [[dimension reduction|reduce the dimensionality]] of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
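The bucketing idea in the paragraph above can be illustrated with a minimal sketch. It assumes one particular hash family, sign hashing against random hyperplanes (the random projection approach named later in the article), and the helper names <code>make_hyperplanes</code>, <code>lsh_bucket</code> and <code>build_index</code> are hypothetical, chosen only for this illustration. It is not taken from any cited reference or library.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components.

    Assumption for this sketch: a seeded generator, so results are repeatable.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_bucket(vector, hyperplanes):
    """Return the bucket key: one bit per hyperplane, set when the vector
    lies on the non-negative side of that hyperplane."""
    return tuple(
        int(sum(h_i * x_i for h_i, x_i in zip(h, vector)) >= 0.0)
        for h in hyperplanes
    )


def build_index(items, hyperplanes):
    """Group item names by bucket key; items sharing a key are collision
    candidates, i.e. likely to point in similar directions."""
    buckets = {}
    for name, vec in items.items():
        buckets.setdefault(lsh_bucket(vec, hyperplanes), []).append(name)
    return buckets
```

With <code>num_bits</code> hyperplanes there are at most 2<sup>num_bits</sup> buckets, far fewer than the number of possible input vectors. Vectors separated by a small angle agree on most bits and therefore tend to collide, while a vector and any positive multiple of it always share a bucket.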
Hashing-based approximate [[nearest-neighbor search]] algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).<ref>{{cite conference |last1=Zhao |first1=Kang |last2=Lu |first2=Hongtao |last3=Mei |first3=Jincheng |title=Locality Preserving Hashing |conference=AAAI Conference on Artificial Intelligence | volume=28 | year=2014 |url=https://ojs.aaai.org/index.php/AAAI/article/view/9133/8992 |pages=2874–2880}}</ref><ref>{{cite book |last1=Tsai |first1=Yi-Hsuan |last2=Yang |first2=Ming-Hsuan |title=2014 IEEE International Conference on Image Processing (ICIP) |chapter=Locality preserving hashing |date=October 2014 |pages=2988–2992 |doi=10.1109/ICIP.2014.7025604 |isbn=978-1-4799-5751-4 |s2cid=8024458 |issn=1522-4880}}</ref>
Line 215:
===Random projection===
{{main|Random projection}}
[[File:Cosine-distance.png| thumb | <math>\frac{\theta(u,v)}{\pi}</math> is approximately proportional to <math>1-\cos(\theta(u,v))</math> on the interval [0, <math>\pi</math>]]]
The random projection method of LSH due to [[Moses Charikar]]<ref name=Charikar2002 /> called [[SimHash]] (also sometimes called arccos<ref name=Andoni2008>{{cite journal
Line 299:
* space: <math>O(n^{1+\rho}P_1^{-1})</math>, plus the space for storing data points;
* query time: <math>O(n^{\rho}P_1^{-1}(kt+d))</math>;
===Finding nearest neighbor without fixed dimensionality===
To generalize the above algorithm without radius {{mvar|R}} being fixed, we can take the algorithm and do a sort of binary search over {{mvar|R}}.
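The binary search over {{mvar|R}} can be sketched as follows, under stated assumptions. The fixed-radius query is replaced here by a brute-force stand-in (<code>near_neighbor_bruteforce</code>), the candidate radii form a geometric grid, and all function names and the <code>growth</code> parameter are hypothetical. A real LSH structure would answer each fixed-radius query only approximately and with some failure probability, which this sketch does not model.

```python
import math


def near_neighbor_bruteforce(points, query, radius):
    """Stand-in for a fixed-radius LSH query (an assumption of this sketch):
    return any point within `radius` of `query`, or None if there is none."""
    for p in points:
        if math.dist(p, query) <= radius:
            return p
    return None


def radius_grid(r_min, r_max, growth=2.0):
    """Candidate radii r_min, r_min*growth, ... up to the first value >= r_max."""
    radii = [r_min]
    while radii[-1] < r_max:
        radii.append(radii[-1] * growth)
    return radii


def search_over_radius(points, query, radii, near_neighbor=near_neighbor_bruteforce):
    """Binary-search the sorted radii for the smallest one at which the
    fixed-radius query succeeds; return (radius, point) or None."""
    lo, hi = 0, len(radii) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = near_neighbor(points, query, radii[mid])
        if hit is not None:
            best = (radii[mid], hit)  # success: try a smaller radius
            hi = mid - 1
        else:
            lo = mid + 1              # failure: a larger radius is needed
    return best
```

The search issues a number of fixed-radius queries that is logarithmic in the number of candidate radii. With the exact stand-in used here, the returned point lies within the smallest successful radius, so its distance exceeds the true nearest distance by at most a factor of <code>growth</code> whenever a smaller radius on the grid failed.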
It has been shown<ref>{{cite journal |last1=Har-Peled |first1=Sariel |last2=Indyk |first2=Piotr |last3=Motwani |first3=Rajeev |title=Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality |journal=Theory of Computing |date=2012 |volume=8 |issue=Special Issue in Honor of Rajeev Motwani |pages=321-350 |doi=10.4086/toc.2012.v008a014 |url=https://theoryofcomputing.org/articles/v008a014/v008a014.pdf |access-date=23 May 2025}}</ref> that there is a data structure for the approximate nearest neighbor with the following performance guarantees:
* space: <math>O(n^{1+\rho}P_1^{-1}d\log^2 n)</math>;
* query time: <math>O(n^{\rho}P_1^{-1}(kt+d)\log n)</math>;
* the algorithm succeeds in finding the nearest neighbor with probability at least <math>1 - (( 1 - P_1^k ) ^ L\log n)</math>;
===Improvements===
Line 325 ⟶ 332:
* {{Annotated link |Sparse distributed memory}}
* {{Annotated link |Wavelet compression}}
* {{Annotated link |Locality of reference}}
==References==
Line 338 ⟶ 346:
==External links==
* [http://web.mit.edu/andoni/www/LSH/index.html Alex Andoni's LSH homepage]
* [~~http~~https://lshkit.sourceforge.net/ LSHKIT: A C++ Locality Sensitive Hashing Library]
* [https://github.com/simonemainardi/LSHash A Python Locality Sensitive Hashing library that optionally supports persistence via redis]
* [https://web.archive.org/web/20101203074412/http://www.vision.caltech.edu/malaa/software/research/image-search/ Caltech Large Scale Image Search Toolbox]: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.