Locality-sensitive hashing: Difference between revisions

Content deleted Content added
restore incorrectly removed subtitle
m clean up
Line 25:
* a threshold <math>R>0</math>,
* an approximation factor <math>c>1</math>,
* and probabilities <math>P_1</math> and <math>P_2</math>.
 
This family <math>\mathcal F</math> is a set of functions <math>h\colon M \to S</math> that map elements of the metric space to buckets <math>s \in S</math>. An LSH family must satisfy the following conditions for any two points <math>p, q \in M</math> and any hash function <math>h</math> chosen uniformly at random from <math>\mathcal F</math>:
Line 52:
This is in contrast to [[cryptography|cryptographic]] hash functions and [[checksum]]s, which are designed to have [[Avalanche effect|random output difference between adjacent inputs]].
 
The first family of locality-preserving hash functions was devised as a way to facilitate [[Pipeline (computing)|data pipelining]] in implementations of [[Parallel RAM|parallel random-access machine (PRAM)]] algorithms that use [[universal hashing]] to reduce memory [[Resource_contentionResource contention|contention]] and [[network congestion]].<ref name=Chin1991>{{cite thesis |last=Chin |first=Andrew |date=1991 |title=Complexity Issues in General Purpose Parallel Computing |pages=87–95 |type=DPhil |publisher=University of Oxford |url=https://perma.cc/E47H-WCVP}}</ref><ref name=Chin1994>{{cite journal |last1=Chin |first1=Andrew |date=1994 |title=Locality-Preserving Hash Functions for General Purpose Parallel Computation |url=http://unclaw.com/chin/scholarship/hashfunctions.pdf |journal=Algorithmica |volume=12 |issue=2–3 |pages=170–181 |doi=10.1007/BF01185209|s2cid=18108051 }}</ref>
 
Locality preserving hashes are related to [[space-filling curve]]s.{{how|date=March 2020}}
Line 73:
| title = Proceedings of the 16th international conference on World Wide Web
| chapter = Google news personalization: scalable online collaborative filtering
| date = 2007
| pages = 271–280
| year = 2007|doi=10.1145/1242572.1242610|display-authors=etal| isbn = 9781595936547
Line 88 ⟶ 87:
|title=Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
|chapter=Twister Tries
|date=2015
| pages=505–517
| year = 2015 | doi=10.1145/2723372.2751521| isbn = 9781450327589
Line 241 ⟶ 239:
<math>Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}</math>
 
<math>\frac{\theta(u,v)}{\pi}</math> is proportional to <math>1-\cos(\theta(u,v))</math>. In fact the ratio is always within a factor of .87856.<ref name=Charikar2002 /><ref name="Goemans Williamson 1995 pp. 1115–1145">{{cite journal | last1=Goemans | first1=Michel X. | last2=Williamson | first2=David P. | title=Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming | journal=Journal of the ACM | publisher=Association for Computing Machinery (ACM) | volume=42 | issue=6 | year=1995 | issn=0004-5411 | doi=10.1145/227683.227684 | pages=1115–1145| s2cid=15794408 }}</ref> Meaning the probability of the two vectors being on the same side of the random hyperplane is proportional to the [[Cosine_similarityCosine similarity#Cosine_DistanceCosine Distance|cosine distance]] between them.
 
===Stable distributions===
Line 278 ⟶ 276:
 
===Semantic hashing===
Semantic hashing is a technique that attempts to map input items to addresses such that closer inputs have higher [[semantic similarity]].<ref>{{Cite journal|last1=Salakhutdinov|first1=Ruslan|last2=Hinton|first2=Geoffrey|date=2008|title=Semantic hashing|journal=International Journal of Approximate Reasoning|language=en|volume=50|issue=7|pages=969–978|doi=10.1016/j.ijar.2008.11.006|doi-access=free}}</ref> The hashcodes are found via training of an [[artificial neural network]] or [[graphical model]].{{cncitation needed|date=September 2021}}
 
==Algorithm for nearest neighbor search==
Line 333 ⟶ 331:
 
*Samet, H. (2006) ''Foundations of Multidimensional and Metric Data Structures''. Morgan Kaufmann. {{ISBN|0-12-369446-9}}
 
*{{Cite conference| last1 = Indyk | first1 = Piotr | author1-link= Piotr Indyk| last2 = Motwani | first2 = Rajeev | author2-link = Rajeev Motwani| last3 = Raghavan | first3 = Prabhakar| last4 = Vempala | first4 = Santosh | author4-link = Santosh Vempala| title = Locality-preserving hashing in multidimensional spaces | doi = 10.1145/258533.258656 | book-title = Proceedings of the twenty-ninth annual ACM symposium on Theory of computing| conference = [[Symposium on Theory of Computing|STOC '97]]| pages = 618–625| year = 1997 | isbn = 978-0-89791-888-6| citeseerx = 10.1.1.50.4927| s2cid = 15693787 }}
*{{Cite journal | last1 = Chin | first1 = Andrew| title = Locality-preserving hash functions for general purpose parallel computation | url = http://www.unclaw.com/chin/scholarship/hashfunctions.pdf| doi = 10.1007/BF01185209 | journal = [[Algorithmica]]| volume = 12 | issue = 2–3 | pages = 170–181 | year = 1994 | s2cid = 18108051}}