Locality-sensitive hashing: Difference between revisions

Content deleted Content added
Nilsimsa Hash: add citation
Citation bot (talk | contribs)
Alter: url. URLs might have been anonymized. Add: bibcode, s2cid, authors 1-1. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox2 | #UCB_webform_linked 873/2384
Line 202:
# The encoding should support an extremely low risk of false positives.
 
Testing performed in the paper on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash.<ref>{{cite journal|author=Oliver|display-authors=etal|title=TLSH - A Locality Sensitive Hash|url=https://www.academia.edu/7833902/TLSH_-A_Locality_Sensitive_Hash|journal=4th Cybercrime and Trustworthy Computing Workshop|date=2013|accessdate=2015-06-04}}</ref>
 
====TLSH====
Line 237:
<math>Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}</math>
 
<math>\frac{\theta(u,v)}{\pi}</math> is proportional to <math>1-\cos(\theta(u,v))</math>. In fact the ratio is always within a factor of .87856.<ref name=Charikar2002 /><ref name="Goemans Williamson 1995 pp. 1115–1145">{{cite journal | lastlast1=Goemans | firstfirst1=Michel X. | last2=Williamson | first2=David P. | title=Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming | journal=Journal of the ACM | publisher=Association for Computing Machinery (ACM) | volume=42 | issue=6 | year=1995 | issn=0004-5411 | doi=10.1145/227683.227684 | pages=1115–1145| s2cid=15794408 }}</ref> Meaning the probability of the two vectors being on the same side of the random hyperplane is proportional to the [[Cosine_similarity#Cosine_Distance|cosine distance]] between them.
 
===Stable distributions===
Line 270:
| url = http://hal.inria.fr/inria-00567191/en/
| journal = Pattern Recognition Letters
| volume = 31 | issue = 11 | pages = 1348–1358 | doi = 10.1016/j.patrec.2010.04.004 | bibcode = 2010PaReL..31.1348P }}</ref>
In particular k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.