LSH¶

Locality Sensitive Hashing can help you search through enormous data sets for approximated nearest neighbors. If you want to read more about this algorithm try the following sources:

Introduction

The gist of the algorithm is that data points (vectors) that are close in some high dimensional space will be likely to have the same hash. The hash functions we choose to hash the vectors determine the distance function we use to define “closeness”. At the moment we expose the following hashers:

Hasher	Distance/ similarity
Sign Random Projections	Cosine similarity
P-stable distributions	Euclidean

Hyperparameters¶

The LSH algorithm requires two hyperparameters:

The length of the generated hash k. A larger value for k leads to less hash collisions, thus faster query times.
The number of different hash tables L. There will be L hash tables with L randomly generated hash functions.

The L hyperparamter can be derived from the query success probability and k. Read my blog post on that subject to get an explanation.

L2¶

The L2 LSH has an additional hyperparameter r. This is the width of bucket hash values can fall in. If you normalize your data by the distance threshold \(R\) this hyperparameter should be approximately 4.