No one measures vector distance using the Hamming distance on binary representations. That's silly. We use L1 or L2, usually, and the binary encoding of the numbers is irrelevant.
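To make the "binary encoding is irrelevant" point concrete, here's a quick sketch: two adjacent IEEE-754 doubles (numerically almost equal) can differ in 53 bits, while 1.0 and -1.0, which are numerically 2.0 apart, differ in only the sign bit. The helper name is mine, not from any library.

```python
import math
import struct

def hamming_bits(a: float, b: float) -> int:
    """Count differing bits between the IEEE-754 encodings of two doubles."""
    (x,) = struct.unpack("<Q", struct.pack("<d", a))
    (y,) = struct.unpack("<Q", struct.pack("<d", b))
    return bin(x ^ y).count("1")

# Adjacent doubles (numerically almost equal) differ in 53 bits...
near_one = math.nextafter(1.0, 0.0)
print(hamming_bits(1.0, near_one))  # 53
# ...while 1.0 and -1.0 (numeric distance 2.0) differ in a single sign bit.
print(hamming_bits(1.0, -1.0))      # 1
```

So Hamming distance on raw bit patterns has essentially no relationship to numeric distance, which is why metrics like L1/L2 operate on the values themselves.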
It sounds like the LSH is maaaaybe equivalent to vector quantization, in which case this would be a form of regularization, which sometimes works well and sometimes is meh.
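For concreteness, vector quantization snaps each vector to the nearest entry of a codebook and compares codes instead of raw vectors. A toy sketch, using a hypothetical hand-picked 2-D codebook rather than one learned from data:

```python
def nearest(v, centroids):
    """Index of the centroid closest to v in squared L2 distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])))

# Hypothetical codebook of 4 centroids; a real one would come from k-means.
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

print(nearest((0.1, 0.9), codebook))  # -> 1, snapped to (0.0, 1.0)
print(nearest((0.9, 0.8), codebook))  # -> 3, snapped to (1.0, 1.0)
```

The "regularization" reading comes from the snapping itself: small perturbations of a vector map to the same code, so fine-grained detail is deliberately thrown away.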
In terms of accuracy, it totally depends on the resolution needed. We can get >99% of L2's accuracy waaaaay faster with 1/10 of the memory overhead. For what we are doing, that is the perfect trade-off.
In terms of LSH, we tried projection hashing and quantization and were always disappointed.
Or is there actually some interesting hash-based neural algorithm lurking around somewhere?
You have a map from some high dimensional vector space ~ k^N -> H, some space of hashes. H sort of looks one dimensional. I assume that the interesting geometry of your training data actually lies on a relatively low dimensional subvariety/subset of k^N, so maybe it's not actually that bad? It could be a really twisted and complicated curve.
However, you still need to somehow preserve the relative structure, right? Things that are far apart in k^N need to be far apart in H. Seems like you want the map to at least approximately be an isometry. Although there are things like space-filling curves that might do this to some degree.
Also, maybe even though H looks low dimensional, it can actually capture quite a bit (if your data is encoded as coefficients of powers of 2, you could think of the powers of 2 as a sort of basis, so maybe H is also pretty high dimensional).
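One standard way to get that approximate metric preservation is random-hyperplane LSH (SimHash): each bit records which side of a random hyperplane the vector falls on, and the expected fraction of differing bits between two hashes equals their angle divided by pi. A minimal pure-Python sketch (the dimension and bit count are arbitrary choices for illustration):

```python
import random

random.seed(0)
d, n_bits = 64, 256
# One random Gaussian hyperplane normal per output bit.
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_bits)]

def simhash(v):
    # One bit per hyperplane: the sign of the dot product with its normal.
    return [sum(p_i * v_i for p_i, v_i in zip(p, v)) > 0 for p in planes]

def hamming(h1, h2):
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

a = [random.gauss(0, 1) for _ in range(d)]
b = [a_i + 0.1 * random.gauss(0, 1) for a_i in a]  # small perturbation of a
c = [random.gauss(0, 1) for _ in range(d)]          # unrelated vector

# Nearby vectors agree on most bits; unrelated ones disagree on roughly half.
print(hamming(simhash(a), simhash(b)), hamming(simhash(a), simhash(c)))
```

So it's not an isometry, but it is an approximate one in expectation for angular distance, which is exactly the locality-sensitive property.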
Or they are just shingling different ML hash functions, which is kinda lazy.
I used this Python library: https://github.com/trendmicro/tlsh.
LSH is most famously used for approximating Jaccard distances, which, even if you're not doing stuff like looking at lengths or distances in L1 or L2, is still a vector operation.
LSH is best described in Jeff Ullman's Mining of Massive Datasets textbook (available free online), which describes how it was used for webpage deduplication in the early days at Google.
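For the Jaccard case, the classic construction is MinHash: the probability that two sets have the same minimum under a random hash function equals their Jaccard similarity, so agreement across many hash functions estimates it. A rough sketch (the universal-hash parameters here are illustrative, not from any particular implementation):

```python
import random

random.seed(1)
N_HASHES = 128
P = (1 << 61) - 1  # a large Mersenne prime
# Each "hash function" is h(x) = (a*x + b) mod P with random a, b.
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(N_HASHES)]

def minhash(items):
    # Note: Python's built-in hash() for strings is salted per process,
    # but that's fine as long as both signatures come from the same run.
    ints = [hash(x) % P for x in items]
    return [min((a * x + b) % P for x in ints) for a, b in params]

def est_jaccard(sig1, sig2):
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox leaps over a sleepy dog".split())
true_j = len(A & B) / len(A | B)  # 6/11 ~ 0.545
print(true_j, est_jaccard(minhash(A), minhash(B)))
```

The estimate tightens as you add hash functions (standard error ~ 1/sqrt(N_HASHES)); banding the signatures, as MMDS describes, is what turns this into a sub-linear dedup index.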
Is the speedup that remarkable? I'd be curious as to the increase in speed versus loss of precision.
The speedup is really a necessity, as direct methods are just too expensive both dollar-wise and time-wise (lowish latency is the goal).
Edit: also, talking about LSH, one must check out the FAISS library https://github.com/facebookresearch/faiss and the current SOTA at http://ann-benchmarks.com/
It didn't.[1]
[1]: https://www.merriam-webster.com/words-at-play/pique-vs-peak-...