Vectors are just getting started.
I used random projection hashing to increase search speed: you can match directly (or at least narrow down the search) instead of computing the Euclidean distance for every row.
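For anyone curious, the bucketing trick can be sketched in a few lines. This is a toy illustration with made-up sizes and names (numpy assumed), not the actual code:

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits, n_rows = 64, 12, 2000

# Random hyperplanes: each hash bit records which side of a plane a vector falls on.
planes = rng.standard_normal((n_bits, dim))

def rp_hash(v):
    bits = (planes @ v) > 0
    key = 0
    for b in bits:
        key = (key << 1) | int(b)
    return key

data = rng.standard_normal((n_rows, dim))
buckets = defaultdict(list)
for i, row in enumerate(data):
    buckets[rp_hash(row)].append(i)

def query(q):
    # Only score rows that share the query's bucket, not all n_rows.
    cand = buckets.get(rp_hash(q), [])
    if not cand:
        return None
    dists = np.linalg.norm(data[cand] - q, axis=1)
    return cand[int(np.argmin(dists))]
```

Nearby vectors tend to land in the same bucket, so the exact-distance scan runs over a small candidate set instead of the whole table.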
The demand for vector embedding models (like those released by OpenAI, Cohere, HuggingFace, etc) and vector databases (like https://pinecone.io -- disclosure: I work there) has only grown since then. The market has decided that vectors are not, in fact, over.
There are benchmarks at http://ann-benchmarks.com/ , but LSH underperforms state-of-the-art ANN algorithms like HNSW on the recall/throughput tradeoff.
LSH was, I believe, state of the art about ten years ago, but it has since been surpassed. The caching aspect is really nice, though.
This approach seems feasible tbh. For example, a stock's historical bids/asks probably don't deviate greatly from month to month. That said, generating a good hash depends on the stock ticker, and a human doesn't have the time to find a good one for every stock at scale.
An HNSW index is slow to construct, so it is best suited for search or recommendation engines where you build the index once and then serve it. For workloads that continuously mutate the index, like streaming clustering/deduplication, LSH outperforms HNSW.
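The reason is that an LSH index is just hash tables, so inserts and deletes are O(1) bucket operations with no graph to rewire. A rough sketch of streaming dedup along those lines (all names and sizes illustrative, numpy assumed):

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(1)
dim, n_bits = 32, 10
planes = rng.standard_normal((n_bits, dim))  # random hyperplanes

def bucket_key(v):
    bits = (planes @ v) > 0
    key = 0
    for b in bits:
        key = (key << 1) | int(b)
    return key

index = defaultdict(dict)  # bucket key -> {item_id: vector}

def insert(item_id, v):
    # O(1): append to one hash bucket; no graph edges to rewire.
    index[bucket_key(v)][item_id] = v

def remove(item_id, v):
    # O(1): delete from the item's bucket, again with no global restructuring.
    index[bucket_key(v)].pop(item_id, None)

def near_duplicate(v, tol=0.1):
    # Streaming dedup: only compare against items sharing the bucket.
    for other in index[bucket_key(v)].values():
        if np.linalg.norm(other - v) < tol:
            return True
    return False
```

Contrast this with HNSW, where every insert has to search the graph and splice in edges at multiple layers, and deletes are awkward enough that many implementations only tombstone.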
It might not be trendy, but that doesn't mean it can't work as well as, or better than, HNSW. It all depends on the hashing function you come up with.
see Ullman's text, Mining of Massive Datasets. it's free on the web.
Vectors didn't go anywhere. The article is discussing which function to use to interpret a vector.
Is there a special meaning of 'vector' here that I am missing? Is it so synonymous in the ML context with 'multidimensional floating point state space descriptor' that any other use is not a vector any more?
I was as confused and annoyed as you were, since I don't have a machine learning background either.
Hopefully someone who knows math will enter the field one day and build the theoretical basis for all this mess and allow us to make real progress.
> But another important goal is inventing new methods, new techniques, and yes, new tricks. In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optics theory, the steam engine preceded thermodynamics, the airplane preceded flight aerodynamics, radio and data communication preceded information theory, the computer preceded computer science.
[1] https://www.reddit.com/r/MachineLearning/comments/7i1uer/n_y...
Wouldn't the first part of the analogy actually be:
A 1-second flight that will probably land at your exact destination, but could potentially land you anywhere on Earth?
I could see the hash approach, at a functional level, resulting in different features essentially getting a different number of bits, which would be approximately equivalent to a NN with variable-precision floats, all in a very hand-wavy way.
E.g., we could say a NN/NH needs N bits of information to work accurately, in which case you're trading off the format of, and the operations on, those N bits.
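To make the bit-budget idea a little more concrete (purely illustrative, not anything from the article): uniformly quantizing the same weights to fewer bits gives a strictly coarser approximation, which is the tradeoff being gestured at:

```python
import numpy as np

def quantize(x, n_bits, lo=-1.0, hi=1.0):
    """Round x onto a uniform grid of 2**n_bits levels spanning [lo, hi]."""
    levels = 2 ** n_bits - 1
    scaled = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
    return lo + np.round(scaled * levels) / levels * (hi - lo)

w = np.array([0.123, -0.456, 0.789])
# Fewer bits -> coarser representation of the same weights.
w8 = quantize(w, 8)  # close to the originals
w2 = quantize(w, 2)  # only 4 levels available, large rounding error
```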
The natural question is: how are you going to train it?
Are they re-inventing autoencoders?
Am I incorrect in thinking we are headed to future AIs that jump to conclusions? Or is it just my "human neural hash" being triggered in error?!