0 points · traverseda · 2y ago

This is pretty early in the game to be relying on proprietary embeddings, don't you think? Even if they are 20% better, blink and there will be a new normal.

It's insane to me that someone, this early in the gold rush, would be mining in someone else's mine, so to speak.

10 comments · 2 top-level

janalsncm · 2y ago · 7 in thread

It’s not just that. Embeddings aren’t magic. If you’re going to be creating embeddings for similarity search, the first thing you need to ask yourself is what makes two vectors similar such that two embeddings should even be close together?

There are a lot of related sources of similarity, but they’re slightly different. And I have no idea what Cohere is doing. Additionally, it’s not clear to me how queries can and should be embedded. Queries are typically much shorter than their associated documents, so they typically need to be trained jointly.

Selling “embeddings as a service” is a bit like selling hashing as a service. There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.
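To make the hash analogy concrete, here is a minimal sketch contrasting a cryptographic hash (SHA-256) with a toy locality-sensitive one (a SimHash-style fingerprint). The helper names `sha256_hex`, `simhash`, and `hamming` are made up for illustration, and this is not a claim about what any embedding vendor does. It only shows that the two kinds of hash are built for different goals.

```python
import hashlib

BITS = 64  # fingerprint width for the toy SimHash


def sha256_hex(text: str) -> str:
    # Cryptographic hash: any change to the input scrambles the whole digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash(text: str) -> int:
    # Toy SimHash (illustrative helper): every token votes on every bit,
    # so inputs that share most tokens tend to get nearby fingerprints.
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64 bits per token
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")


a = "embeddings are not magic"
b = "magic are not embeddings"  # same words, different order

print(sha256_hex(a) == sha256_hex(b))   # False
print(hamming(simhash(a), simhash(b)))  # 0: the votes ignore token order
```

Same input family, two "hashes", two very different notions of "close". Which one you want depends entirely on the problem, which is the point being made about embeddings.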

pradn · 2y ago

I'm with you on this. The vector embedding craze seems to be confusing mechanism and problem. The problem is semantic similarity search. One mechanism is vector embedding. I think all this comes from taking LLMs as a given, seeing that they work reasonably well with phrase-input semantic retrieval, and then hyper-optimizing vector embedding / search to achieve it.

Are there other semantic search systems? What happened to the entire field of Information Retrieval - is vector search the only method? Is all the stemming, linguistic analysis, and so on obsoleted by vectors?

Or is it purely because vector search is quick? That's just an engineering problem. I'm not convinced it's the only method here. Happy to be corrected!

bunderbunder · 2y ago

The entire field of information retrieval is still here. This was touched on by the O'Reilly article on lessons learned working with LLMs that hit the HN front page yesterday [1], in their section on RAG.

My sense is that you can currently break the whole thing down into two groups: the proverbial grownups in the room are typically building pipelines that are still doing it basically how the top-performing systems did in the '90s, with a souped up keyword and metadata search engine for the initial pass and an embedding model for catching some stuff it misses and/or result ranking. This isn't how most general-purpose search engines work, but it's likely how the ones you don't particularly mind using work. Web search, for example.
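A rough sketch of that two-stage shape, under loud assumptions: the "keyword engine" here is a plain term-overlap filter, and `toy_embed` is a character-trigram count vector standing in for a real embedding model. All function names are hypothetical. A real pipeline would use a proper search engine and a trained model.

```python
import math
import re
from collections import Counter


def terms(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text: str) -> Counter:
    # Stand-in for an embedding model (assumption): character trigram counts.
    s = " ".join(terms(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def hybrid_search(query: str, docs: list[str], top_k: int = 2) -> list[int]:
    q_terms = set(terms(query))
    # First pass: cheap keyword filter keeps docs sharing at least one term.
    candidates = [i for i, d in enumerate(docs) if q_terms & set(terms(d))]
    # Second pass: rerank the survivors by vector similarity to the query.
    q_vec = toy_embed(query)
    ranked = sorted(
        candidates,
        key=lambda i: cosine(q_vec, toy_embed(docs[i])),
        reverse=True,
    )
    return ranked[:top_k]


docs = [
    "vector search with embeddings",
    "keyword search engine tuning",
    "pasta recipes",
]
print(hybrid_search("keyword search", docs))  # [1, 0]
```

The vector step only ever sees what the keyword pass lets through, so the expensive part runs on a short candidate list. This sketch covers only the reranking use; using embeddings to catch what the keyword pass misses would need a second candidate source.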

And then there's the proverbial internet comments section, which wants to skip past all the boring labor-intensive oldschool stuff, and instead just begin and end with approximate nearest neighbors search using an off-the-shelf embedding model. The primary advantage to this approach - and I should admit here that I've tried it myself - is that you can bodge it together over a weekend and have the blog post up by Monday.

I guess what I'm getting at is, the people producing content on the Internet and the people producing effective software aren't necessarily the same people. I mean, heck, look at me, I'm only here to type this comment because I'm slacking off at work today.

1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

janalsncm · 2y ago

> Are there other semantic search systems?

Not a semantic search but stemming + BM25 often works surprisingly well and is a fast and cheap baseline.
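For anyone who hasn't seen that baseline, here is a minimal sketch. `crude_stem` is a toy suffix stripper invented for this example (a real setup would use something like a Porter stemmer), and the scoring is Okapi BM25 with the usual defaults `k1=1.5`, `b=0.75` and a non-negative IDF variant. Treat it as an illustration, not a tuned implementation.

```python
import math
import re
from collections import Counter


def crude_stem(token: str) -> str:
    # Toy stemmer (assumption): strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text: str) -> list[str]:
    return [crude_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long docs need more hits to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Embedding vectors for semantic search",
    "Stemming and BM25 ranking for keyword searches",
    "Cooking pasta at home",
]
scores = bm25_scores("stemmed search", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 1
```

The query "stemmed search" has no exact word in common with the second document, but after stemming both "stemmed"/"Stemming" and "search"/"searches" line up, which is most of why this baseline does better than people expect.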

bunderbunder · 2y ago

You probably won't find out exactly what they're doing any more than anyone's going to find out a whole lot of details on what OpenAI is doing with GPT. But, as the popularity of GPT demonstrates, it seems that many business customers are now comfortable embracing closed models.

There is more information here, though: https://cohere.com/blog/introducing-embed-v3

riku_iki · 2y ago

> There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.

And there are some standard hash functions in the library that cover 98% of use cases. I think the same goes for embeddings: you can train a foundational multitask model, and its embeddings will work for a variety of tasks too.

vineyardmike · 2y ago

> If you’re going to be creating embeddings for similarity search, the first thing you need to ask yourself is what makes two vectors similar such that two embeddings should even be close together?

I have no association with Cohere, but their docs clearly say that their embeddings were trained so that two similar vectors have similar "semantic meaning". Which is still pretty vague, but it's at least clear what their goals were.

> Selling “embeddings as a service” is a bit like selling hashing as a service.

Coincidentally, Cohere also aggressively advertises that they want you to fine-tune and co-develop custom models (with their proprietary services).

nostrebored · 2y ago

But this is the GP's point: that doesn’t mean they’re optimized for retrieval.

bunderbunder · 2y ago · 1 in thread

I have no idea. But that wasn't the question I was answering. It was, "how does the article's author estimate that would cost $5000?" And I think that's how. Or at least, that gets to a number that's in the same ballpark as what the author was suggesting.

That said, first guess, if you do want to evaluate Cohere embeddings for a commercial application, using this dataset could be a decent basis for a lower-cost spike.

jbellis · 2y ago

Yes, that is how I came up with that number.
