Embeddings are a good starting point for the AI curious app developer

(bawolf.substack.com)

675 points · bryantwolf · 2y ago · 174 comments

134 comments · 30 top-level

thisiszilff2y ago· 22 in thread

One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), assign each word a unique index between 0 and 49,999, and then produce an embedding for a text by adding 1 at a word's index each time that word occurs in the text. Then normalize the embedding so it adds up to one.

Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.

The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
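
A minimal sketch of the recipe above, assuming plain whitespace tokenization and a tiny vocabulary built from the example texts themselves (a real setup would fix a larger vocabulary up front). The helper names `build_vocab`, `embed`, and `cosine` are made up for illustration.

```python
import math
from collections import Counter

def build_vocab(texts):
    """Assign every distinct lowercase word a unique index."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    """Add 1 at a word's index per occurrence, then scale so entries sum to 1."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        if word in vocab:
            vec[vocab[word]] += count
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vocab = build_vocab(docs)
vectors = [embed(d, vocab) for d in docs]

print(round(cosine(vectors[0], vectors[1]), 2))  # 0.75 (shares "the", "cat", "on")
print(cosine(vectors[0], vectors[2]))            # 0.0 (no words in common)
```

Note that most of the overlap between the first two texts comes from "the", which is the stop-word problem the thread goes on to discuss.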

HarHarVeryFunny2y ago

Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.

The classic example is word embeddings such as word2vec, or GloVe, where due to the embeddings being meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".

thisiszilff2y ago

> I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.

In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants depending on how you normalize the embedding). Cosine similarity still works for even these super naive embeddings which makes it slightly easier to understand before getting into any mathy stuff.

You are 100% right this won't give you the word embedding analogies like king - man + woman ≈ queen or stuff like that. This embedding has no concept of relationships between words.
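
To make the "counting the words two texts have in common" point concrete, here is a small sketch assuming each dimension is a 0/1 presence flag rather than a count. The helper name `presence_vector` is hypothetical.

```python
def presence_vector(text, vocab):
    """1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

a = "the quick brown fox"
b = "the lazy brown dog"
vocab = sorted(set(a.split()) | set(b.split()))

va = presence_vector(a, vocab)
vb = presence_vector(b, vocab)

# The dot product only picks up positions where both vectors hold a 1,
# so it equals the number of distinct words the two texts share.
shared = sum(x * y for x, y in zip(va, vb))
print(shared)  # 2 ("the" and "brown")
```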

IanCal2y ago

They're not, I get why you think that though.

They're making a vector for a text that's the term frequencies in the document.

It's one step simpler than tfidf which is a great starting point.
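
For comparison, a rough sketch of the TF-IDF step up from raw term frequencies. It assumes whitespace tokenization and the plain unsmoothed `log(N / df)` weighting; libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) with term frequency scaled by inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid dividing by zero for empty texts
        vectors.append([counts[w] / length * idf[w] for w in vocab])
    return vocab, vectors

docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab, vectors = tfidf_vectors(docs)

# "the" occurs in every document, so its idf is log(1) = 0 and it drops out.
print([v[vocab.index("the")] for v in vectors])  # [0.0, 0.0, 0.0]
```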

pstorm2y ago

I'm trying to understand this approach. Maybe I am expecting too much out of this basic approach, but how does this create a similarity between words with indices close to each other? Wouldn't it just be a popularity contest - the more common words have higher indices and vice versa? For instance, "king" and "prince" wouldn't necessarily have similar indices, but they are semantically very similar.

svieira2y ago

You are expecting too much out of this basic approach. The "simple" similarity search in word2vec (used in https://semantle.com/ if you haven't seen it) is based on _multiple_ embeddings like this one (it's a simple neural network not a simple embedding).

_akhe2y ago

This is a simple example where it scores their frequency. If you scored every word by their frequency only you might have embeddings like this:

  act: [0.1]
  as:  [0.4]
  at:  [0.3]
  ...

That's a very simple 1D embedding, and like you said would only give you popularity. But say you wanted other stuff like its vulgarity, prevalence over time, whether it's slang or not, how likely it is to start or end a sentence, etc.; you would need more than 1 number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's like:

  act: [0.1, 0.1, 0.3, 0.0001, 0.000003, 0.003, ... (1536 items)]
  ...

The numbers don't mean anything in-and-of-themselves. The values don't represent qualities of the words, they're just numbers in relation to others in the training data. They're different numbers in different training data because all the words are scored in relation to each other, like a graph. So when you compute them you arrive at words and meanings in the training data as you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.

So the rage about a vector db is that it's a database for arrays of numbers (vectors) designed for computing them against each other, optimized for that instead of say a SQL or NoSQL which are all about retrieval etc.

So king vs prince etc. - When you take into account the 1536 numbers, you can imagine how compared to other words in training data they would actually be similar, always used in the same kinds of ways, and are indeed semantically similar - you'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc. but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".

jncfhnb2y ago

King doesn’t need to appear commonly with prince. It just needs to appear in the same context as prince.

It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.

sdwr2y ago

It doesn't even work as described for popularity - one word starts at 49,999 and one starts at 0.

im3w1l2y ago

It's a document embedding, not a word embedding.

zachrose2y ago

Maybe the idea is to order your vocabulary into some kind of “semantic rainbow”? Like a one-dimensional embedding?

dekhn2y ago

Is that really an embedding? I normally think of an embedding as an approximate lower-dimensional matrix of coefficients that operate on a reduced set of composite variables that map the data from a nonlinear to linear space.

thisiszilff2y ago

You're right that what I described isn't what people commonly think of as embeddings (given that we are more advanced now than the above description), but broadly an embedding is anything (in NLP at least) that maps text into a fixed-length vector. When you make an embedding like this, the nice thing is that cosine similarity has an easy-to-understand meaning: count the number of words two documents have in common (subject to some normalization constant).

Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.

mschulkind2y ago

Aren't you just describing a bag-of-words model?

https://en.wikipedia.org/wiki/Bag-of-words_model

thisiszilff2y ago

Yes! And the follow up that cosine similarity (for BoW) is a super simple similarity metric based on counting up the number of words the two vectors have in common.

afro882y ago

How does this enable cosine similarity usage? I don't get the link between incrementing a word's index by its count in a text and how this ends up giving words with similar meanings a high cosine similarity value.

twelfthnight2y ago

I think they are talking about bag-of-words. If you apply a dimensionality reduction technique like SVD or even random projection on bag-of-words, you can effectively create a basic embedding. Check out latent semantic indexing / latent semantic analysis.
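
A rough sketch of the random projection option mentioned above, using only the standard library. It assumes a dense Gaussian projection matrix scaled by `1/sqrt(out_dim)`; the dimensions (50 down to 8) and helper names are made up for illustration, and SVD/LSA would need a linear algebra library instead.

```python
import math
import random

def random_projection_matrix(in_dim, out_dim, seed=0):
    """out_dim rows of in_dim Gaussian entries, scaled to roughly preserve lengths."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(vec, matrix):
    """Multiply the projection matrix by a high-dimensional vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# A sparse 50-dimensional bag-of-words vector with two non-zero counts.
high_dim = [0.0] * 50
high_dim[3] = 2.0
high_dim[17] = 1.0

matrix = random_projection_matrix(in_dim=50, out_dim=8)
low_dim = project(high_dim, matrix)
print(len(low_dim))  # 8
```

The same matrix has to be reused for every document and query, otherwise the projected vectors are not comparable.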

sell_dennis2y ago

You're right, that approach doesn't enable getting embeddings for an individual word. But it would work for comparing similarity of documents - not that well of course, but it's a toy example that might feel more intuitive

Someone2y ago

I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.

What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.

I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc (https://en.wikipedia.org/wiki/Most_common_words_in_English)

A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.

vineyardmike2y ago

> I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.

You can simplify this with a map, and only store non-zero values, but also you can be inefficient: this is for learning. You can choose to store more valuable information than just word count. You can store any "feature" you want - various tags on a post, cohort topics for advertising, bucketed time stamps, etc.

For learning just storing word count gives you the mechanics you need of understanding vectors without actually involving neural networks and weights.
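
The "use a map and only store non-zero values" idea can be sketched like this, assuming word counts keyed by the word itself rather than by a numeric index. `sparse_embed` and `sparse_cosine` are illustrative names.

```python
import math
from collections import Counter

def sparse_embed(text):
    """Map each word to its count; words that never occur are simply absent."""
    return Counter(text.lower().split())

def sparse_cosine(a, b):
    """Cosine similarity over two sparse word->count maps."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

x = sparse_embed("the cat sat on the mat")
y = sparse_embed("the cat lay on the rug")
print(round(sparse_cosine(x, y), 2))  # 0.75
```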

> I also doubt your claim “and the results aren't totally terrible”.

> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc

(1) the comment suggested filtering out these words, and (2) the results aren't terrible. This is literally the first assignment in Stanford's AI class [1], and the results aren't terrible.

> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.

Wow that seems a lot more complicated for something that was supposed to be a learning exercise.

[1] https://stanford-cs221.github.io/autumn2023/assignments/sent...

_giorgio_2y ago

Embeddings must be trained, otherwise they don't have any meaning, and are just random numbers.

wjholden2y ago

Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.

thisiszilff2y ago

Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding.

Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.

minimaxir2y ago· 18 in thread

One of my biggest annoyances with the modern AI tooling hype is that you need to use a vector store for just working with embeddings. You don't.

Vector stores matter for production use cases mostly because of latency on larger data sets (100k+ records), but if you're working on a toy project just learning how to use embeddings, you can compute cosine similarity with a couple lines of numpy by doing a dot product of a normalized query vector with a matrix of normalized records.

Best of all, it gives you a reason to use Python's @ operator, which with numpy matrices does a dot product.
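
A sketch of that couple of lines, assuming NumPy is installed and the records fit in memory as one 2-D array. The tiny 2-D vectors are made-up illustration data, not real embeddings.

```python
import numpy as np

def top_k(query, records, k=3):
    """Return (indices, scores) of the k records most similar to the query."""
    # Normalize every row and the query to unit length.
    records_n = records / np.linalg.norm(records, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    # Matrix @ vector gives one cosine similarity per record.
    scores = records_n @ query_n
    order = np.argsort(-scores)[:k]
    return order, scores[order]

records = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.1])

idx, scores = top_k(query, records, k=2)
print(idx.tolist())  # [0, 2]
```

If the records are stored already normalized, the per-query work is just the single `@`.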

hereonout22y ago

100k records is still pretty small!

It feels a bit like the hype that happened with "big data". People ended up creating spark clusters to query a few million records. Or using Hadoop for a dataset you could process with awk.

Professionally I've only ever worked with dataset sizes in the region of low millions and have never needed specialist tooling to cope.

I assume these tools do serve a purpose but perhaps one that only kicks in at a scale approaching billions.

hot_gril2y ago

I've been in the "mid-sized" area a lot where Numpy etc cannot handle it, so I had to go to Postgres or more specialized tooling like Spark. But I always started with the simple thing and only moved up if it didn't suffice.

Similarly, I read how Postgres won't scale for a backend application and I should use Citus, Spanner, or some NoSQL thing. But that day has not yet arrived.

smahs2y ago

This sentiment is pretty common I guess. Outside of a niche, the massive scale for which a vast majority of the data tech was designed doesn't exist and KISS wins outright. Though I guess that's evolution, we want to test the limits in pursuit of grandeur before mastering the utility (ex. pyramids).

mritchie7122y ago

yeah, glad the hype around big data is dead. Not a lot of solid numbers in here, but this post covers it well[0].

We have duckdb embedded in our product[1] and it works perfectly well for billions of rows of data without the hadoop overhead.

0 - https://motherduck.com/blog/big-data-is-dead/

1 - https://www.definite.app/

ertgbnm2y ago

When I'm messing around, I normally have everything in a Pandas DataFrame already so I just add embeddings as a column and calculate cosine similarity on the fly. Even with a hundred thousand rows, it's fast enough to calculate before I can even move my eyes down on the screen to read the output.

I regret ever messing around with Pinecone for my tiny and infrequently used setups.

laurshelly2y ago

Could not agree more. For some reason Pandas seems to get phased out as developers advance.

m11172y ago

Actually, I had a pretty good experience with Pinecone.

christiangenco2y ago

Yup. I was just playing around with this in JavaScript yesterday and with ChatGPT's help it was surprisingly simple to go from text => embedding (via `openai.embeddings.create`) and then to compare the embedding similarity with the cosine distance (which ChatGPT wrote for me): https://gist.github.com/christiangenco/3e23925885e3127f2c177...

Seems like the next standard feature in every app is going to be natural language search powered by embeddings.

minimaxir2y ago

For posterity, OpenAI embeddings come pre-normalized so you can immediately dot-product.

Most embeddings providers do normalization by default, and SentenceTransformers has a normalize_embeddings parameter which does that. (it's a wrapper around PyTorch's F.normalize)
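
Why pre-normalized vectors let you skip straight to the dot product, as a small sketch: for unit-length vectors the denominator of cosine similarity is 1. The vectors here are made-up illustration data.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# Once both vectors have length 1, the plain dot product is the cosine similarity.
print(round(cosine(a, b), 2))                     # 0.96
print(round(dot(normalize(a), normalize(b)), 2))  # 0.96
```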

bryantwolfOP2y ago

As an individual, I love the idea of pushing to simplify even further to understand these core concepts. For the ecosystem, I like that vector stores make these features accessible to environments outside of Python.

simonw2y ago

If you ask ChatGPT to give you a cosine similarity function that works against two arrays of floating point numbers in any programming language you'll get the code that you need.

Here's one in JavaScript (my prompt was "cosine similarity function for two javascript arrays of floating point numbers"):

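    function cosineSimilarity(vecA, vecB) {
        // Cosine similarity is only defined for vectors of equal length.
        if (vecA.length !== vecB.length) {
            throw new Error("Vectors do not have the same dimensions");
        }
        let dotProduct = 0.0;
        let normA = 0.0; // accumulates the squared length of vecA
        let normB = 0.0; // accumulates the squared length of vecB
        for (let i = 0; i < vecA.length; i++) {
            dotProduct += vecA[i] * vecB[i];
            normA += vecA[i] ** 2;
            normB += vecB[i] ** 2;
        }
        // A zero vector has no direction, so the similarity is undefined.
        if (normA === 0 || normB === 0) {
            throw new Error("One of the vectors is zero, cannot compute similarity");
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }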

Vector stores really aren't necessary if you're dealing with less than a few hundred thousand vectors - load them up in a bunch of in-memory arrays and run a function like that against them using brute-force.
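
The brute-force approach can be sketched in Python as well, assuming the vectors live in an in-memory dict keyed by some ID. The `nearest` helper and the tiny 2-D vectors are illustrative only.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Same logic as the JavaScript function above."""
    if len(a) != len(b):
        raise ValueError("Vectors do not have the same dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("One of the vectors is zero, cannot compute similarity")
    return dot / (norm_a * norm_b)

def nearest(query, items, k=3):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in items.items())
    return heapq.nlargest(k, scored)

items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = nearest([1.0, 0.1], items, k=2)
print([key for _, key in results])  # ['a', 'c']
```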

twelfthnight2y ago

Even in production my guess is most teams would be better off just rolling their own embedding model (huggingface) + caching (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I suppose there is some expertise needed, but working with a vector database vendor has major drawbacks too.

danielbln2y ago

Or you just shove it into Postgres + pgvector and just use the DBMS you already use anyway.

hackernoteng2y ago

Using Postgres with pgvector is trivial and cheap. It's also available on AWS RDS.

leobg2y ago

hnswlib, usearch. Both handle tens of millions of vectors easily. The latter even without holding them in RAM.

itronitron2y ago

Does anyone know the provenance for when vectors started to be called embeddings?

nmfisher2y ago

In an NLP context, earliest I could find was ICML 2008:

http://machinelearning.org/archive/icml2008/papers/391.pdf

I'm sure there are earlier instances, though - the strict mathematical definition of embedding has surely been around for a lot longer.

(interestingly, the word2vec papers don't use the term either, so I guess it didn't enter "common" usage until the mid-late 2010s)

minimaxir2y ago

I think it was due to GloVe embeddings back then: I don't recall them ever being called GloVe vectors, although the "Ve" does stand for vector so it could have been RAS syndrome.

dullcrisp2y ago· 13 in thread

Is there any easy way to run the embedding logic locally? Maybe even locally to the database? My understanding is that they’re hitting OpenAI’s API to get the embedding for each search query and then storing that in the database. I wouldn’t want my search function to be dependent on OpenAI if I could help it.

jmorgan2y ago

Support for _some_ embedding models works in Ollama (and llama.cpp - BERT models specifically)

  ollama pull all-minilm

  curl http://localhost:11434/api/embeddings -d '{
    "model": "all-minilm",
    "prompt": "Here is an article about llamas..."
  }'

Embedding models run quite well even on CPU since they are smaller models. There are other implementations with a library form factor like transformers.js https://xenova.github.io/transformers.js/ and sentence-transformers https://pypi.org/project/sentence-transformers/
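
The same call from Python, as a sketch using only the standard library. It assumes an Ollama server is listening on the URL from the curl example and that the JSON response carries the vector in an `embedding` field; check the API docs for your version. `build_request` and `embed` are illustrative names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_request(prompt, model="all-minilm", url=OLLAMA_URL):
    """Build the same POST request as the curl example above."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(prompt, model="all-minilm"):
    """Send the request to a locally running server and return the vector."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        # Assumption: the response body looks like {"embedding": [...]}.
        return json.loads(resp.read())["embedding"]

# Requires `ollama pull all-minilm` and a running server:
# vector = embed("Here is an article about llamas...")
```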

laktek2y ago

If you are building with the Supabase stack (Postgres as DB with pgvector), we just released a built-in embedding generation API yesterday. This works both locally (on CPUs) and you can deploy it without any modifications.

Check this video on building Semantic Search in Supabase: https://youtu.be/w4Rr_1whU-U

Also, the blog on announcement with links to text versions of the tutorials: https://supabase.com/blog/ai-inference-now-available-in-supa...

jonplackett2y ago

So handy! I already got some embeddings working with supabase pgvector and OpenAI and it worked great.

What would the cost of running this be like compared to the OpenAI embedding api?

_bramses2y ago

neat! one thing i’d really love tooling for: supporting multi user apps where each has their own siloed data and embeddings. i find myself having to set up databases from scratch for all my clients, which results in a lot of repetitive work. i’d love to have the ability one day to easily add users to the same db and let them get to embedding without having to have any knowledge going in

ngalstyan42y ago

We provide this functionality in Lantern cloud via our Lantern Extras extension: <https://github.com/lanterndata/lantern_extras>

You can generate CLIP embeddings locally on the DB server via:

  SELECT abstract,
       introduction,
       figure1,
       clip_text(abstract) AS abstract_ai,
       clip_text(introduction) AS introduction_ai,
       clip_image(figure1) AS figure1_ai
  INTO papers_augmented
  FROM papers;

Then you can search for embeddings via:

  SELECT abstract, introduction FROM papers_augmented ORDER BY clip_text(query) <=> abstract_ai LIMIT 10;

The approach significantly decreases search latency and results in cleaner code. As an added bonus, EXPLAIN ANALYZE can now tell percentage of time spent in embedding generation vs search.

The linked library enables embedding generation for a dozen open source models and proprietary APIs (list here: <https://lantern.dev/docs/develop/generate>), and adding new ones is really easy.

charlieyuan2y ago

Lantern seems really cool! Interestingly we did try CLIP (openclip) image embeddings but the results were poor for 24px by 24px icons. Any ideas?

Charlie @ v0.app

simonw2y ago

There are a bunch of embedding models you can run on your own machine. My LLM tool had plugins for some of those:

- https://llm.datasette.io/en/stable/plugins/directory.html#em...

Here's how to use them: https://simonwillison.net/2023/Sep/4/llm-embeddings/

dvt2y ago

Yes, I use fastembed-rs[1] in a project I'm working on and it runs flawlessly. You can store the embeddings in any boring database (it's just an array of f32s at the end of the day). But for fast vector math (which you need for similarity search), a vector database is recommended, e.g. the pgvector[2] postgres extension.

[1] https://github.com/Anush008/fastembed-rs

[2] https://github.com/pgvector/pgvector

J_Shelby_J2y ago

Fun timing!

I literally just published my first crate: candle_embed[1]

It uses Candle under the hood (the crate is more of a user friendly wrapper) and lets you use any model on HF like the new SoTA model from Snowflake[2].

[1] https://github.com/ShelbyJenkins/candle_embed

[2] https://huggingface.co/Snowflake/snowflake-arctic-embed-l

jonnycoder2y ago

The MTEB leaderboard has you covered. That is a go-to for finding the leading embedding models and I believe many of them can run locally.

https://huggingface.co/spaces/mteb/leaderboard

bryantwolfOP2y ago

This is a good call out. OpenAI embeddings were simple to stand up, pretty good, cheap at this scale, and accessible to everyone. I think that makes them a good starting point for many people. That said, they're closed-source, and there are open-source embeddings you can run on your infrastructure to reduce external dependencies.

notakash2y ago

If you're building an iOS app, I've had success storing vectors in coredata and using a tiny coreml model that runs on device for embedding and then doing cosine similarity.

internet1010102y ago

Open WebUI has langchain built-in and integrates perfectly with ollama. They have several variations of docker compose files on their github.

https://github.com/open-webui/open-webui

crowcroft2y ago· 12 in thread

My smooth brain might not understand this properly, but the idea is we generate embeddings, store them, then use retrieval each time we want to use them.

For simple things we might not need to worry about storing much, we can generate the embeddings and just cache them or send them straight to retrieval as an array or something...

The storing of embeddings seems the hard part, do I need a special database or PG extension? Is there any reason I can't store them as blobs in SQLite if I don't have THAT much data, and I don't care too much about speed? Do embeddings generated ever 'expire'?

bryantwolfOP2y ago

You'd have to update the embedding every time the data used to generate it changes. For example, if you had an embedding for user profiles and they updated their bio, you would want to make a new embedding.

I don't expect to have to change the embeddings for each icon all that often, so storing them seemed like a good choice. However, you probably don't need to cache the embedding for each search query since there will be long-tail ones that don't change that much.

The reason to use pgvector over blobs is if you want to use the distance functions in your queries.
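
One way to sketch the "re-embed when the source data changes" rule is to store a hash of the text next to the embedding and compare it on update. Everything here is hypothetical: the record layout, the field names, and `fake_embed`, which stands in for a real embedding call.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of the text an embedding was generated from."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def refresh_embedding(record, embed_fn):
    """Re-embed record['bio'] only if it changed since the last embedding.

    Returns True when a new embedding was generated.
    """
    current = content_hash(record["bio"])
    if record.get("embedding_hash") == current:
        return False
    record["embedding"] = embed_fn(record["bio"])
    record["embedding_hash"] = current
    return True

calls = []

def fake_embed(text):
    """Stand-in for a real embedding model or API call."""
    calls.append(text)
    return [float(len(text))]

profile = {"bio": "Likes icons."}
refresh_embedding(profile, fake_embed)   # embeds
refresh_embedding(profile, fake_embed)   # unchanged, skipped
profile["bio"] = "Likes icons and fonts."
refresh_embedding(profile, fake_embed)   # bio changed, embeds again
print(len(calls))  # 2
```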

kmeisthax2y ago

Yes, you can shove the embeddings in a BLOB, but then you can't do the kinds of query operations you expect to be able to do with embeddings.

simonw2y ago

You can run similarity scores with a custom SQLite function.

I use a Python one usually, but it's also possible to build a much faster one in C: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
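
A rough sketch of the Python variant, assuming embeddings are stored as little-endian float32 BLOBs. The table layout, the function name `cosine_similarity`, and the 2-D vectors are made up for illustration.

```python
import math
import sqlite3
import struct

def pack(vec):
    """Encode a list of floats as a compact float32 BLOB."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"<{len(blob) // 4}f", blob)

def cosine_blob(blob_a, blob_b):
    """Cosine similarity between two float32 BLOBs; called from SQL."""
    a, b = unpack(blob_a), unpack(blob_b)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("cosine_similarity", 2, cosine_blob)
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, embedding BLOB)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("a", pack([1.0, 0.0])), ("b", pack([0.0, 1.0])), ("c", pack([1.0, 1.0]))],
)

rows = conn.execute(
    "SELECT id, cosine_similarity(embedding, ?) AS score "
    "FROM docs ORDER BY score DESC LIMIT 2",
    (pack([1.0, 0.2]),),
).fetchall()
print([row[0] for row in rows])  # ['a', 'c']
```

This is a full table scan per query, which is the trade-off the rest of the thread is discussing.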

crowcroft2y ago

Right, like you could use it sort of like a cache and send the blobs to OpenAI to use their similarity API, but you couldn't really use SQL to do cosine similarity operations?

My understanding of what's going on at a technical level might be a bit limited.

laborcontract2y ago

A KV store is both good enough and highly performant. I use Redis for storing embeddings and expire them after a while. Unless you have a highly specialized use case it’s not economical to persistently store chunk embeddings.

Redis also has vector search capability. However, the most popular answer you’ll get here is to use Postgres (pgvector).

crowcroft2y ago

Redis sounds like a good option. I like that it’s not more infrastructure, I already have redis setup for my app so I’m not adding more to the stack.

H1Supreme2y ago

Vector databases are used to store embeddings.

crowcroft2y ago

But why is that? I’m sure it’s the ‘best’ way to do things, but it also means more infrastructure which for simple apps isn’t worth the hassle.

I should use redis for queues but often I’ll just use a table in a SQLite database. For small scale projects I find it works fine, I’m wondering what an equivalent simple option for embeddings would be.

chuckhend2y ago

check out https://github.com/tembo-io/pg_vectorize - we're taking it a little bit beyond just the storage and index. The project uses pgvector for the indices and distance operators, but also adds a simpler API, hooks into pre-trained embedding models, and helps you keep embeddings updated as data changes/grows

alexgarcia-xyz2y ago

Re storing vectors in BLOB columns: ya, if it's not a lot of data and it's fast enough for you, then there's no problem doing it like that. I'd even just store them in JSON/npy files first and see how long you can get away with it. Once that gets too slow, then try SQLite/redis/valkey, and when that gets too slow, look into pgvector or other vector database solutions.

For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension vector from OpenAI would take 1536 * 4 = 6144 bytes of space, if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096 bytes, so the extra data will spill into overflow pages. Again, that isn't too big of a deal, but if the original table had small values before, then table scans can be slower.
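
The size arithmetic can be checked with a few lines, assuming "compact BLOB format" means packed 4-byte floats. `to_blob` and `from_blob` are illustrative helper names.

```python
import struct

DEFAULT_PAGE_SIZE = 4096  # bytes, the SQLite default mentioned above

def to_blob(vec):
    """Pack floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vec)}f", *vec)

def from_blob(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

blob = to_blob([0.0] * 1536)
print(len(blob))                      # 6144
print(len(blob) > DEFAULT_PAGE_SIZE)  # True, so it cannot fit in one default page
```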

One solution is to move it to a separate table, ex on an original `users` table, you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN that when you need it. Or you can use new techniques like Matryoshka embeddings[0] or scalar/binary quantization[1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.
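
The separate-table option might look like this in practice. It is only a sketch: the column types, the sample rows, and the 16-byte placeholder BLOB are assumptions, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    -- Embeddings live in their own table so scans of users stay small.
    CREATE TABLE users_embeddings (
        user_id INTEGER PRIMARY KEY REFERENCES users(id),
        embedding BLOB
    );
    """
)
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "grace")])
# Only user 1 has an embedding so far (placeholder bytes, not a real vector).
conn.execute("INSERT INTO users_embeddings VALUES (?, ?)", (1, b"\x00" * 16))

rows = conn.execute(
    """
    SELECT u.id, u.name, e.embedding
    FROM users u
    LEFT JOIN users_embeddings e ON e.user_id = u.id
    ORDER BY u.id
    """
).fetchall()

# LEFT JOIN keeps users without an embedding; their embedding comes back as None.
print([(uid, name, emb is not None) for uid, name, emb in rows])
# [(1, 'ada', True), (2, 'grace', False)]
```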

I also have a SQLite extension for vector search[2], but there are a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in .npy files" and "use pgvector".

Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs that have temperatures/meta prompts/a million other dials that make outputs non-deterministic; most embedding models should be deterministic and should work forever.

[0] https://huggingface.co/blog/matryoshka

[1] https://huggingface.co/blog/embedding-quantization

[2] https://github.com/asg017/sqlite-vss

crowcroft2y ago

This is very useful appreciate the insight. Storing embeddings in a table and joining when needed feels like a really nice solution for what I'm trying to do.

simonw2y ago

I store them as blobs in SQLite. It works fine - depending on the model they take up 1-2KB each.

Imnimo2y ago· 10 in thread

One of the challenges here is handling homonyms. If I search in the app for "king", most of the top ten results are "ruler" icons - showing a measuring stick. Rodent returns mostly computer mice, etc.

https://www.v0.app/search?q=king

https://www.v0.app/search?q=rodent

This isn't a criticism of the app - I'd rather get a few funny mismatches in exchange for being able to find related icons. But it's an interesting puzzle to think about.

itronitron2y ago

>> If I search in the app for "king", most of the top ten results are "ruler" icons

I believe that's the measure of a man.

charlieyuan2y ago

Good call out! We think of this as a two part problem.

1. The intent of the user. Is it a description of the look of the icon or the utility of the icon?

2. How best to rank the results, which is a combination of intent, CTR of past search queries, bootstrapping popularity via usage on open source projects, etc.

- Charlie of v0.app

joshspankit2y ago

This is imo the worst part of embedding search.

Somehow Amazon continues to be the leader in muddy results which is a sign that it’s a huge problem domain and not easily fixable even if you have massive resources.

Aachen2y ago

I don't seem to have this issue on any other webshop that uses normal keyword searches and always wondered what Amazon did to mess it up so much and why people use that website (also for other reasons, but search is definitely one of them: no way to search properly). The answer isn't always "massive resources" towards being more hi-tech

But, thanks, this explains a lot about Amazon's search results and might help me steer it if I need to use it in the future :)

dceddia2y ago

I was reading this article and thinking about things like, in the case of doing transcription, if you heard the spoken word “sign” in isolation you couldn’t be sure whether it meant road sign, spiritual sign, +/- sign, or even the sine function. This seems like a similar problem where you pretty much require context to make a good guess, otherwise the best it could do is go off of how many times the word appears in the dataset right? Is there something smarter it could do?

anon3738392y ago

Wouldn’t it help to provide affordances guiding the user to submit a question rather than a keyword? Then, “Why are kings selected by primogeniture?” probably wouldn’t be near passages about measuring sticks in the embedding space. (Of course, this idea doesn’t work for icon search.)

feoren2y ago

Only if you have an attention mechanism.

lubesGordi2y ago

I think this is the point of the Attention portion in an LLM, to use context to skew the embedding result closer to what you're looking for.

It does seem a little strange 'ruler' would be closer to 'king' versus something like 'crown'.

bryantwolfOP2y ago

Yeah, these can be cute, but they're not ideal. I think the user feedback mechanism could help naturally align this over time, but it would also be gameable. It's all interesting stuff

jonnycoder2y ago

As the op, you can do both semantic search (embedding) and keyword search. Some RAG techniques call out using both for better results. Nice product by the way!
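
One common way to combine the two result lists is reciprocal rank fusion. The comment above doesn't name a method, so treat this as just one option; the icon names and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list.

    Each list contributes 1 / (k + rank) per item, so items ranked well
    by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "king" from each search.
semantic_results = ["crown", "ruler", "throne"]
keyword_results = ["crown", "king-chess", "throne"]

fused = reciprocal_rank_fusion([semantic_results, keyword_results])
print(fused)  # ['crown', 'throne', 'king-chess', 'ruler']
```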


m11172y ago· 6 in thread

ah, pgvector is kind of annoying to start with: you have to set it up and maintain it, and then it starts falling apart when you have more vectors

sdesol2y ago

Can you elaborate more on the falling apart? I can see pgvector being intimidating for users with no experience standing up a DB, but I don't see how Postgres or pgvector would fall apart. Note, my reason for asking is I'm planning on going all in with Postgres, so pgvector makes sense for me.

cargobuild2y ago

https://www.pinecone.io/blog/pinecone-vs-pgvector/ check it out :)

hackernoteng2y ago

What is "more vectors"? How many are we talking about? We've been using pgvector in production for more than a year without any issues. We don't have a ton of vectors, fewer than 100,000, and we filter queries by other fields, so our total per cosine function is probably more like a max of 5,000. Performance is fine and we've had no issues.

chuckhend2y ago

Take a look at https://github.com/tembo-io/pg_vectorize. It makes it a lot easier to get started. It runs on pgvector, but as a user, it's completely abstracted from you. It also provides you with a way to auto-update embeddings as you add new data or update existing source data.

l1am02y ago

This is good until it isn’t. Tried to get it working for 4 hours and it just did not.

And then I had an important architectural gotcha moment: I want my database to be dumb. Its purpose is to store and query data in an efficient and ACID way.

Adding cronjobs and HTTP calls to the database is a bad idea.

I love the simplicity, and that it helps keep embeddings up to date (if it works), but I decided not to treat my database as an application.

ntry012y ago

on the other hand, if you have postgres already, it may be easier to add pgvector than to add another dependency to your stack (especially if you are using something like supabase)

another benefit is that you can easily filter your embeddings by other fields, so everything is kept in one place, which could help with performance

it's a good place to start in those cases and if it is successful and you need extreme performance you can always move to other specialized tools like qdrant, pinecone or weaviate which were purpose-built for vectors

gchadwick2y ago· 4 in thread

For an article extolling the benefits of embeddings for developers looking to dip their toe into the waters of AI, it's odd they don't actually have an intro to embeddings or to vector databases. They just assume the reader already knows these concepts and dive right in to how they use them.

Sure many do know these concepts already but they're probably not the people wondering about a 'good starting point for the AI curious app developer'.

simonw2y ago

I published this pretty comprehensive intro to embeddings last year: https://simonwillison.net/2023/Oct/23/embeddings/

nicbou2y ago

I found many of your other posts and they were the spark that finally made me "get it" and look deeper into LLMs. This post looks like another slam dunk.

Keep up the good work!

gk12y ago

To add to the other recommendations, here's a primer on vector DB's: https://www.pinecone.io/learn/vector-database/

charlieyuan2y ago

Apologies!

Here's a good primer on embeddings from openai: https://platform.openai.com/docs/guides/embeddings

hot_gril2y ago· 3 in thread

This is where I got started too. GloVe embeddings stored in Postgres.

Pgvector is nice, and it's cool seeing quick tutorials using it. Back then, we only had cube, which didn't do cosine similarity indexing out of the box (you had to normalize vectors and use euclidean indexes) and only supported up to 100 dimensions. And there were maybe other inconveniences I don't remember, cause front page AI tutorials weren't using it.
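For anyone wondering why the normalize-then-Euclidean workaround gives cosine results: for unit-length vectors, ||a - b||^2 = 2 - 2·cos(a, b), so sorting by Euclidean distance on normalized vectors produces the same order as sorting by cosine similarity. A minimal pure-Python sketch of that identity follows; the toy vectors and helper names are made up for illustration and are not part of cube's or pgvector's API.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Plain Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vectors standing in for stored embeddings.
query = [1.0, 2.0, 0.5]
docs = {"a": [2.0, 4.0, 1.0], "b": [0.0, 1.0, 3.0], "c": [1.0, 0.0, 0.0]}

# Ranking 1: highest cosine similarity first.
by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)

# Ranking 2: smallest Euclidean distance first, after normalizing everything.
q = normalize(query)
by_euclid = sorted(docs, key=lambda k: euclidean(q, normalize(docs[k])))
```

Both rankings come out in the same order, which is the property a Euclidean-only index needs in order to serve cosine queries.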

isoprophlex2y ago

PGvector is very nice indeed. And you get to store your vectors close to the rest of your data. I'm yet to understand the unique use case for dedicated vector dbs. It seems so annoying, having to query your vectors in a separate database without being able to easily join/filter based on the rest of your tables.

I stored ~6 million Hacker News posts, their metadata, and the vector embeddings on a cheap $20/month VM running pgvector. Querying is very fast. Maybe there's some penalty to pay when you get to the billion+ row counts, but I'm happy so far.

brianjking2y ago

As I'm trying to work on some pricing info for PGVector - can you share some more info about the hacker news posts you've embedded?

* Which embedding model? (or number of dimensions)

* When you say 6 million posts - it's just the URL of the post, title, and author, or do you mean you've also embedded the linked URL (be it HN or elsewhere)?

Cheers!

hot_gril2y ago

You can also store vectors or matrices in a split-up fashion as separate rows in a table, which is particularly useful if they're sparse. I've handled huge sparse matrix expressions (add, subtract, multiply, transpose) that way, cause numpy couldn't deal with them.
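A rough sketch of the row-per-entry idea, assuming a layout of one (row, column, value) tuple per non-zero entry. SQLite through Python's sqlite3 module stands in here for whatever database was actually used, and the table names `a` and `b` are hypothetical. Multiplication becomes a join on the shared inner index, and transpose is a column swap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (i INTEGER, j INTEGER, val REAL)")
conn.execute("CREATE TABLE b (i INTEGER, j INTEGER, val REAL)")

# Only non-zero entries are stored. A is 2x3 and B is 3x2 (toy data).
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", [(0, 0, 2.0), (1, 2, 3.0)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", [(0, 1, 4.0), (2, 0, 5.0)])

# C = A x B: join A's column index to B's row index and sum the products.
product = conn.execute(
    """
    SELECT a.i, b.j, SUM(a.val * b.val) AS val
    FROM a JOIN b ON a.j = b.i
    GROUP BY a.i, b.j
    ORDER BY a.i, b.j
    """
).fetchall()

# Transpose: swap the two index columns.
a_transposed = conn.execute("SELECT j, i, val FROM a ORDER BY j, i").fetchall()
```

Entries that are zero never appear in either table, so the join only touches pairs that can contribute to the result.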

thorum2y ago· 3 in thread

Can embeddings be used to capture stylistic features of text, rather than semantic? Like writing style?

levocardia2y ago

Probably, but you might need something more sophisticated than cosine distance. For example, you might take a dataset of business letters, diary entries, and fiction stories and train some classifier on top of the embeddings of each of the three types of text, then run (embeddings --> your classifier) on new text. But at that point you might just want to ask an LLM directly with a prompt like - "Classify the style of the following text as business, personal, or fiction: $YOUR TEXT$"
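As a very small illustration of "a classifier on top of the embeddings", here is a nearest-centroid classifier, chosen only because it is about the simplest classifier there is; a real setup would more likely use logistic regression or similar. The 3-dimensional vectors are hypothetical stand-ins, since real embeddings would come from an embedding model and have hundreds of dimensions.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical labelled "embeddings" for the three styles of text.
training = {
    "business": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "personal": [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0]],
    "fiction": [[0.0, 0.1, 0.9], [0.1, 0.2, 0.8]],
}

# "Training" is just averaging the examples of each class.
centroids = {label: centroid(vecs) for label, vecs in training.items()}

def classify(embedding):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))
```

Usage would be `classify(embed(new_text))`, where `embed` is whatever embedding call you already have (not shown here).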

vladimirzaytsev2y ago

You may get way more accurate results from relatively small models as well as logits for each class if you ask one question per class instead.

vladimirzaytsev2y ago

Likely not, embeddings are very crude. An embedding of a text is just an average of the "meanings" of its words.

As is, embeddings lack a lot of the tricks that made transformers so efficient.

mrkeen2y ago· 2 in thread

Given

  not because they’re sufficiently advanced technology indistinguishable from magic, but the opposite.

  Unlike LLMs, working with embeddings feels like regular deterministic code.

  <h3>Creating embeddings</h3>

I was hoping for a bit more than:

  They’re a bit of a black box

  Next, we chose an embedding model. OpenAI’s embedding models will probably work just fine.

Aachen2y ago

Same here. I was saving the article for when I have a few hours to really dive into it, build upon it, learn from seeing and doing. Imagine my disappointment when I had the evening cleared, started reading, and discovered that all they're showing is how to concatenate a string, download someone else's black box model which outputs the similarity between the user's query and the concatenated info about each object, and then write queries on the output.

It's good to know you can do this performantly on your own system, but if the article had started out with "look, this model can output similarity between two texts and we can make a search engine with that", that'd be much more up front about what to expect to learn from it.

Edit: another comment mentioned you can't even run it yourself, you need to ask ClosedAI for every search query a user does on your website. WTF is this article? At that point you might as well pipe the query into general-purpose ChatGPT, which everyone already knows, and let that sort it out.

akoboldfrying2y ago

I agree. The article was useful insofar as it detailed the steps they took to solve their problem clearly, and it's easy to see that many common problems are similar and could therefore be solved similarly, but I went in expecting more insight. How are the strings turned into arrays of numbers? Why does turning them into numbers that way lead to these nice properties?

primitivesuave2y ago· 2 in thread

I learned how to use embeddings by building semantic search for the Bhagavad Gita. I simply saved the embeddings for all 700 verses into a big file which is stored in a Lambda function, and compared against incoming queries with a single query to OpenAI's embedding endpoint.
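The "embeddings in one big file" approach can be sketched roughly like this, assuming a JSON file that maps an id to its vector. The file name, ids, and 3-dimensional vectors are hypothetical, and the query vector is hard-coded where a real app would call its embedding endpoint.

```python
import json
import math
import os
import tempfile

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical precomputed embeddings, written once at build time.
verses = {
    "verse-1": [0.9, 0.1, 0.0],
    "verse-2": [0.1, 0.9, 0.2],
    "verse-3": [0.0, 0.2, 0.9],
}
path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
with open(path, "w") as f:
    json.dump(verses, f)

def search(query_embedding, path, top_k=2):
    """Load every stored vector and return the ids of the top_k most similar."""
    with open(path) as f:
        stored = json.load(f)
    ranked = sorted(
        stored.items(),
        key=lambda kv: cosine(query_embedding, kv[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:top_k]]

# Stand-in for the embedding of an incoming query.
results = search([0.2, 0.8, 0.1], path)
```

With a few hundred items, a full scan like this is a handful of multiplications per request, so no index or vector database is involved.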

Shameless plug in case anyone wants to test it out - https://gita.pub

forgingahead2y ago

Really nice and beautiful site!

primitivesuave2y ago

Thank you! :)

cargobuild2y ago· 2 in thread

seeing comments about using pgvector... at pinecone, we spent some time understanding its limitations and pain points. pinecone eliminates these pain points entirely and makes things simple at any scale. check it out: https://www.pinecone.io/blog/pinecone-vs-pgvector/

gregorymichael2y ago

Has Pinecone gotten any cheaper? Last time I tried it was $75/month for the starter plan / single vector store.

cargobuild2y ago

yep. pinecone serverless has reduced costs significantly for many workloads.

aidenn02y ago· 2 in thread

Can someone give a qualitative explanation of what the vector of a word with 2 unrelated meanings would look like compared to the vector of a synonym of each of those meanings?

base6982y ago

If you think about it like a point on a graph, and the vectors as just 2D points (x,y), then the synonyms would be close and the unrelated meanings would be further away.

aidenn02y ago

I'm guessing 2 dimensions isn't enough for this.

Here's a concrete example: "bow" would need to be close to "ribbon" (as in a bow on a present) and also close to "gun" (as a weapon that shoots a projectile), but "ribbon" and "gun" would seem to need to be far from each other. How does something like word2vec resolve this? Any transitive relationship would seem to fall afoul of this.
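One way to picture it, offered as an illustration rather than a claim about what word2vec actually learns: a static model gives each word a single vector, so a word with two senses tends to land somewhere between them. Cosine similarity is not transitive, so "bow" can be moderately similar to both "ribbon" and "gun" while those two stay unrelated. The vectors and axis meanings below are hypothetical.

```python
import math

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unit vectors: axis 0 ~ "gift wrapping", axis 1 ~ "weapons".
ribbon = [1.0, 0.0]
gun = [0.0, 1.0]
bow = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # halfway between the two senses

sim_bow_ribbon = dot(bow, ribbon)  # about 0.71
sim_bow_gun = dot(bow, gun)        # about 0.71
sim_ribbon_gun = dot(ribbon, gun)  # 0.0
```

The catch is that the blended vector is not a perfect match for either sense, which is the kind of ambiguity that context-aware models try to resolve.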


benreesman2y ago· 1 in thread

Without getting into any big debates about whether or not RAG is medium-term interesting or whatever, you can ‘pip install sentence-transformers faiss’ and just immediately start having fun. I recommend using straightforward cosine similarity to just crush the NYT’s recommender as a fun project for two reasons: there’s an API and plenty of corpus, and it’s like, whoa, that’s better than the New York Times.

James Briggs has great stuff on this: https://youtube.com/@jamesbriggs

He’s trying to sell a SaaS product (Pinecone), but he’s doing it the right way: it’s ok to be an influencer if you know what you’re talking about.

aeth0s2y ago

> crush the NYT’s recommender as a fun project for two reasons

Could you share what recommender you're referring to here, and how you can evaluate "crushing" it?

Sounds fun!

clementmas2y ago· 1 in thread

Embeddings are indeed a good starting point. Next step is choosing the model and the database. The comments here have been taken over by database companies so I'm skeptical about the opinions. I wish MySQL had a cosine search feature built in

bootsmann2y ago

pg_vector has you covered

patrick-fitz2y ago· 1 in thread

Nice project! I find it can be hard to think of an idea that is well suited to AI. Using embeddings for search is definitely a good option to start with.

ParanoidShroom2y ago

I made a reverse image search when I learned about embeddings. It's pretty fun to work with images https://medium.com/@christophe.smet1/finding-dirty-xtc-with-...

dvaun2y ago· 1 in thread

I’d love to build a suite of local tooling to play around with different embedding approaches.

I’ve had great results using SentenceTransformers for quick one-off tasks at work for unique data asks.

I’m curious about clustering within the embeddings and seeing what different approaches can yield and what applications they work best for.

PaulHoule2y ago

If I take 50,000 historical articles and 5,000 new articles, apply SBERT, and then run k-means with N=20, I get great results: articles about Ukraine, sports, chemistry, and nerdcore from Lobsters end up in distinct clusters.

I’ve used DBSCAN for finding duplicate content, but this is less successful. With the parameters I am using it is rare for there to be a false positive, but there aren’t that many true positives. I’m sure I could do better if I tuned it, but I’m not sure there is an operating point I’d really like.

KasianFranks2y ago· 1 in thread

They are named 'feature' vectors with scored attributes, similar to associative arrays. Just ask M. I. Jordan, D. Blei, S. Mian or A. Ng.

jerrygenser2y ago

They are embedded into a particular semantic vector space that is learned by a model. Another feature vector could be hand-rolled through feature engineering, TF-IDF n-grams, etc. Embedding is typically distinct from manual feature engineering.

voxelc4L2y ago

It begs the question though, doesn't it...? Embeddings require a neural network or some reasonable facsimile to produce the embedding in the first place. Compression to a vector (a semantic space of some sort) still needs to happen – and that's the crux of the understanding/meaning. To just say "embeddings are cool let's use them" is ignoring the core problem of semantics/meaning/information-in-context etc. Knowing where an embedding came from is pretty damn important.

Embeddings live a very biased existence. They are the product of a network (or some algorithm) that was trained (or built) with specific data (and/or code) and assume particular biases intrinsically (network structure/algorithm) or extrinsically (e.g., data used to train a network) which they impose on the translation of data into some n-dimensional space. Any engineered solution always lives with such limitations, but with the advent of more and more sophisticated methods for the generation of them, I feel like it's becoming more about the result than the process. This strikes me as problematic on a global scale... might be fine for local problems but could be not-so-great in an ever changing world.

suprgeek2y ago

Great project and excellent initiative to learn about embeddings. Two possible avenues to explore more. Your system backend could be thought of as being composed of two parts: |Icons->Embedder->|PGVector|->Retriever->Display Result|

1. In the embedder part, try out different embedding models and/or vector dimensions to see whether Recall@K and Precision@K for your data set (icons) improve. Models make a surprising amount of difference to the quality of the results. Try the MTEB Leaderboard for ideas on which models to explore.

2. In the Information Retriever part you can try a couple of approaches:

a. After you retrieve from PGVector, see if you can use a reranker like Cohere to get better results: https://cohere.com/blog/rerank

b. You could try a "fusion ranking" similar to the one you do, but structured such that 50% of the weight is for a plain old keyword search in the metadata and 50% is for the embedding-based search.
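A 50/50 fusion along those lines could be sketched as below. The keyword scorer is deliberately crude, the icon metadata and similarity numbers are made up, and the embedding scores are assumed to be precomputed and already scaled to the range 0 to 1; none of this is the app's actual ranking code.

```python
def keyword_score(query, text):
    """Fraction of query words that appear in the metadata text."""
    words = query.lower().split()
    if not words:
        return 0.0
    text_words = set(text.lower().split())
    return sum(w in text_words for w in words) / len(words)

def fused_ranking(query, items, embedding_scores, keyword_weight=0.5):
    """Rank item ids by a weighted sum of keyword and embedding scores."""
    scores = {}
    for item_id, metadata in items.items():
        kw = keyword_score(query, metadata)
        emb = embedding_scores[item_id]
        scores[item_id] = keyword_weight * kw + (1 - keyword_weight) * emb
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical icon metadata and embedding similarities for the query "king".
items = {
    "crown": "king crown royal",
    "ruler": "ruler measure length",
    "chess": "chess king piece",
}
embedding_scores = {"crown": 0.80, "ruler": 0.85, "chess": 0.60}

ranking = fused_ranking("king", items, embedding_scores)
```

In this toy data the measuring-stick "ruler" has the highest embedding score but drops to last place once the keyword half of the score is added, which is the effect the fusion is meant to have.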

Finally something more interesting to noodle on - what if the embeddings were based on the icon images and the model knew how to search for a textual descriptions in the latent space?
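For the Recall@K and Precision@K comparison in point 1, the two metrics are short enough to write by hand. This sketch assumes you have, for each test query, the ranked list your search returned and a hand-labelled set of icons that count as relevant; the icon names are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k results that are relevant."""
    top_k = retrieved[:k]
    return sum(item in relevant for item in top_k) / k

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant items that show up in the top-k results."""
    top_k = retrieved[:k]
    return sum(item in relevant for item in top_k) / len(relevant)

# Hypothetical results for the query "king" and a hand-labelled relevant set.
retrieved = ["crown", "ruler", "throne", "tape-measure", "chess-king"]
relevant = {"crown", "throne", "chess-king", "scepter"}

p3 = precision_at_k(retrieved, relevant, 3)  # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, 3)     # 2 of the 4 relevant items found
```

Averaging these over a fixed set of test queries gives one number per model, which makes swapping embedding models an apples-to-apples comparison.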

adamgordonbell2y ago

My problem with this is that it doesn't explain a lot.

You can manually make a vector of a word and then stepwise get up to the word2vec approach and then document embeddings. My post[1] does some of the first part and this great word2vec post[2] dives into it in more detail.

[1] https://earthly.dev/blog/cosine_similarity_text_embeddings/

[2] https://jalammar.github.io/illustrated-word2vec/

kaycebasques2y ago

I have been saying similar things to my fellow technical writers ever since the ChatGPT explosion. We now have a tool that makes semantic search on arbitrary, diverse input much easier. Improved semantic search could make a lot of common technical writing workflows much more efficient. E.g. speeding up the mandatory research that you must do before it's even possible to write an effective doc.

thomasfromcdnjs2y ago

I've been adding embeddings to every project I work on for the purpose of vector similarity searches.

I was just trying to order uber eats and wondering why they don't have a better search based off embeddings.

Almost finished building a feature on JSON Resume, that takes your hosted resume and WhoIsHiring job posts and uses embeddings to return relevant results -> https://registry.jsonresume.org/thomasdavis/jobs

EcommerceFlow2y ago

Embeddings have a special place in my heart since I learned about them 2 years ago. Working in SEO, it felt like everything finally "clicked" and I understood, on a lower level, how Google search actually works, how they're able to show specific content snippets directly on the search results page, etc. I never found any "SEO Guru" discussing this at all back then (maybe even now?), even though this was complete gold. It explains "topical authority" and gave you clues on how Google itself understands it.

mistermann2y ago

> You can even try dog breeds like ‘hound,’ ‘poodle,’ or my favorite ‘samoyed.’ It pretty much just works. But that’s not all; it also works for other languages. Try ‘chien’ and even ‘犬’!

Can anyone explain how this language translation works? The magic is in the embeddings of course, but how does it work, how does it translate ~all words across all languages?

willcodeforfoo2y ago

One thing I'm not sure of is how much of a larger bit of text should go into an embedding? I assume it's a trade off of context and recall, with one word not meaning much semantically, and the whole document being too much to represent with just numbers. Is there a sweet spot (e.g. split by sentence) or am I missing something here?
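There may not be a single sweet spot, but one option to experiment with is sentence-based chunks with a little overlap, so each embedding covers a few sentences and neighbouring chunks share context. A minimal sketch, assuming a naive regex sentence splitter; the function names and default sizes are arbitrary choices for illustration, not recommendations.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk(text, sentences_per_chunk=2, overlap=1):
    """Group sentences into overlapping windows; each chunk gets its own embedding."""
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than sentences_per_chunk")
    sentences = split_sentences(text)
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        # Stop once a window has reached the final sentence.
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks

text = (
    "Embeddings map text to vectors. Short inputs lack context. "
    "Long inputs blur topics. Chunking is a compromise."
)
chunks = chunk(text)
```

Trying a few values of `sentences_per_chunk` against your own queries is probably more informative than any general rule.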

tapatio2y ago

Tangential question: how are people using GenAI on financial datasets for insights and recommendations? Assume tens of disparate databases with financial data. Does NL2SQL work well for this? Or OpenAI Tools (formerly OpenAI Functions)? What have you found that is consistently accurate?

mehulashah2y ago

I think he is saying: embeddings are deterministic, so they are more predictable in production.

They’re still magic, with little explainability or adaptability when they don’t work.

pantulis2y ago

I strongly agree with the title of the article. RAG is very interesting right now just as an example of how technology moves from being fresh out of academia to being engineered and commoditized into regular off-the-shelf tools. On the other hand, I don't think it's that important to understand how embeddings are calculated; for the beginner it's more important to showcase why they work, why they enable simple reasoning like "queen = woman + (king - man)", and the possible use cases.
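The "queen = woman + (king - man)" arithmetic can be shown with hand-made vectors. This is a toy with two hypothetical axes chosen so the analogy works exactly; learned embeddings only approximate it, and the nearest neighbour is normally searched among words other than the three inputs.

```python
# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# woman + (king - man)
target = add(vectors["woman"], sub(vectors["king"], vectors["man"]))

# Look for the closest word, excluding the three words used in the query.
candidates = {w: v for w, v in vectors.items() if w not in {"woman", "king", "man"}}
nearest = min(candidates, key=lambda w: squared_distance(target, candidates[w]))
```

Here `king - man` isolates the "royalty" direction, and adding it to `woman` lands exactly on the toy vector for `queen`.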


LunaSea2y ago

Does anyone have examples of word (ngram) disambiguation when doing Approximate Nearest Neighbour (ANN) on word vector embeddings?



Somehow Amazon continues to be the leader in muddy results which is a sign that it’s a huge problem domain and not easily fixable even if you have massive resources.

Aachen2y ago

But, thanks, this explains a lot about Amazon's search results and might help me steer it if I need to use it in the future :)

dceddia2y ago

anon3738392y ago

feoren2y ago

Only if you have an attention mechanism.

lubesGordi2y ago

I think this is the point of the attention portion in an LLM: to use context to skew the embedding result closer to what you're looking for.

It does seem a little strange 'ruler' would be closer to 'king' versus something like 'crown'.

bryantwolfOP2y ago

Yeah, these can be cute, but they're not ideal. I think the user feedback mechanism could help naturally align this over time, but it would also be gameable. It's all interesting stuff

jonnycoder2y ago

As the op, you can do both semantic search (embedding) and keyword search. Some RAG techniques call out using both for better results. Nice product by the way!

m11172y ago· 6 in thread

ah pgvector is kind of annoying to start with, you have to set it up and maintain it, and then it starts falling apart when you have more vectors

sdesol2y ago

cargobuild2y ago

https://www.pinecone.io/blog/pinecone-vs-pgvector/ check it out :)

hackernoteng2y ago

chuckhend2y ago

l1am02y ago

This is good until it isn’t. Tried to get it working for 4 hours and it just did not.

And then I had an important architectural gotcha moment: I want my database to be dumb. Its purpose is to store and query data in an efficient and ACID way.

Adding cronjobs and http calls to the database is a bad idea.

I love the simplicity and that it helps to keep embeddings up to date (if it works), but I decided not to treat my database as an application.

ntry012y ago

on the other hand, if you have postgres already, it may be easier to add pgvector than to add another dependency to your stack (especially if you are using something like supabase)

another benefit is that you can easily filter your embeddings by other fields, so everything is kept in one place, which could help with performance

gchadwick2y ago· 4 in thread

Sure, many do know these concepts already, but they're probably not the people wondering about a 'good starting point for the AI curious app developer'.

simonw2y ago

I published this pretty comprehensive intro to embeddings last year: https://simonwillison.net/2023/Oct/23/embeddings/

nicbou2y ago

I found many of your other posts and they were the spark that finally made me "get it" and look deeper into LLMs. This post looks like another slam dunk.

Keep up the good work!

gk12y ago

To add to the other recommendations, here's a primer on vector DBs: https://www.pinecone.io/learn/vector-database/

charlieyuan2y ago

Apologies!

Here's a good primer on embeddings from OpenAI: https://platform.openai.com/docs/guides/embeddings

hot_gril2y ago· 3 in thread

This is where I got started too. GloVe embeddings stored in Postgres.

isoprophlex2y ago

brianjking2y ago

As I'm trying to work on some pricing info for PGVector - can you share some more info about the hacker news posts you've embedded?

Cheers!

hot_gril2y ago

thorum2y ago· 3 in thread

Can embeddings be used to capture stylistic features of text, rather than semantic? Like writing style?

levocardia2y ago

vladimirzaytsev2y ago

You may get way more accurate results from relatively small models as well as logits for each class if you ask one question per class instead.

vladimirzaytsev2y ago

Likely not, embeddings are very crude. The embedding of a text is just an average of the "meanings" of its words.

As is, embeddings lack a lot of the tricks that made transformers so efficient.
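
To make the "average of meanings" idea concrete, here is a small sketch of that averaging scheme. The 2-dimensional word vectors and the `average_embedding` helper are invented for illustration, and this is not a claim about how any particular embedding model works. It shows one reason a plain average is crude: word order is thrown away.

```python
import math

# Hypothetical 2-dimensional word vectors, made up for this example.
WORD_VECTORS = {
    "dog": [0.9, 0.1],
    "bites": [0.5, 0.5],
    "man": [0.1, 0.9],
}


def average_embedding(text):
    """Embed a text as the element-wise mean of its known word vectors."""
    vectors = [
        WORD_VECTORS[word]
        for word in text.lower().split()
        if word in WORD_VECTORS
    ]
    if not vectors:
        raise ValueError("no known words in text")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# The two sentences mean different things, yet their averages agree
# (up to floating point rounding), because averaging ignores order.
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(same)
```

Anything that depends on ordering or phrasing, which is a lot of what writing style is, gets lost under this scheme.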

mrkeen2y ago· 2 in thread

Given

  not because they’re sufficiently advanced technology indistinguishable from magic, but the opposite.

  Unlike LLMs, working with embeddings feels like regular deterministic code.

  Creating embeddings

I was hoping for a bit more than:

  They’re a bit of a black box

  Next, we chose an embedding model. OpenAI’s embedding models will probably work just fine.

Aachen2y ago

akoboldfrying2y ago

primitivesuave2y ago· 2 in thread

Shameless plug in case anyone wants to test it out - https://gita.pub

forgingahead2y ago

Really nice and beautiful site!

primitivesuave2y ago

Thank you! :)

cargobuild2y ago· 2 in thread

gregorymichael2y ago

Has Pinecone gotten any cheaper? Last time I tried it was $75/month for the starter plan / single vector store.

cargobuild2y ago

yep. pinecone serverless has reduced costs significantly for many workloads.

aidenn02y ago· 2 in thread

Can someone give a qualitative explanation of what the vector of a word with 2 unrelated meanings would look like compared to the vector of a synonym of each of those meanings?

base6982y ago

If you think about it like a point on a graph, and the vectors as just 2D points (x,y), then the synonyms would be close and the unrelated meanings would be further away.
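
A tiny sketch of that picture, with hand-picked 2D coordinates that are purely illustrative (real embeddings have far more dimensions and are learned, not chosen by hand):

```python
import math

# Made-up 2D points standing in for word vectors.
points = {
    "big": (1.0, 1.0),
    "large": (1.1, 0.9),
    "tiny": (-1.0, -0.8),
}


def distance(a, b):
    """Euclidean distance between the points for two words."""
    return math.dist(points[a], points[b])


# The synonym sits closer to "big" than the unrelated word does.
synonym_is_closer = distance("big", "large") < distance("big", "tiny")
print(synonym_is_closer)  # True
```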

aidenn02y ago

I'm guessing 2 dimensions isn't enough for this.

benreesman2y ago· 1 in thread

He’s trying to sell a SaaS product (Pinecone), but he’s doing it the right way: it’s ok to be an influencer if you know what you’re talking about.

James Briggs has great stuff on this: https://youtube.com/@jamesbriggs

aeth0s2y ago

> crush the NYT’s recommender as a fun project for two reasons

Could you share what recommender you're referring to here, and how you can evaluate "crushing" it?

Sounds fun!

clementmas2y ago· 1 in thread

bootsmann2y ago

pg_vector has you covered

patrick-fitz2y ago· 1 in thread

Nice project! I find it can be hard to think of an idea that is well suited to AI. Using embeddings for search is definitely a good option to start with.

ParanoidShroom2y ago

I made a reverse image search when I learned about embeddings. It's pretty fun to work with images https://medium.com/@christophe.smet1/finding-dirty-xtc-with-...

dvaun2y ago· 1 in thread

I’d love to build a suite of local tooling to play around with different embedding approaches.

I’ve had great results using SentenceTransformers for quick one-off tasks at work for unique data asks.

I’m curious about clustering within the embeddings and seeing what different approaches can yield and what applications they work best for.

PaulHoule2y ago

KasianFranks2y ago· 1 in thread

They are named 'feature' vectors with scored attributes, similar to associative arrays. Just ask M.I. Jordan, D. Blei, S. Mian or A. Ng.

jerrygenser2y ago

voxelc4L2y ago

suprgeek2y ago

b. You could try a "fusion ranking" similar to the one you do, but structured so that 50% of the weight goes to a plain old keyword search in the metadata and 50% to the embedding-based search

Finally, something more interesting to noodle on: what if the embeddings were based on the icon images and the model knew how to search for textual descriptions in the latent space?
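
The 50/50 fusion ranking from point b could be sketched roughly like this. Everything here is an assumption for illustration: the `keyword_score` and `fused_ranking` helpers, the icon metadata, and the embedding scores (which are made-up numbers standing in for similarities computed elsewhere). It is not the app's actual ranking code.

```python
def keyword_score(query, text):
    """Fraction of query words that appear in the text (crude keyword match)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def fused_ranking(query, items, embedding_scores, keyword_weight=0.5):
    """Rank items by a weighted blend of keyword and embedding scores.

    `items` maps item name -> metadata text. `embedding_scores` maps item
    name -> similarity in [0, 1], assumed to be computed elsewhere.
    """
    ranked = []
    for name, metadata in items.items():
        score = (
            keyword_weight * keyword_score(query, metadata)
            + (1 - keyword_weight) * embedding_scores[name]
        )
        ranked.append((score, name))
    return sorted(ranked, reverse=True)


# Hypothetical icon metadata and made-up embedding similarities for "king".
items = {
    "crown": "king crown royal",
    "ruler": "ruler measure length",
    "mouse": "mouse rodent animal",
}
embedding_scores = {"crown": 0.70, "ruler": 0.80, "mouse": 0.10}

ranking = fused_ranking("king", items, embedding_scores)
print([name for _, name in ranking])
```

In this toy data the embedding score alone would put "ruler" first for "king"; the keyword half of the blend pulls "crown" back to the top. The 0.5 weight is just the split suggested above and would need tuning against real queries.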

adamgordonbell2y ago

My problem with this is that it doesn't explain a lot.

[1] https://earthly.dev/blog/cosine_similarity_text_embeddings/

[2] https://jalammar.github.io/illustrated-word2vec/

kaycebasques2y ago

thomasfromcdnjs2y ago

I've been adding embeddings to every project I work on for the purpose of vector similarity searches.

I was just trying to order uber eats and wondering why they don't have a better search based off embeddings.

EcommerceFlow2y ago

mistermann2y ago

Can anyone explain how this language translation works? The magic is in the embeddings of course, but how does it work, how does it translate ~all words across all languages?

willcodeforfoo2y ago

tapatio2y ago

mehulashah2y ago

I think he is saying: embeddings are deterministic, so they are more predictable in production.

They’re still magic, with little explainability or adaptability when they don’t work.

pantulis2y ago

LunaSea2y ago

Does anyone have examples of word (ngram) disambiguation when doing Approximate Nearest Neighbour (ANN) on word vector embeddings?
