Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
The classic example is word embeddings such as word2vec or GloVe, where, because the embeddings are meaningful in this way, one can see vector relationships such as "man - woman" ≈ "king - queen".
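To make the "man - woman" versus "king - queen" relationship concrete, here is a toy sketch. The 2-D vectors are hand-made for illustration (one axis for royalty, one for gender); they are not real word2vec or GloVe outputs, where the match would only be approximate.

```python
# Hand-made 2-D "embeddings": axis 0 = royalty, axis 1 = gender.
# These values are invented for illustration, not taken from any trained model.
toy = {
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
}

def sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return tuple(x - y for x, y in zip(a, b))

# Both differences isolate the same "gender" direction.
offset_commoner = sub(toy["man"], toy["woman"])
offset_royal = sub(toy["king"], toy["queen"])
print(offset_commoner, offset_royal)
```

In a trained model nobody assigns the axes by hand, so the two offsets end up close rather than identical.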
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants depending on how you normalize the embedding). Cosine similarity still works for even these super naive embeddings which makes it slightly easier to understand before getting into any mathy stuff.
You are 100% right that this won't give you the word embedding analogies like king - man + woman = queen or stuff like that. This embedding has no concept of relationships between words.
They're making a vector for a text that's the term frequencies in the document.
It's one step simpler than tfidf which is a great starting point.
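A minimal sketch of the term-frequency idea, assuming a naive whitespace tokenizer and made-up sample sentences (neither comes from the thread). Each text becomes a map of word counts, and cosine similarity compares two such maps.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Map each lowercase, whitespace-separated token to its count."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = term_frequencies("the cat sat on the mat")
doc_b = term_frequencies("the cat lay on the rug")
doc_c = term_frequencies("stock prices fell sharply today")

print(cosine_similarity(doc_a, doc_b))  # shares "the", "cat", "on"
print(cosine_similarity(doc_a, doc_c))  # shares nothing
```

Note that "the" contributes most of the overlap between the first two texts, which is exactly the weakness that stop-word filtering or tf-idf is meant to address.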
act: [0.1]
as: [0.4]
at: [0.3]
...
That's a very simple 1D embedding, and like you said would only give you popularity. But say you wanted other stuff like its vulgarity, prevalence over time, whether it's slang or not, how likely it is to start or end a sentence, etc. You would need more than 1 number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's like: act: [0.1, 0.1, 0.3, 0.0001, 0.000003, 0.003, ... (1536 items)]
...
The numbers don't mean anything in-and-of-themselves. The values don't represent qualities of the words, they're just numbers in relation to others in the training data. They're different numbers in different training data because all the words are scored in relation to each other, like a graph. So when you compute them you arrive at words and meanings in the training data as you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.
So the rage about a vector db is that it's a database for arrays of numbers (vectors) designed for computing them against each other, optimized for that instead of, say, a SQL or NoSQL database, which are all about retrieval etc.
So king vs prince etc. - When you take into account the 1536 numbers, you can imagine how compared to other words in training data they would actually be similar, always used in the same kinds of ways, and are indeed semantically similar - you'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc. but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".
It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.
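For reference, a rough sketch of that tf-idf weighting. It assumes the plain log(N / document frequency) variant and a whitespace tokenizer; libraries typically use smoothed variants, and the three sample documents are invented.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {word: weight} map per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return vectors

docs = ["the cat sat", "the dog sat", "the bird flew"]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

A word that appears in every document ("the" here) gets weight zero, while a word unique to one document gets the largest weight.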
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.
I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc (https://en.wikipedia.org/wiki/Most_common_words_in_English)
A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
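A hedged sketch of that projection step using NumPy's SVD. The random matrix is only a stand-in for real TF-IDF vectors, and the sizes (20 texts, 50 dimensions, 5 components) are arbitrary assumptions chosen to keep the example small.

```python
import numpy as np

def pca_project(vectors, n_components):
    """Project row vectors onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T

rng = np.random.default_rng(0)
tfidf = rng.random((20, 50))  # stand-in for 20 texts with 50-dimensional vectors
reduced = pca_project(tfidf, n_components=5)
print(reduced.shape)
```

With a real corpus you would feed in the TF-IDF matrix and then run cosine similarity on the reduced rows instead of the full-width ones.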
You can simplify this with a map, and only store non-zero values, but also you can be in-efficient: this is for learning. You can choose to store more valuable information than just word count. You can store any "feature" you want - various tags on a post, cohort topics for advertising, bucketed time stamps, etc.
For learning just storing word count gives you the mechanics you need of understanding vectors without actually involving neural networks and weights.
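A small sketch of the "use a map and only store non-zero values" idea. The feature names (a `word:` or `tag:` prefix per key) are a hypothetical convention made up for this example.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {feature: value} maps."""
    # Iterate over the smaller map; features missing from the other count as 0.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(key, 0.0) for key, value in a.items())

# Word counts and tags can live in the same vector as different features.
post_a = {"word:embedding": 3, "word:vector": 1, "tag:nlp": 1}
post_b = {"word:vector": 2, "tag:nlp": 1, "tag:search": 1}

print(sparse_dot(post_a, post_b))
```

Only the shared features ("word:vector" and "tag:nlp") contribute, so the cost scales with the number of non-zero entries rather than the vocabulary size.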
> I also doubt your claim “and the results aren't totally terrible”.
> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc
(1) the comment suggested filtering out these words, and (2) the results aren't terrible. This is literally the first assignment in Stanford's AI class [1], and the results aren't terrible.
> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
Wow that seems a lot more complicated for something that was supposed to be a learning exercise.
[1] https://stanford-cs221.github.io/autumn2023/assignments/sent...
Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.
The reason vector stores are important for production use-cases is mostly latency-related for larger sets of data (100k+ records), but if you're working on a toy project just learning how to use embeddings, you can compute cosine similarity with a couple lines of numpy by doing a dot product of a normalized query vector with a matrix of normalized records.
Best of all, it gives you a reason to use Python's @ operator, which with numpy matrices does a dot product.
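Roughly what that looks like, as a sketch with made-up 2-D vectors standing in for real embeddings. It assumes no vector is all zeros, since normalizing a zero vector would divide by zero.

```python
import numpy as np

def normalize(rows):
    """Scale a vector, or each row of a matrix, to unit length."""
    rows = np.asarray(rows, dtype=float)
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

# Toy "records" matrix: one row per stored embedding.
records = normalize([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = normalize([2.0, 0.1])

# With unit-length vectors, the dot product is the cosine similarity.
scores = records @ query
best_first = np.argsort(-scores)  # record indices, most similar first
print(best_first, scores[best_first])
```

This is a brute-force scan over every record, which is the approach the comment suggests is fine for toy projects.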
It feels a bit like the hype that happened with "big data". People ended up creating Spark clusters to query a few million records. Or using Hadoop for a dataset you could process with awk.
Professionally I've only ever worked with dataset sizes in the region of low millions and have never needed specialist tooling to cope.
I assume these tools do serve a purpose but perhaps one that only kicks in at a scale approaching billions.
Similarly, I read how Postgres won't scale for a backend application and I should use Citus, Spanner, or some NoSQL thing. But that day has not yet arrived.
We have duckdb embedded in our product[1] and it works perfectly well for billions of rows of data without the Hadoop overhead.
I regret ever messing around with Pinecone for my tiny and infrequently used set ups.
Seems like the next standard feature in every app is going to be natural language search powered by embeddings.
Most embeddings providers do normalization by default, and SentenceTransformers has a normalize_embeddings parameter which does that. (it's a wrapper around PyTorch's F.normalize)
Here's one in JavaScript (my prompt was "cosine similarity function for two javascript arrays of floating point numbers"):
function cosineSimilarity(vecA, vecB) {
if (vecA.length !== vecB.length) {
throw "Vectors do not have the same dimensions";
}
let dotProduct = 0.0;
let normA = 0.0;
let normB = 0.0;
for (let i = 0; i < vecA.length; i++) {
dotProduct += vecA[i] * vecB[i];
normA += vecA[i] ** 2;
normB += vecB[i] ** 2;
}
if (normA === 0 || normB === 0) {
throw "One of the vectors is zero, cannot compute similarity";
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Vector stores really aren't necessary if you're dealing with fewer than a few hundred thousand vectors - load them up in a bunch of in-memory arrays and run a function like that against them using brute-force.
http://machinelearning.org/archive/icml2008/papers/391.pdf
I'm sure there are earlier instances, though - the strict mathematical definition of embedding has surely been around for a lot longer.
(interestingly, the word2vec papers don't use the term either, so I guess it didn't enter "common" usage until the mid-late 2010s)
ollama pull all-minilm
curl http://localhost:11434/api/embeddings -d '{
"model": "all-minilm",
"prompt": "Here is an article about llamas..."
}'
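The same request from Python, as a sketch using only the standard library. The URL, model name, and request fields are taken from the curl command above; reading the result from an "embedding" key is my assumption about the response shape, and `fetch_embedding` needs a running Ollama server to work.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(prompt, model="all-minilm"):
    """Build the same POST request the curl command sends."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_embedding(prompt, model="all-minilm"):
    """Send the request; assumes the JSON response has an "embedding" list."""
    with urllib.request.urlopen(build_embedding_request(prompt, model)) as resp:
        return json.load(resp)["embedding"]
```

Calling `fetch_embedding("Here is an article about llamas...")` should then return a list of floats if the server is up and the model has been pulled.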
Embedding models run quite well even on CPU since they are smaller models. There are other implementations with a library form factor like transformers.js https://xenova.github.io/transformers.js/ and sentence-transformers https://pypi.org/project/sentence-transformers/
Check this video on building Semantic Search in Supabase: https://youtu.be/w4Rr_1whU-U
Also, the blog on announcement with links to text versions of the tutorials: https://supabase.com/blog/ai-inference-now-available-in-supa...
What would the cost of running this be like compared to the OpenAI embedding api?
You can generate CLIP embeddings locally on the DB server via:
SELECT abstract,
introduction,
figure1,
clip_text(abstract) AS abstract_ai,
clip_text(introduction) AS introduction_ai,
clip_image(figure1) AS figure1_ai
INTO papers_augmented
FROM papers;
Then you can search for embeddings via:
SELECT abstract, introduction
FROM papers_augmented
ORDER BY clip_text(query) <=> abstract_ai
LIMIT 10;
The approach significantly decreases search latency and results in cleaner code.
As an added bonus, EXPLAIN ANALYZE can now tell you the percentage of time spent in embedding generation vs search.
The linked library enables embedding generation for a dozen open source models and proprietary APIs (list here: <https://lantern.dev/docs/develop/generate>), and adding new ones is really easy.
Charlie @ v0.app
- https://llm.datasette.io/en/stable/plugins/directory.html#em...
Here's how to use them: https://simonwillison.net/2023/Sep/4/llm-embeddings/
I literally just published my first crate: candle_embed[1]
It uses Candle under the hood (the crate is more of a user friendly wrapper) and lets you use any model on HF like the new SoTA model from Snowflake[2].
[1] https://github.com/ShelbyJenkins/candle_embed [2] https://huggingface.co/Snowflake/snowflake-arctic-embed-l
For simple things we might not need to worry about storing much, we can generate the embeddings and just cache them or send them straight to retrieval as an array or something...
Storing the embeddings seems like the hard part. Do I need a special database or PG extension? Is there any reason I can't store them as blobs in SQLite if I don't have THAT much data and don't care too much about speed? Do generated embeddings ever 'expire'?
I don't expect to have to change the embeddings for each icon all that often, so storing them seemed like a good choice. However, you probably don't need to cache the embedding for each search query since there will be long-tail ones that don't change that much.
The reason to use pgvector over blobs is if you want to use the distance functions in your queries.
I use a Python one usually, but it's also possible to build a much faster one in C: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
My understanding of what's going on at a technical level might be a bit limited.
Redis also has vector search capability. However, the most popular answer you’ll get here is to use Postgres (pgvector).
I should use redis for queues but often I’ll just use a table in a SQLite database. For small scale projects I find it works fine, I’m wondering what an equivalent simple option for embeddings would be.
For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension vector from OpenAI would take 1536 * 4 = 6144 bytes of space, if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096, so the extra data will spill into overflow pages. Which again, isn't too big of a deal, but if the original table had small values before, then table scans can be slower.
One solution is to move it to a separate table, ex on an original `users` table, you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN that when you need it. Or you can use new techniques like Matryoshka embeddings[0] or scalar/binary quantization[1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.
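A sketch of the separate-table approach with Python's built-in sqlite3 module. Packing the vector as little-endian float32 is one assumed "compact BLOB format", and the column types and sample row are made up for the example.

```python
import sqlite3
import struct

def to_blob(vector):
    """Pack floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)

def from_blob(blob):
    """Inverse of to_blob."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT)")
# Embeddings live in their own table so scans of `users` stay small.
conn.execute(
    "CREATE TABLE users_embeddings("
    "user_id INTEGER PRIMARY KEY REFERENCES users(id), embedding BLOB)"
)
conn.execute("INSERT INTO users VALUES (1, 'alice')")

embedding = [0.5] * 1536  # placeholder values, not a real model output
conn.execute("INSERT INTO users_embeddings VALUES (?, ?)", (1, to_blob(embedding)))

row = conn.execute(
    "SELECT u.name, e.embedding FROM users u "
    "LEFT JOIN users_embeddings e ON e.user_id = u.id"
).fetchone()
print(row[0], len(row[1]))
```

The stored blob comes out at 1536 * 4 bytes, matching the arithmetic above.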
I also have a SQLite extension for vector search[2], but there are a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in .npy files" and "use pgvector".
Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs, which have temperatures/meta prompts/a million other dials that make outputs non-deterministic; most embedding models should be deterministic and should work forever.
[0] https://huggingface.co/blog/matryoshka
https://www.v0.app/search?q=king
https://www.v0.app/search?q=rodent
This isn't a criticism of the app - I'd rather get a few funny mismatches in exchange for being able to find related icons. But it's an interesting puzzle to think about.
I believe that's the measure of a man.
1. The intent of the user. Is it a description of the look of the icon or the utility of the icon?
2. How best to rank the results, which is a combination of intent, CTR of past search queries, bootstrapping popularity via usage on open source projects etc.
- Charlie of v0.app
Somehow Amazon continues to be the leader in muddy results which is a sign that it’s a huge problem domain and not easily fixable even if you have massive resources.
But, thanks, this explains a lot about Amazon's search results and might help me steer it if I need to use it in the future :)
It does seem a little strange 'ruler' would be closer to 'king' versus something like 'crown'.
And then I had an important architectural gotcha moment: I want my database to be dumb. Its purpose is to store and query data in an efficient and ACID way.
Adding cronjobs and http calls to the database is a bad idea.
I love the simplicity and that it helps to keep embeddings up to date (if it works), but I decided not to treat my database as an application.
another benefit is that you can easily filter your embeddings by other fields, so everything is kept in one place and could help with performance
it's a good place to start in those cases and if it is successful and you need extreme performance you can always move to other specialized tools like qdrant, pinecone or weaviate which were purpose-built for vectors
Sure many do know these concepts already but they're probably not the people wondering about a 'good starting point for the AI curious app developer'.
Keep up the good work!
Here's a good primer on embeddings from openai: https://platform.openai.com/docs/guides/embeddings
Pgvector is nice, and it's cool seeing quick tutorials using it. Back then, we only had cube, which didn't do cosine similarity indexing out of the box (you had to normalize vectors and use euclidean indexes) and only supported up to 100 dimensions. And there were maybe other inconveniences I don't remember, cause front page AI tutorials weren't using it.
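The normalize-then-use-euclidean trick works because, for unit-length vectors, squared euclidean distance equals 2 - 2 * cosine similarity, so both give the same nearest-neighbor ordering. A quick numeric check with made-up vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

# For unit vectors: ||ua - ub||^2 == 2 - 2 * cos(a, b)
print(euclidean(ua, ub) ** 2, 2 - 2 * cosine(a, b))
```

Smaller euclidean distance between normalized vectors therefore always means higher cosine similarity.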
I stored ~6 million Hacker News posts, their metadata, and the vector embeddings in a cheap $20/month VM running pgvector. Querying is very fast. Maybe there's some penalty to pay when you get to the billion+ row counts, but I'm happy so far.
* Which embedding model? (or number of dimensions)
* When you say 6 million posts - is it just the URL of the post, title, and author, or do you mean you've also embedded the linked URL (be it HN or elsewhere)?
Cheers!
As is, embeddings lack a lot of the tricks that made transformers so efficient.
not because they’re sufficiently advanced technology indistinguishable from magic, but the opposite.
Unlike LLMs, working with embeddings feels like regular deterministic code.
Creating embeddings
I was hoping for a bit more than: They’re a bit of a black box
Next, we chose an embedding model. OpenAI’s embedding models will probably work just fine.
It's good to know you can do this performantly on your own system, but if the article had started out with "look, this model can output similarity between two texts and we can make a search engine with that", that'd be much more up front about what to expect to learn from it
Edit: another comment mentioned you can't even run it yourself, you need to ask ClosedAI for every search query a user does on your website. WTF is this article, at that point you might as well pipe the query into general-purpose chatgpt which everyone already knows and let that sort it out
Shameless plug in case anyone wants to test it out - https://gita.pub
Here's a concrete example: "bow" would need to be close to "ribbon" (as in a bow on a present) and also close to "gun" (as a weapon that shoots a projectile), but "ribbon" and "gun" would seem to need to be far from each other. How does something like word2vec resolve this? Any transitive relationship would seem to fall afoul of this.
He’s trying to sell a SaaS product (Pinecone), but he’s doing it the right way: it’s ok to be an influencer if you know what you’re talking about.
James Briggs has great stuff on this: https://youtube.com/@jamesbriggs
Could you share what recommender you're referring to here, and how you can evaluate "crushing" it?
Sounds fun!
I’ve had great results using SentenceTransformers for quick one-off tasks at work for unique data asks.
I’m curious about clustering within the embeddings and seeing what different approaches can yield and what applications they work best for.
I’ve used DBSCAN for finding duplicate content, but this is less successful. With the parameters I am using it is rare for there to be a false positive, but there aren’t that many true positives. I’m sure I could do better if I tuned it up but I’m not sure if there is an operating point I’d really like.
Embeddings live a very biased existence. They are the product of a network (or some algorithm) that was trained (or built) with specific data (and/or code) and assume particular biases intrinsically (network structure/algorithm) or extrinsically (e.g., data used to train a network) which they impose on the translation of data into some n-dimensional space. Any engineered solution always lives with such limitations, but with the advent of more and more sophisticated methods for the generation of them, I feel like it's becoming more about the result than the process. This strikes me as problematic on a global scale... might be fine for local problems but could be not-so-great in an ever changing world.
1. In the embedder part trying out different embedding models and/or vector dimensions to explore if the Recall@K & Precision@K for your data set (icons) improves. Models make a surprising amount of difference to the quality of the results. Try the MTEB Leaderboard for ideas on which models to explore.
2. In the Information Retriever part you can try a couple of approaches:
a. After you retrieve from PGVector, see if you can use a reranker like Cohere to get better results https://cohere.com/blog/rerank
b. You could try a "fusion ranking" similar to the one you do but structured such that 50% of the weight is for a plain old keyword search in the metadata and 50% is for the embedding based search
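One way the 50/50 fusion in (b) could be sketched. It assumes both searches return scores already scaled to the 0..1 range, and the icon names and scores are hypothetical.

```python
def fuse_scores(keyword_scores, embedding_scores, keyword_weight=0.5):
    """Rank ids by a weighted sum of keyword and embedding scores.

    Both inputs map id -> score, assumed to be scaled to the 0..1 range.
    An id missing from one search counts as 0 for that search.
    """
    ids = set(keyword_scores) | set(embedding_scores)
    fused = {
        doc_id: keyword_weight * keyword_scores.get(doc_id, 0.0)
        + (1.0 - keyword_weight) * embedding_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

keyword_scores = {"crown": 1.0, "chess-king": 0.5}
embedding_scores = {"crown": 0.5, "ruler": 1.0, "chess-king": 0.75}
ranking = fuse_scores(keyword_scores, embedding_scores)
print(ranking)
```

Here "ruler" wins on embeddings alone, but the keyword half of the weight pushes "crown" and "chess-king" above it.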
Finally something more interesting to noodle on - what if the embeddings were based on the icon images and the model knew how to search for textual descriptions in the latent space?
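For the Recall@K and Precision@K comparison suggested in point 1, a minimal sketch of the two metrics. The retrieved list and the set of relevant icons are invented; in practice the relevant set would come from hand-labeled queries.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

retrieved = ["crown", "ruler", "chess-king", "throne"]  # search output, best first
relevant = {"crown", "chess-king", "throne"}            # hand-labeled ground truth

print(precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 2))
```

Running the same labeled queries against each candidate model gives numbers you can compare directly.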
You can manually make a vector of a word and then get stepwise up to the word2vec approach and then document embeddings. My post[1] does some of the first part and this great word2vec post[2] dives into it in more detail.
[1] https://earthly.dev/blog/cosine_similarity_text_embeddings/
I was just trying to order uber eats and wondering why they don't have a better search based off embeddings.
Almost finished building a feature on JSON Resume, that takes your hosted resume and WhoIsHiring job posts and uses embeddings to return relevant results -> https://registry.jsonresume.org/thomasdavis/jobs
Can anyone explain how this language translation works? The magic is in the embeddings of course, but how does it work, how does it translate ~all words across all languages?
They’re still magic, with little explainability or adaptability when they don’t work.