We made this because sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.
We tried to make it dead-simple to use. With two scripts, you can index and get a functional interface for your repo. Every generated response shows where in the code the context for the answer was pulled from.
We also made it plug-and-play: every component, from the embeddings to the vector store to the LLM, is completely customizable.
If you want to see a hosted version of the chat interface with its features, here's a link: https://www.youtube.com/watch?v=CNVzmqRXUCA
We would love your feedback!
- Mihail and Julia
I'd also like the LLM to know all the documentation for any dependencies in the same way.
This is a great idea. Definitely something we plan to support.
P.S. I worked on BERT at Google and have PTSD from how much we tried to make it work for retrieval, and it never really did well. Don't have much experience with BGE though.
I'm using OpenAI embeddings right now in my own project and I'm asking because I'd like to evaluate other embedding models that I can run in/adjacent-to my backend server, so that I don't have to wait 200ms to embed the user's search phrase/query. I'm very impressed by your project and I thought I might save myself some trouble if you had done some clear evals and decided OpenAI is far-and-away better :)
That being said, our goal was to make the library modular so you can easily add support for whatever embeddings you want. Definitely encourage experimenting for your use-case because even in our tests, we found that trends which hold true in research benchmarks don't always translate to custom use-cases.
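To make the "swap in whatever embeddings you want" idea concrete, here's a minimal sketch of the kind of interface that enables it. This is a hypothetical illustration, not the library's actual API: any in-process model object exposing an `.encode()` method (e.g. a BGE checkpoint loaded via sentence-transformers) can be wrapped so query-time embedding avoids the network round-trip entirely.

```python
# Hypothetical sketch (not the library's actual API): a minimal embedder
# interface so a hosted model and a local in-process model are interchangeable.
from typing import List, Protocol


class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...


class LocalEmbedder:
    """Wraps any in-process model exposing .encode(texts) -> vectors,
    e.g. a BGE model loaded via sentence-transformers."""

    def __init__(self, model):
        self.model = model

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Normalize whatever the model returns into plain lists of floats.
        return [list(map(float, vec)) for vec in self.model.encode(texts)]
```

A hosted embedder (OpenAI, etc.) would implement the same `embed` signature, so the rest of the pipeline never needs to know which one is plugged in.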
Exactly why I asked! If you don't mind a follow-up question, how were you evaluating embedding models — was it mostly just vibes on your own repos, or something more rigorous? Asking because I'm working on something similar, and based on what you've shipped, I think I could learn a lot from you!
For the time being, indexing and retrieving a good collection of 10-20 code chunks is more effective/performant in practice.
So yes, you can certainly use it to index and query your own repos for yourself, but it's also a way to get more of your OSS lib's users onboarded.
We stress-tested with repos like langchain, llamaindex, and kubernetes, and there retrieval still needs work to reliably return relevant chunks. This is still an open research question.
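The "retrieve a good collection of 10-20 chunks" step above is, at its core, a top-k nearest-neighbor search over chunk embeddings. A bare-bones illustration (function and variable names are mine, not the library's; a real vector store would use an approximate index rather than a linear scan):

```python
# Illustrative sketch of top-k chunk retrieval by cosine similarity.
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: List[float],
          chunks: List[Tuple[str, List[float]]],
          k: int = 10) -> List[str]:
    """Return the k chunk texts whose embeddings are closest to the query."""
    scored = [(cosine(query, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```

On huge monorepos the hard part isn't this ranking step but chunking and embedding quality, which is why the top-k set can still miss the truly relevant code.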
So far, two similar solutions I tested crapped out on non-ASCII characters, because Python's UTF-8 decoder is quite strict.
P.S. You'll see a bunch of warnings for e.g. binary files that are ignored. https://github.com/Storia-AI/repo2vec/commit/1864102949e7203...
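For anyone hitting the strict-decoder crashes mentioned above, here's one common pattern for surviving them (an illustrative sketch with hypothetical names, not the code from the linked commit): sniff out binary files and decode the rest leniently, so a single bad byte can't abort a whole indexing run.

```python
# Illustrative sketch: skip binary-looking files and decode the rest
# leniently, instead of letting Python's strict UTF-8 decoder raise.
from pathlib import Path
from typing import Optional


def read_text_lenient(path: str) -> Optional[str]:
    data = Path(path).read_bytes()
    if b"\x00" in data:  # crude binary sniff: NUL bytes rarely appear in text
        return None      # caller can log a warning and skip the file
    # errors="replace" maps invalid UTF-8 bytes to U+FFFD instead of raising
    return data.decode("utf-8", errors="replace")
```

`errors="ignore"` is the other common choice; `"replace"` keeps the evidence that something was mangled, which is handy when debugging an indexing pipeline.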