We made this because sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.
We tried to make it dead-simple to use. With two scripts, you can index and get a functional interface for your repo. Every generated response shows where in the code the context for the answer was pulled from.
We also made it plug-and-play: every component, from the embeddings to the vector store to the LLM, is completely customizable.
If you want to see a hosted version of the chat interface with its features, here's a link: https://www.youtube.com/watch?v=CNVzmqRXUCA
We would love your feedback!
- Mihail and Julia
I'd also like the LLM to know all the documentation for any dependencies in the same way.
This is a great idea. Definitely something we plan to support.
P.S. I worked on BERT at Google and have PTSD from how much we tried to make it work for retrieval, and it never really did well. Don't have much experience with BGE though.
I'm using OpenAI embeddings right now in my own project and I'm asking because I'd like to evaluate other embedding models that I can run in/adjacent-to my backend server, so that I don't have to wait 200ms to embed the user's search phrase/query. I'm very impressed by your project and I thought I might save myself some trouble if you had done some clear evals and decided OpenAI is far-and-away better :)
That being said, our goal was to make the library modular so you can easily add support for whatever embeddings you want. Definitely encourage experimenting for your use-case because even in our tests, we found that trends which hold true in research benchmarks don't always translate to custom use-cases.
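To make the "swap in whatever embeddings you want" idea concrete, here's a minimal sketch of the kind of interface that enables it. This is a hypothetical illustration, not the library's actual API: any in-process model object exposing an `.encode()` method (e.g. a BGE checkpoint loaded via sentence-transformers) can be wrapped so query-time embedding avoids the network round-trip entirely.

```python
# Hypothetical sketch (not the library's actual API): a minimal embedder
# interface so a hosted model and a local in-process model are interchangeable.
from typing import List, Protocol


class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...


class LocalEmbedder:
    """Wraps any in-process model exposing .encode(texts) -> vectors,
    e.g. a BGE model loaded via sentence-transformers."""

    def __init__(self, model):
        self.model = model

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Normalize whatever the model returns into plain lists of floats.
        return [list(map(float, vec)) for vec in self.model.encode(texts)]
```

A hosted embedder (OpenAI, etc.) would implement the same `embed` signature, so the rest of the pipeline never needs to know which one is plugged in.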
Exactly why I asked! If you don't mind a follow-up question, how were you evaluating embedding models — was it mostly just vibes on your own repos, or something more rigorous? Asking because I'm working on something similar, and based on what you've shipped, I think I could learn a lot from you!
For the time being, indexing and retrieving a good collection of 10-20 code chunks is more effective/performant in practice.
So yes, you can certainly use it to index and query your own repos for yourself, but it's also a way to get more of your OSS lib's users onboarded.
We stress-tested with repos like langchain, llamaindex, and kubernetes, and there retrieval still needs work to reliably return relevant chunks. This is still an open research question.
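The "retrieve a good collection of 10-20 chunks" step above is, at its core, a top-k nearest-neighbor search over chunk embeddings. A bare-bones illustration (function and variable names are mine, not the library's; a real vector store would use an approximate index rather than a linear scan):

```python
# Illustrative sketch of top-k chunk retrieval by cosine similarity.
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: List[float],
          chunks: List[Tuple[str, List[float]]],
          k: int = 10) -> List[str]:
    """Return the k chunk texts whose embeddings are closest to the query."""
    scored = [(cosine(query, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```

On huge monorepos the hard part isn't this ranking step but chunking and embedding quality, which is why the top-k set can still miss the truly relevant code.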
So far, two similar solutions I tested crapped out on non-ASCII characters, because Python's UTF-8 decoder is quite strict.
P.S. You'll see a bunch of warnings for e.g. binary files that are ignored. https://github.com/Storia-AI/repo2vec/commit/1864102949e7203...
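For anyone hitting the strict-decoder crashes mentioned above, here's one common pattern for surviving them (an illustrative sketch with hypothetical names, not the code from the linked commit): sniff out binary files and decode the rest leniently, so a single bad byte can't abort a whole indexing run.

```python
# Illustrative sketch: skip binary-looking files and decode the rest
# leniently, instead of letting Python's strict UTF-8 decoder raise.
from pathlib import Path
from typing import Optional


def read_text_lenient(path: str) -> Optional[str]:
    data = Path(path).read_bytes()
    if b"\x00" in data:  # crude binary sniff: NUL bytes rarely appear in text
        return None      # caller can log a warning and skip the file
    # errors="replace" maps invalid UTF-8 bytes to U+FFFD instead of raising
    return data.decode("utf-8", errors="replace")
```

`errors="ignore"` is the other common choice; `"replace"` keeps the evidence that something was mangled, which is handy when debugging an indexing pipeline.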