Show HN: SeaGOAT – local, “AI-based” grep for semantic code search

(github.com)

240 points · kantord · 2y ago · 39 comments

jarulraj · 2y ago

Neat AI app!

1. What feature extractor is used to derive code embeddings?

2. Would support for more complex queries be useful inside the app?

   -- Retrieve a subset of code snippets
   SELECT name
   FROM snippets
   WHERE file_name LIKE '%py' AND author_name LIKE 'John%'
   ORDER BY
      Similarity(
         CodeFeatureExtractor(Open(query)),
         CodeFeatureExtractor(data)
      )
   LIMIT 5;

kantord (OP) · 2y ago

embeddings are done using ChromaDB

support for more complex queries could be useful, but probably not using a query language since that would make it more difficult to use free-form text input.

You can already use it through an API: https://kantord.github.io/SeaGOAT/0.27.x/server/#understandi... so the best way to add support for more complex queries would probably be to add query parameters and to expose those options through the CLI as well

dcastm · 2y ago

For those curious about it, ChromaDB uses all-MiniLM-L6-v2[0] from Sentence Transformers[1] by default.

[0] https://docs.trychroma.com/embeddings#default-all-minilm-l6-...

[1] https://www.sbert.net/docs/pretrained_models.html

kantord (OP) · 2y ago

btw I am also working on a web version of it that will let you search multiple repositories at the same time, and you will be able to self-host it at work or run it locally on your machine. https://github.com/kantord/SeaGOAT-web

so that could provide a nicer interactive experience for more complex queries

avindroth · 2y ago

It’d be cool if it worked against GitHub repos; then you could save the embeddings and have a unified interface for querying repos.

I had this problem when trying to learn a library and figure out what all its functionality is. I ended up making a non-AI solution (an Emacs package), but this seems just a step or two away from your current project imho.

FloatArtifact · 2y ago

I would love to plumb this up with a speech recognition engine via commands as well as free dictation. I can see this being useful for navigating code semantically.

kantord (OP) · 2y ago

Actually I'm also working on a small web GUI for it; it could be fairly easy to add speech recognition to the web version!

https://github.com/kantord/SeaGOAT-web

freckletonj · 2y ago

UniteAI brings together speech recognition and document / code search. The major difference is your UI is your preferred text editor.

https://github.com/freckletonj/uniteai

signa11 · 2y ago

thankfully perl is no longer in vogue.

reddit_clone · 2y ago

:-(

There is still a lot of Perl code around. Something like this would be super useful.

artisanspam · 2y ago

What are the limitations on what languages this supports?

kantord (OP) · 2y ago

Currently it is hard-limited to these file extensions: https://github.com/kantord/SeaGOAT/blob/ebfde263b970ddecdddf...

This is to avoid wasting time processing files that cannot lead to good results. If you want to try it with a different programming language, please fork the repo, add your file formats, and test whether it gives meaningful results. If it does, please submit a pull request.

Other than that, one limitation is that the underlying model was trained on a specific dataset, which is filtered for a specific list of programming languages. So without changing the model as well, support for other languages could be subpar. At the moment the model is all-MiniLM-L6-v2; here's a detailed summary of the dataset: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
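
The extension limit described above can be sketched as a simple allowlist check. This is only an illustration, not SeaGOAT's actual code: the `SUPPORTED_EXTENSIONS` set and both function names are hypothetical, and the real list lives in the linked source file.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist for illustration; the real list is in the linked file.
SUPPORTED_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs"}


def is_supported(path: str) -> bool:
    """Return True when the file's extension is on the allowlist."""
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_supported(paths: list[str]) -> list[str]:
    """Keep only the files worth sending to the embedding step."""
    return [p for p in paths if is_supported(p)]
```

Under this sketch, trying another language would amount to adding its extension to the set and checking whether the results are still meaningful.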

kantord (OP) · 2y ago

Also, I plan to add features that incorporate a "dumb" analysis of the codebase to avoid cluttering the results with mostly irrelevant matches such as import statements or decorators. Those features would be language-dependent, so support would need to be added for each language.
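
A rough sketch of what such a "dumb" filter could look like for Python sources. The patterns and function names are assumptions made for illustration, not anything taken from SeaGOAT, and each additional language would need its own patterns.

```python
import re

# Hypothetical, Python-only patterns for lines that rarely answer a query.
LOW_VALUE_PATTERNS = [
    re.compile(r"^\s*(import|from)\s+\S+"),  # import statements
    re.compile(r"^\s*@\w[\w.]*"),            # decorators
]


def is_low_value(line: str) -> bool:
    """Return True when the line matches one of the low-value patterns."""
    return any(pattern.match(line) for pattern in LOW_VALUE_PATTERNS)


def drop_low_value(lines: list[str]) -> list[str]:
    """Remove low-value lines before they are ranked or displayed."""
    return [line for line in lines if not is_low_value(line)]
```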

tinix · 2y ago

Are the extensions configurable or truly hard-coded?


hollowpython · 2y ago

Does anyone know a tool like this but for arbitrary PDFs?

freedmand · 2y ago

Semantra! Shared it yesterday on HN https://github.com/freedmand/semantra

freckletonj · 2y ago

If you're ok working in a text editor, UniteAI works on PDFs, YouTube transcripts, code repos, web pages, local documents, etc. The nice thing about the editor is that once retrieval is done, you can hit another key combo to send the retrieved passages to an LLM (local, or ChatGPT) and ask questions or favors about them (such as summarization or formatting changes).

https://github.com/freckletonj/uniteai

kantord (OP) · 2y ago

btw PDF support could probably be added to SeaGOAT itself by adding a layer that translates the PDF files to text files, plus some additional changes to make sure that the page number is also included in the results
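
A minimal sketch of the page-number part of that idea. It assumes the PDF has already been turned into one string per page by some extraction library (not shown), and the `Chunk` type, `chunk_pages` function and `max_chars` limit are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int  # 1-based page number, kept so results can point back into the PDF
    text: str


def chunk_pages(pages: list[str], max_chars: int = 200) -> list[Chunk]:
    """Split each page's text into chunks while remembering the page number."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        current = ""
        for paragraph in page_text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            if current and len(current) + len(paragraph) + 1 > max_chars:
                # The running chunk is full: emit it and start a new one.
                chunks.append(Chunk(page_number, current))
                current = paragraph
            else:
                current = f"{current}\n{paragraph}" if current else paragraph
        if current:
            chunks.append(Chunk(page_number, current))
    return chunks
```

Each chunk could then be embedded like any other text, with `page` reported next to the match.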

hackncheese · 2y ago

My work has 10ish repos we use, and it looks like this needs to be run in a specific git repo. Is there a way for this tool to run in a parent directory that contains all the repos we use, with the same functionality?

reddit_clone · 2y ago

I had the same question. With modern (!?) microservices-type development, functionality is spread all over the place in several repos. It would be great if SeaGOAT supported multiple repos.

kantord (OP) · 2y ago

That could be a new feature; feel free to open an issue for it
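
Until something like that exists, a wrapper script could at least discover the repositories under a parent directory and handle them one at a time. The sketch below only covers the discovery step; `find_git_repos` is a hypothetical helper, and how each repository is then indexed or queried is left open.

```python
from pathlib import Path


def find_git_repos(parent: Path) -> list[Path]:
    """Return the direct child directories of `parent` that contain a .git entry."""
    return sorted(
        child
        for child in parent.iterdir()
        if child.is_dir() and (child / ".git").exists()
    )
```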

smoe · 2y ago

Looks very neat! Currently processing the repo I'm working on.

Can the generated database be easily shared within the team, so that not everyone has to run the initial processing of the repo? It looks like it will take a couple of hours on my laptop.

reddit_clone · 2y ago

It appears (from a brief glance) you can run it on a shared server. Only the client runs on the laptop.

nxobject · 2y ago

I'm looking forward to running a little experiment with this: I'm going to run it on the Linux kernel tree, sight unseen, knowing nothing about the structure of the Linux kernel. Will it help me navigate it for the first time?

Edit: processing chunks; see you tomorrow...

Nischalj10 · 2y ago

hey! did you get to try it out?

GranPC · 2y ago

Cool project! Just trying it out now - does it support CUDA acceleration? I'm running it on a rather large project and it claims it's got over 140k "tasks left in the queue", and I see no indicator of activity on nvidia-smi.

jasonjmcghee · 2y ago

I've been test driving a similar one https://github.com/sturdy-dev/semantic-code-search

But yours has a more permissive license!

I also had to modify it a bit to allow for the line endings I needed and it frustratingly doesn't allow specifying a path, and often returns tests instead of code


m3kw9 · 2y ago

Why not embed names of functions and variables to form a vector so you are language agnostic? Are you limited by the language parser that embeds the names?
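
One way to read this suggestion, sketched without any language parser: pull identifier-like tokens out of the source with a regular expression and split them into plain words that could then be embedded. The function name and both patterns are assumptions for illustration. Note that this naive version also picks up keywords and words inside strings or comments.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Split camelCase / PascalCase boundaries and acronyms, e.g. "parseHTTPResponse".
WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def identifier_words(source: str) -> list[str]:
    """Return lowercase words found inside identifier-like tokens."""
    words = []
    for identifier in IDENTIFIER.findall(source):
        for part in identifier.split("_"):
            words.extend(word.lower() for word in WORD.findall(part))
    return words
```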

eddywebs · 2y ago

Cool beans! Does it work only with Python-based codebases, or could others use it too? Like Java or C#?

Thank you for sharing.

la64710 · 2y ago

Just curious, did you use any LLM to generate code for this? BTW really awesome work!

retrofuturism · 2y ago

This would make a useful (nvim) Telescope plugin. Looks super interesting.

billconan · 2y ago

if the code doesn't contain comments, can it still work?

will it generate code comments for indexing using a language model? will that be expensive (assuming using GPT3)?

ithkuil · 2y ago

Interesting.

What would it take to support other programming languages?

nat0704 · 2y ago

Nice! Will try this out

freckletonj · 2y ago

Hey OP, this looks awesome!

I've done the same but was very disappointed with the stock sentence embedding results. You can get an embedding for any arbitrary string, but the cosine similarity used for nearest-neighbor lookup then gives a lot of false positives and negatives.

*There are 2 reasons:*

1. All embeddings from these models occupy a narrow cone of the total embedding space. Check out the cos sim of any 2 arbitrary strings. It'll be incredibly high! Even for gibberish and sensical sentences.

2. The datasets these SentenceTransformers models are trained on don't include much code, and certainly not intentionally. At least I haven't found a code-focused one yet.
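
The "narrow cone" effect in the first point can be illustrated with toy vectors. The numbers below are made up and are not real model output: two vectors that share a large common component score a very high cosine similarity even when the parts that actually distinguish them are orthogonal.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: a large shared offset plus a small distinguishing part.
shared = [10.0, 10.0, 10.0, 10.0]
distinct_a = [1.0, 0.0, 0.0, 0.0]
distinct_b = [0.0, 1.0, 0.0, 0.0]
vector_a = [s + d for s, d in zip(shared, distinct_a)]
vector_b = [s + d for s, d in zip(shared, distinct_b)]

with_offset = cosine_similarity(vector_a, vector_b)         # very close to 1
without_offset = cosine_similarity(distinct_a, distinct_b)  # orthogonal parts
```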

*There are solutions I've tried with mixed results:*

1. Embedding "whitening" decorrelates the embedding dimensions, so they end up nearly orthogonal. If you truncate the whitened embeddings and keep just the components for the top n eigenvalues, you get a sort of semantic compression that improves results.

2. train a super light neural net on your codebase's embeddings (takes seconds to train with a few layers) to improve nearest neighbor results. I suspect this helps because it rebiases learning to distinguish just among your codebase's embeddings.

*There are solutions from the literature I am working on next that I find conceptually more promising:*

1. Chunk the codebase, and ask an LLM on each chunk to "generate a question to which this code is the answer". Then do natural language lookup on the question, and return the code for it.

2. You have your code lookup query. Ask an LLM to "generate a fabricated answer to this question". Then embed its answer, and use that to do your lookup.

3. We use the AST of the code to further inform embeddings.
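
The second idea in that list could be sketched roughly as follows. Both `generate` (an LLM call) and `embed` (an embedding call) are hypothetical callables supplied by the caller, and the function name is an assumption; nothing here reflects how UniteAI or SeaGOAT implement it.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_with_fabricated_answer(
    query: str,
    chunks: Sequence[str],
    generate: Callable[[str], str],  # hypothetical LLM call
    embed: Callable[[str], Vector],  # hypothetical embedding call
    top_k: int = 1,
) -> list[str]:
    """Embed an LLM-written guess at the answer and rank code chunks against it."""
    guess = generate(f"Generate a fabricated answer to this question: {query}")
    guess_vector = embed(guess)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(embed(chunk), guess_vector),
        reverse=True,
    )
    return ranked[:top_k]
```

The first idea is the mirror image: generate a question per chunk at indexing time, embed the questions, and match the user's query against those instead.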

I have this in my project UniteAI [1] and would love if you cared to collab on improving it (either directly, or via your repo and then building a dependency to it into UniteAI). I'm actually trying to collab more, so, this offer goes to anyone! I think for the future of AI to be owned by us, we do that through these local-first projects and building strong communities.

[1] https://github.com/freckletonj/uniteai


MisterTea · 2y ago

Is the naming a coincidence or some sort of strange homage? I can't help thinking GOATsea.
