Changes to Code Search Indexing

(github.blog)

82 points · xPaw · 5y ago · 58 comments

est31 · 5y ago · 5 in thread

This is sad. Anyone working with larger codebases knows that 99% of the time, the code they encounter is years old. It's not uncommon for me to stumble upon files whose last non-trivial change was 3 or 7 years ago... Of course other files get more regular updates, but this translates to entire repositories as well, especially in languages where smaller libraries are encouraged. I'm sure that even among github's gem dependency tree, there are open source dependencies that haven't seen changes in over a year. To say these are irrelevant is just wrong.

Github's search already sucks quite a lot, but for some things it's extremely useful. For example, when I'm interested in where my Rust library is used, I can use the toml filetype restriction and search for the name of my library. It will show way more results than the projects published to crates.io, as those projects are only a tiny subset. These projects might not see extremely regular updates, but I consider them still relevant information. I want it to be my choice whether to discard them or not.

OJFord · 5y ago

Yep, but note it's requiring repo-level activity, not that it won't index a LoC that hasn't changed for a year.

So, 'anyone working on a larger codebase' has probably caused some activity on it within the last year.

I think it's fair enough, certainly if it were two or three years. Anyone who objects can have a GitHub Action amend an empty commit on an unused branch once a year anyway!
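A rough sketch of the keep-alive idea above. It assumes an empty commit pushed to a side branch would count as repo-level activity, which the announcement does not confirm. The branch name `keepalive` and both helper names are hypothetical; `git commit --allow-empty` and `git checkout -B` are real git options.

```python
import subprocess
from datetime import date


def keepalive_commands(branch="keepalive", today=None):
    """Build the git commands for one empty keep-alive commit on a side branch."""
    today = today or date.today()
    message = f"keepalive: {today.isoformat()}"
    return [
        # -B creates the branch, or resets it to the current HEAD if it exists
        ["git", "checkout", "-B", branch],
        # --allow-empty records a commit even though no files changed
        ["git", "commit", "--allow-empty", "-m", message],
        # the branch was reset above, so the push is not a fast-forward
        ["git", "push", "--force", "origin", branch],
    ]


def run_keepalive(repo_dir, branch="keepalive"):
    """Run the commands inside repo_dir; raises CalledProcessError on failure."""
    for cmd in keepalive_commands(branch):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Scheduling `run_keepalive` once a year (cron, or a scheduled CI job) is left out; the force-push only ever touches the throwaway branch.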

est31 · 5y ago

Those are fair points. Larger projects often get split into multiple sub-projects, especially in git-age languages like ruby, js or Rust. The same rules apply there: often such projects get created by their original authors and then work for the most part, so don't see any changes for years. So the effects translate into multi-repo projects as well.

To give an example, in Rust the mime crate has over 14 million downloads, but the last git commit was in January 2020. So it will fall out of Github's search index soon, even though it's used by everyone who uses the reqwest crate, which is a lot of people.

In fact, likely some people will run cargo vendor and then upload the results to github often enough that it'll show up in search results, but instead of the canonical location you'll have to look at vendored repos... not really nice.

sphynxie · 5y ago

For what it's worth, I know someone who started working at GitHub early this year to "make search not suck". So this might be an early indicator of things to come.

est31 · 5y ago

Let's hope it will improve at least when searching on a single repo. That I can easily emulate locally though by using ripgrep. Yes, ripgrep gives better results than Github's search :).

Emulating the "which projects use my library" feature of Github is harder though.

tru3_power · 5y ago

Seems like a good paid service they could offer.

ballenf · 5y ago · 5 in thread

This worries me. At least monthly, I will find a repository that is 5ish years old that is attacking some problem I'm dealing with. I may not interact with it, but I review the code and learn a lot.

I guess Google will still index dormant repos.

kevin_thibedeau · 5y ago

Check it out and index it locally with tools like these:

https://archive.codeplex.com/?p=codeidx

https://github.com/sourcegraph/sourcegraph

https://github.com/ggreer/the_silver_searcher

nirav72 · 5y ago

Only useful if you already know about a repo.

reader_1000 · 5y ago

> I guess Google will still index dormant repos.

Github's search is inferior compared to google's site search for github (i.e. keyword site:github.com).

I don't know why, but Github search is almost useless if the thing you are searching for is not an exact match. For example, I tried just one random search. Search keyword: HashMapBoxing https://github.com/google/guava/search?q=HashMapBoxing

No results for me, however here is one that contains word IdentityHashMapBoxing https://github.com/google/guava/blob/master/guava-tests/test...

So I almost stopped using it. Either I use google search or I just clone the repo and use my IDE.

Edit: Ok, this search also fails with Google search, but still :)

that_guy_iain · 5y ago

I think this is just for the code and not the descriptions and names. But just thinking that will probably include the readme which has all the important information.

xmprt · 5y ago

Oftentimes I search through the code once I find the repo after searching by name. So I still would like the code to be indexed in that case.

mrcarruthers · 5y ago · 5 in thread

AKA the indexes are too large and we don't want to spend the money

sdesol · 5y ago

I'm sure money played some part, but the fact is, unless you have researched code search, it's quite a difficult problem to solve. The biggest problem is, it's very difficult to define relevancy, since code search is EXTREMELY context driven. And I'm guessing purging inactive repositories addresses most of the noise issues they have.

However, I'm kind of surprised by a blanket purge like this, as I figured they would take repo stars and other signals into consideration to decide what to purge.

_flux · 5y ago

If it wasn't about money, surely they could have a second index for old code? And then asynchronously query both, filling the fast results in first.

Or maybe have a keyword for searching old code, or search if current results seem bad, etc.
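The "query both indexes asynchronously, fast results first" idea can be sketched with `asyncio`. This is only an illustration: the two index names, the latencies, and the canned hits are all made up, and nothing here reflects how GitHub's backend works.

```python
import asyncio


async def search_index(name, delay, hits):
    """Stand-in for one index backend; delay simulates its latency."""
    await asyncio.sleep(delay)
    return name, hits


async def search_both(query):
    """Query a hot and a cold index concurrently, collecting results in arrival order."""
    # Hypothetical canned results; a real backend would actually look up `query`.
    tasks = [
        asyncio.create_task(search_index("recent", 0.01, [f"{query} in active-repo"])),
        asyncio.create_task(search_index("archive", 0.05, [f"{query} in dormant-repo"])),
    ]
    arrived = []
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        arrived.append((name, hits))  # a UI could render this batch right away
    return arrived
```

Calling `asyncio.run(search_both("HashMapBoxing"))` returns the "recent" batch before the "archive" batch, because the simulated archive index is slower.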

alkonaut · 5y ago

This is something I’d pay for. I don’t need any other GitHub services, but this I’d pay for.

Would it be possible to index externally?

chinhodado · 5y ago

I doubt that's the reason, given Microsoft's resources.

MikeKusold · 5y ago

A hypothetical conversation:

Eng: "Our elasticsearch database has been growing exponentially for years, and in order to keep search alive we need to completely rearchitect it. We estimate it will take a team of 8 people, 1 year to launch a stable alternative."

Product: "What if we just reduce the size of the index by removing inactive repositories? This will allow the team to focus on revenue generating features."

Eng: "Yeah, that should keep our current solution running for the foreseeable future without significant impact to our customers."

glup · 5y ago · 4 in thread

This is idiotic. A ML codebase for a repo from 2017 -- like Word2Vec (https://github.com/tmikolov/word2vec) -- won't show up in search anymore?

Guess we can all write cron jobs that commit useless crap yearly.

neovintage · 5y ago

I apologize the changelog wasn't more clear. For repos like that, they continue to show up in search results. Every time they do, that resets the clock on when the repo ages out of the index. We tried to strike a balance between performance and having _all_ the code in the index.

jasonpeacock · 5y ago

But once they age out, then they're invisible as they won't show in the search results.

If showing in the search index is what resets the clock (in addition to making commits), how do you search for something that's not in the index?

jacobwg · 5y ago

> Every time they do, that resets the clock on when the repo ages out of the index.

Sorry, what resets the clock?

gizmo385 · 5y ago

If that repo has shown up in a search result, it will still be indexed
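The aging rule as described in this sub-thread (a commit or an appearance in search results resets the clock) might look roughly like the sketch below. The one-year window, the field names, and the function are assumptions for illustration; the actual policy details were not spelled out in the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_IDLE = timedelta(days=365)  # "one year" per the thread; exact cutoff is an assumption


@dataclass
class RepoActivity:
    last_commit: datetime
    last_search_hit: Optional[datetime] = None  # None if never returned in a result


def stays_indexed(repo: RepoActivity, now: datetime) -> bool:
    """True if the most recent commit or search hit falls within the idle window."""
    latest = repo.last_commit
    if repo.last_search_hit and repo.last_search_hit > latest:
        latest = repo.last_search_hit
    return now - latest <= MAX_IDLE
```

Under this reading, a 2017 repo that was returned in a search six months ago stays indexed, while one with no hits for over a year drops out. Once it has dropped out it can no longer collect new hits, which is the concern raised above.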

karlicoss · 5y ago · 4 in thread

I guess https://grep.app (discussed here [0]) becomes even more useful now. Although I'm not sure what exactly they are indexing.

And for anyone using Github to search in their own code -- Ripgrep works really well even if you run it against your whole code directory and gives you instantaneous results (if you use an SSD!). I'm describing my code search setup with ripgrep + emacs here [1]

[0] https://news.ycombinator.com/item?id=22396824

[1] https://beepb00p.xyz/pkm-search.html#code

simonw · 5y ago

I built a web UI on top of ripgrep a few weeks ago - just a few dozen lines of Python - and it's fantastically useful.

I already find myself using it a lot - recent example search: https://ripgrep.datasette.io/-/ripgrep?pattern=changelog&ign...

More about that project here: https://simonwillison.net/2020/Nov/28/datasette-ripgrep/
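For a sense of how little Python it takes to wrap ripgrep, here is a minimal sketch that shells out to `rg --json` and parses the match events. It is not the datasette-ripgrep code; the function names are hypothetical, and it assumes `rg` is on the PATH and that file paths are valid UTF-8 (ripgrep emits a `bytes` field instead of `text` otherwise).

```python
import json
import subprocess


def parse_rg_json(lines):
    """Turn ripgrep --json output lines into (path, line_number, text) tuples."""
    matches = []
    for raw in lines:
        if not raw.strip():
            continue
        event = json.loads(raw)
        if event.get("type") != "match":
            continue  # skip "begin", "end" and "summary" events
        data = event["data"]
        matches.append(
            (data["path"]["text"], data["line_number"], data["lines"]["text"].rstrip("\n"))
        )
    return matches


def search(pattern, root="."):
    """Run rg (must be installed) under root and return parsed matches."""
    proc = subprocess.run(
        ["rg", "--json", "-e", pattern, root], capture_output=True, text=True
    )
    return parse_rg_json(proc.stdout.splitlines())
```

A web front end would then only need to pass a query parameter to `search` and render the tuples.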

SommaRaikkonen · 5y ago

Man I remember when the thread was active, the OP got contacted by 2 CEOs and a CTO immediately because his search was so blazing fast.

I wish I was half as good at building a program like him. Instead I'm here struggling building my own RPI4 Radio player =(

temikus · 5y ago

Don’t let your dreams be dreams.

The author did give some hints on how he built it, e.g. indexing/search is driven by Apache Solr on a couple of 20-core machines.

Regarding data ingestion you probably can look at some prior art like this: https://github.com/garysieling/solr-git

There are also some pretty decent books referenced in the Solr docs: https://lucene.apache.org/solr/resources.html

karlicoss · 5y ago

Yeah, and for even more context, it was CTO of Github! Wonder what happened there ¯\_(ツ)_/¯

q3k · 5y ago · 3 in thread

Does anyone actually use the GitHub global code search? I've always found it to be pretty much useless.

(compare the results from https://cs.opensource.google/ vs. GH search)

upbeat_general · 5y ago

All the time. Finding obscure uses of libraries/APIs or just how certain code snippets are used, or even just repos about a general topic/implementation of a thing. It’s very sensitive and oftentimes doesn’t work well but I’d prefer to spend 30 seconds trying a couple different variations of my search than get no results at all. Google’s open source search is probably better but until they index Github it’s practically useless for me. Github search may be a bad search tool but by nature of having so much code it’s still useful for me.

jan_Inkepa · 5y ago

I use GitHub global code search all the time looking for example uses of poorly-documented APIs.

E.g. if I search for 'haxeflixel' (a game engine) I get nothing on cs.opensource.google, but 532 repos appear on github search.

q3k · 5y ago

Right, but that's just a quantity thing. The quality, from my experience, is horrible. Once I know which repository I'm interested in, I have much better results cloning the code and grepping for things locally.

jan_Inkepa · 5y ago · 2 in thread

I noticed that something was amiss a year or so ago (IIRC) when they disabled global search for people not logged in.

This is sad, TBH. I've found global search very useful for searching for example-uses of rarely-used libraries. Having all that at my fingertips was useful.

I understand legacy examples aren't useful for everyone, but for me they often were, and now a lot of code will be completely unfindable. :/

Think of niche programming languages and the like that've passed their heyday - I guess tagging might help, but a lot of people don't bother tagging projects.

Also, I guess this includes one's own projects? I have about a hundred repos of various ages, and it's easy to lose track of them. Not being able to search through my own code sounds like a bit of a bummer (Though I don't have an intuition for how often I search for stuff in my own repos, TBH).

Pity there wasn't a better solution available to solve their problems.

rurban · 5y ago

The proper technical solution would be to use a proper search backend, not solr, more like xapian, and sort by relevance. Newer and more relevant first. Even google codesearch can do that properly. And still tons faster than a dynamic ripgrep, which has to read every file.

Skipping old code might improve the noise, but people will always complain

temikus · 5y ago

I think this will not affect third-party search like grep.app since they have their own index. It’s not as large but should help.

chinhodado · 5y ago · 2 in thread

This sucks. I use code search from time to time to find example usages of obscure APIs. With this change it's more or less useless now.

Ciantic · 5y ago

Yes. My use case for global search has been obscure APIs also. They should have two different code searches if they can't maintain one.

A simpler version for global code search, which for all I care could be more literal, like matching only exact substrings. It would still allow discovering usages of obscure APIs.

And a more comprehensive one for searching inside a selected repo. Not sure which one is the problem for them at the moment.

the_only_law · 5y ago

I’ve found some shady but incredibly fascinating repos during some obscure projects that probably will be lost to time with code search.

neovintage · 5y ago · 1 in thread

Hey everyone. I had a lot of folks reach out to me about this change. We've heard y'all loud and clear that you want a better code search. The goal of this change was to balance performance and search relevance while we work on a new code search backend. The vast majority of folks shouldn't notice a change in their search results. The folks that will are those that are using code search to count things across all the repositories in GitHub. That's a use case that we don't expressly support as part of code search but we know folks are doing it anyway. My goal for publishing this was to be open about the change and for those folks that are using code search for analytics not have to guess at what happened.

chinhodado · 5y ago

> The vast majority of folks shouldn't notice a change in their search results.

I'm not sure what you meant by this. How is this the case when repos with no commit within the last year (which is extremely common) are no longer indexed?

import · 5y ago · 1 in thread

Gonna write "show-my-repos-in-search-results.sh" script which I can run every year.

scary-size · 5y ago

That’s what I thought as well. What if someone scraped all repos and made them show up in search at least once a year?

jchw · 5y ago

This kind of sucks. Sometimes I use Code Search to try to find things that are particularly obscure, such as usages of new or obscure APIs, or what have you. No matter what the rule is, not indexing all of Github breaks this.

I suppose I take it for granted that ridiculously huge search indices are a solved problem, but it turns out they aren’t.

upbeat_general · 5y ago

I saw the title and immediately I thought “yesss Github is finally fixing its terrible code search”... Instead they’re making it worse.

To be clear, Github search has been very useful, just extremely sensitive and finicky.

luhn · 5y ago

The announcement is a bit ambiguous: Does this include the per-repo search, or just the global code search? Does code search include issues and pull requests, or just code?

Honestly I've always hated Github's full text search on code. Give me an amped-up per-repo grep and scrap global code search entirely. Maybe I'm lacking in imagination, but I can't see how full text search is useful to anybody.

superasn · 5y ago

This is very bad. A lot of the time when I'm stuck with some issue like how to use an API and can't find answers in docs or SO, I usually turn to usages of the said functions with github code search. Most often it's some 3-year-old project with 1 star that shows me how they used it and what values to use, etc., and that really serves a good purpose.

adamnemecek · 5y ago

If anyone from GitHub is reading this, can you add dedup? I spend a lot of time searching for things on GitHub but sometimes out of 100 pages of results, 80 might be the same file included in different projects.
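One simple way to do the dedup asked for here is to hash each result's file content and keep only the first hit per hash. This is a sketch under the assumption that results arrive as `(repo, path, content)` tuples, a shape invented for the example. Hashing only catches byte-identical copies, so vendored files with small edits would still show up separately.

```python
import hashlib


def dedup_results(results):
    """Keep the first hit for each distinct file content.

    `results` is an iterable of (repo, path, content) tuples.
    """
    seen = set()
    unique = []
    for repo, path, content in results:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical file already reported from another repo
        seen.add(digest)
        unique.append((repo, path, content))
    return unique
```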

fahrradflucht · 5y ago

Does this also affect repos in paid teams? This is an essential feature I often use to find usage of something throughout the org. Especially the old forgotten code is what I am looking for there...

bredren · 5y ago

This is a good idea for default search: there's lots of bloat, and some languages and frameworks are changing incredibly quickly.

However, deeper searches should still be made available.

This could be resolved as simply as adding advanced time-based flags, like "has issue updated".

enriquto · 5y ago

Ah, these Microsoft guys are so funny...

They purportedly developed a search engine that can grep the entire internet; but in reality they have trouble indexing a single one of their own websites.

forrestthewoods · 5y ago

This is very unfortunate. I regularly run into obscure issues but am able to find workarounds by searching GitHub for ancient projects that encountered similar edge cases.

andrewstuart · 5y ago

I would have found it much more useful if it continued to index all repositories, but instead removed the vast number of duplicate results, which make github search very challenging to get value from.

Old code that was not recently active is still valuable to search through.

throwaway889900 · 5y ago

So surely with all that extra power and storage freed up, they can start indexing code in forks that are actually actively maintained?

hansvm · 5y ago

Suppose somebody wanted to write their own code search backend; is anyone maintaining a common crawl of all github repos?

The_rationalist · 5y ago

I wonder how that's going to affect codota (the best code search engine to my knowledge) https://www.codota.com/code
