GitHub's search already sucks quite a lot, but for some things it's extremely useful. For example, when I want to know where my Rust library is used, I can use the toml filetype restriction and search for the name of my library. It shows way more results than the projects published to crates.io, as those projects are only a tiny subset. These projects might not see extremely regular updates, but I still consider them relevant information. I want it to be my choice whether to discard them or not.
So, 'anyone working on a larger codebase' has probably caused some activity on it within the last year.
I think it's fair enough, certainly if it were two or three years. Anyone who objects can have a GitHub Action amend an empty commit on an unused branch once a year anyway!
To give an example, in Rust the mime crate has over 14 million downloads, but the last git commit was in January 2020. So it will fall out of Github's search index soon, even though it's used by everyone who uses the reqwest crate, which is a lot of people.
In fact, likely some people will run cargo vendor and then upload the results to github often enough that it'll show up in search results, but instead of the canonical location you'll have to look at vendored repos... not really nice.
Emulating the "which projects use my library" feature of Github is harder though.
I guess Google will still index dormant repos.
https://archive.codeplex.com/?p=codeidx
GitHub's search is inferior to Google's site search for GitHub (i.e. keyword site:github.com).
I don't know why, but GitHub search is almost useless if the thing you are searching for is not an exact match. For example, I tried just one random search. Search keyword: HashMapBoxing https://github.com/google/guava/search?q=HashMapBoxing
No results for me, however here is one file that contains the word IdentityHashMapBoxing: https://github.com/google/guava/blob/master/guava-tests/test...
So I almost stopped using it. Either I use google search or I just clone the repo and use my IDE.
Edit: Ok, this search also fails with Google search, but still :)
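One plausible explanation for the miss, and it is only a guess since nothing in this thread confirms how GitHub tokenizes code, is that the index stores whole identifiers, so a query only hits when it equals a full token. A toy sketch of the difference (the snippet line is made up, not copied from the guava file):

```python
import re


def tokenize(source: str) -> set[str]:
    """Split source text into whole identifiers, lowercased."""
    return {tok.lower() for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)}


def token_match(query: str, source: str) -> bool:
    """Hit only if the query equals a whole identifier in the source."""
    return query.lower() in tokenize(source)


def substring_match(query: str, source: str) -> bool:
    """Hit if the query appears anywhere in the source, grep style."""
    return query.lower() in source.lower()


snippet = "// see IdentityHashMapBoxing"  # hypothetical line of source
print(token_match("HashMapBoxing", snippet))          # False
print(token_match("IdentityHashMapBoxing", snippet))  # True
print(substring_match("HashMapBoxing", snippet))      # True
```

Under that assumption the behaviour above is expected: HashMapBoxing is never a token on its own, only a suffix of one.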
However, I'm kind of surprised by a blanket purge like this, as I figured they would take repo stars and other signals into consideration to decide what to purge.
Or maybe have a keyword for searching old code, or search if current results seem bad, etc.
Would it be possible to index externally?
Eng: "Our elasticsearch database has been growing exponentially for years, and in order to keep search alive we need to completely rearchitect it. We estimate it will take a team of 8 people, 1 year to launch a stable alternative."
Product: "What if we just reduce the size of the index by removing inactive repositories? This will allow the team to focus on revenue generating features."
Eng: "Yeah, that should keep our current solution running for the foreseeable future without significant impact to our customers."
Guess we can all write cron jobs that commit useless crap yearly.
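For the record, such a job is only a few lines. A minimal sketch, assuming an empty commit on a throwaway branch actually counts as activity (that is this thread's guess, not something GitHub has stated) and using a made-up branch name, keepalive:

```python
import subprocess
from datetime import date
from typing import Optional


def keepalive_commands(branch: str = "keepalive",
                       today: Optional[date] = None) -> list[list[str]]:
    """Build the git commands for one empty, dated commit on `branch`."""
    today = today or date.today()
    message = f"keepalive: {today.isoformat()}"
    return [
        # Create the branch at the current HEAD, or reset it if it already exists.
        ["git", "checkout", "-B", branch],
        # --allow-empty records a commit even though no files changed.
        ["git", "commit", "--allow-empty", "-m", message],
        # Force-push because the throwaway branch is rewritten on every run.
        ["git", "push", "--force", "origin", branch],
    ]


def run(repo_dir: str) -> None:
    """Execute the keepalive commands inside `repo_dir`, stopping on the first failure."""
    for cmd in keepalive_commands():
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling run("/path/to/repo") from a yearly cron entry would do it. Note that it leaves the working copy on the keepalive branch, so point it at a dedicated clone.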
If showing in the search index is what resets the clock (in addition to making commits), how do you search for something that's not in the index?
Sorry, what resets the clock?
And for anyone using Github to search in their own code -- Ripgrep works really well even if you run it against your whole code directory and gives you instantaneous results (if you use an SSD!). I'm describing my code search setup with ripgrep + emacs here [1]
I already find myself using it a lot - recent example search: https://ripgrep.datasette.io/-/ripgrep?pattern=changelog&ign...
More about that project here: https://simonwillison.net/2020/Nov/28/datasette-ripgrep/
I wish I was half as good at building a program like him. Instead I'm here struggling building my own RPI4 Radio player =(
The author did give some hints on how he built it, e.g. indexing/search is driven by Apache Solr on a couple of 20-core machines.
Regarding data ingestion you probably can look at some prior art like this: https://github.com/garysieling/solr-git
There are also some pretty decent books referenced in the Solr docs: https://lucene.apache.org/solr/resources.html
(compare the results from https://cs.opensource.google/ vs. GH search)
E.g. if I search for 'haxeflixel' (a game engine) I get nothing on cs.opensource.google, but 532 repos appear on github search.
This is sad, TBH. I've found global search very useful for searching for example-uses of rarely-used libraries. Having all that at my fingertips was useful.
I understand legacy examples aren't useful for everyone, but for me they often were, and now a lot of code will be completely unfindable. :/
Think of niche programming languages and the like that've passed their heyday - I guess tagging might help, but a lot of people don't bother tagging projects.
Also, I guess this includes one's own projects? I have about a hundred repos of various ages, and it's easy to lose track of them. Not being able to search through my own code sounds like a bit of a bummer (Though I don't have an intuition for how often I search for stuff in my own repos, TBH).
Pity there wasn't a better solution available to solve their problems.
Skipping old code might reduce the noise, but people will always complain.
A simpler version for global code searches, which for all I care could be more literal, like matching only exact substrings. It would still allow discovering usages of obscure APIs.
And more comprehensive for searching inside a selected repo. Not sure which one is the problem for them at the moment.
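A literal, exact-substring global search does not have to mean scanning every file. One known approach is a trigram index: record which files contain each three-character sequence, intersect those sets for the query, then confirm the survivors with a plain substring check. A minimal in-memory sketch (TrigramIndex and its methods are illustrative names, and nothing here describes what GitHub actually runs):

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every three-character window in `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self.docs: dict[str, str] = {}
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, name: str, text: str) -> None:
        """Store a document and record which trigrams it contains."""
        self.docs[name] = text
        for gram in trigrams(text):
            self.postings[gram].add(name)

    def search(self, query: str) -> list[str]:
        """Return the names of documents containing `query` as an exact substring."""
        grams = trigrams(query)
        if not grams:
            # Queries shorter than three characters cannot use the index.
            candidates = set(self.docs)
        else:
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # The index only narrows the candidates; verify each with a real substring test.
        return sorted(n for n in candidates if query in self.docs[n])
```

With a file containing IdentityHashMapBoxing indexed, search("HashMapBoxing") finds it, which is exactly the kind of partial match that fails in the guava example elsewhere in this thread.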
I'm not sure what you meant by this. How is this the case when repos with no commit within the last year (which is extremely common) are no longer indexed?
I suppose I take it for granted that ridiculously huge search indices are a solved problem, but it turns out they aren’t.
To be clear, Github search has been very useful, just extremely sensitive and finicky.
Honestly I've always hated Github's full text search on code. Give me an amped-up per-repo grep and scrap global code search entirely. Maybe I'm lacking in imagination, but I can't see how full text search is useful to anybody.
However, deeper searches should still be made available.
This could be resolved with something as simple as advanced time-based flags, like whether the repo has had an issue updated.
They purportedly developed a search engine that can grep the entire internet; but in reality they have trouble indexing a single one of their own websites.
Old code that was not recently active is still valuable to search through.