GitHub search can't even search for a literal string, let alone a regex. It can't search a subdirectory. Ranking is indistinguishable from random. It's been this way for years. How about building an actual, usable, basic code search and then getting all fancy with your machine learning?
I almost built my own "online git grep for GitHub" last year.
Disclaimer: no affiliation, just love the team and product.
P.S. I'm not affiliated, we just use Sourcegraph at a company I work for.
Currently it runs on a fairly slow machine, so regex-heavy requests will take some time on big package repositories like Rubygems, but I plan to get a nicer machine soon.
If you know Scala, you can even contribute (wink wink), just ping me. A lot of tasks we have at this stage are pretty basic.
A good example is that GitHub's own repo for their CommonMark implementation isn't searchable, because it's a fork of cmark: https://github.com/github/cmark/
Take for example,
for(int i=0;i<100;i++)
And then a search for i++. Due to the way almost every search tool works, that line would be split into the tokens "for int i 0 100", which are not very useful. Even if you include the characters = ; < + ( ) in the search, you break the ability to do things such as boolean queries or fuzzy search (term~1).
It's totally possible to solve these issues by tweaking the input into your index, which is what I did with searchcode.com, or with a different approach, which is what Google Code Search did. However, neither has a requirement to be 100% in sync with the repository, which I suspect is something that the GitHub team values.
All the code search tools suffer from this in some way. At small scale it's possible to just brute force the search. At scale you can do it by tweaking your algorithm and sacrificing accuracy. My feeling is that the GitHub team chose accuracy.
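To make the tokenization problem above concrete, here is a minimal Python sketch. The word tokenizer is only a stand-in for a natural-language analyzer, and the trigram set is a much simplified version of the trigram-index idea behind Google Code Search; the function names are made up for illustration and are not taken from any of the tools mentioned.

```python
import re

def word_tokens(text):
    """Natural-language style tokenizer: keep runs of letters and digits only."""
    return re.findall(r"[A-Za-z0-9]+", text)

def trigrams(text):
    """Every overlapping 3-character substring, punctuation included."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def may_contain(index, query):
    """Candidate filter: every trigram of the query must appear in the index.

    This only narrows down candidates; a real engine would still scan the
    file to confirm the match. Queries shorter than 3 characters have no
    trigrams, so they pass the filter for every file.
    """
    return trigrams(query) <= index

line = "for(int i=0;i<100;i++)"
print(word_tokens(line))                   # ['for', 'int', 'i', '0', 'i', '100', 'i']
print(word_tokens("i++"))                  # ['i']
print(may_contain(trigrams(line), "i++"))  # True
print(may_contain(trigrams(line), "i--"))  # False
```

With word tokens, a search for i++ degrades to a search for i. With trigrams the punctuation survives, at the cost of a larger index and a second verification pass.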
[0] https://github.com/etsy/hound
`{search query} -site:github.com/{repo}/{file i want to target}`
It's much clearer and more concise.
My biggest gripe is that the order results show up in seems to be totally random. For example, if I have a Java class called A and I search "class A" in code search, the actual A.java doesn't tend to show up anywhere near the front. I just tried this in a repo and the actual A.java file was on the last page of results when I searched "class A". The vast majority of the results before it didn't even have the words "class" and "A" next to each other, which A.java does...
Maybe I'm doing something wrong (I'd welcome any input on how to use code search correctly!), but it just feels like they're jumping the gun on trying to make their code search more advanced when the basic functionality doesn't work that well.
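The ranking the comment above expects can be sketched in a few lines of Python. This is a toy scoring function under my own assumptions (the weights 10, 5 and 1 are arbitrary, and the file contents are invented); it is not how GitHub ranks results, only an illustration of "adjacent words and a matching file name should come first".

```python
import re

def score(path, content, query):
    """Toy relevance score: exact phrase hits outrank scattered word hits."""
    words = query.split()
    points = 0
    # Exact phrase with flexible whitespace, e.g. "class A" as adjacent words.
    phrase = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    if re.search(phrase, content):
        points += 10
    # Bonus when the file name (without extension) equals a query word.
    stem = path.rsplit("/", 1)[-1].split(".", 1)[0]
    if stem in words:
        points += 5
    # Small credit for each query word present anywhere as a whole word.
    points += sum(
        1 for w in words if re.search(r"\b" + re.escape(w) + r"\b", content)
    )
    return points

# Hypothetical repository contents.
files = {
    "src/A.java": "public class A { }",
    "src/B.java": "public class B { A a; }",
    "docs/notes.txt": "A short note about the build",
}
ranked = sorted(files, key=lambda p: score(p, files[p], "class A"), reverse=True)
print(ranked)  # ['src/A.java', 'src/B.java', 'docs/notes.txt']
```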
The search appears to be configured for natural language documents, not code. The stopwords are not right and search appears to strip all sigils. They could get pretty far just by parsing documents and changing their Lucene/Elasticsearch configuration.
And of course good old regex search.
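The stopword and sigil point above can be illustrated with two toy analyzers in Python. Both are assumptions for the sake of the example: the stopword list is invented, and neither function mirrors an actual Lucene or Elasticsearch analyzer.

```python
import re

# Hypothetical English stopword list, standing in for a prose analyzer's defaults.
STOPWORDS = {"for", "if", "in", "is", "not", "or", "the", "this", "while"}

def prose_analyzer(text):
    """Lowercase, drop punctuation, drop stopwords: reasonable for prose."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def code_analyzer(text):
    """Keep identifiers, numbers, and each punctuation character as a token.

    No stopword removal and no case folding, since both carry meaning in code.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", text)

snippet = "if (!this.$el) return;"
print(prose_analyzer(snippet))  # ['el', 'return']
print(code_analyzer(snippet))   # ['if', '(', '!', 'this', '.', '$', 'el', ')', 'return', ';']
```

The prose analyzer throws away if, this, and every sigil, which is exactly the information a programmer is usually searching for.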
"is:pr is:open ( author:bob OR author:jim )"
The lack of this pretty basic functionality makes issue & PR search much less useful than it could be.
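The grouping in that query is simple to express once filters are composable. The Python sketch below uses hypothetical records and helper names (field, all_of, any_of); it shows the intended semantics of the query, not any real search API.

```python
def field(name, value):
    """Predicate: the record's field equals the given value."""
    return lambda item: item.get(name) == value

def all_of(*preds):
    """AND: every predicate must hold."""
    return lambda item: all(p(item) for p in preds)

def any_of(*preds):
    """OR: at least one predicate must hold."""
    return lambda item: any(p(item) for p in preds)

# Equivalent of: is:pr is:open ( author:bob OR author:jim )
query = all_of(
    field("type", "pr"),
    field("state", "open"),
    any_of(field("author", "bob"), field("author", "jim")),
)

# Hypothetical issues and pull requests.
items = [
    {"id": 1, "type": "pr", "state": "open", "author": "bob"},
    {"id": 2, "type": "pr", "state": "closed", "author": "jim"},
    {"id": 3, "type": "pr", "state": "open", "author": "jim"},
    {"id": 4, "type": "issue", "state": "open", "author": "bob"},
    {"id": 5, "type": "pr", "state": "open", "author": "alice"},
]
print([item["id"] for item in items if query(item)])  # [1, 3]
```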
1. exact or close string searches for code that involves ![]{}_-*() etc characters
2. searches across past commits (e.g. find a line that used to be in the code)
4. search across pull request + comments (not just issues and commit messages)
5. advanced search operators -- there should be a full filtering UI with ands and ors etc
Because of this I often find myself grepping locally, or (more often) totally out of luck.
GitHub is used by programmers. Surprisingly, they tend to be very good at telling computers precisely what they want, in the computers’ own language.
Natural language search is the exact opposite of this, invented for mom & pops who start their search phrase with “Dear Google, I’d like to search for ...”.
This is an incredible waste of time and resources that could be spent making the existing search far better with very minor tweaks. A perfect example of big company project management where nobody seems to know what their users actually want.
Please build search that lets me actually find a given file by name.
You are busy building a space rocket when all we want is a bicycle. Impressive, but useless for just popping down to the shops.
Love,
The rest of the world's developers
• 2 openings - Business Systems
• 2 openings - Communications
• 38 openings - Engineering
• 3 openings - Finance
• 1 opening - Internal Communications
• 4 openings - Legal
• 8 openings - Marketing
• 2 openings - People Operations
• 1 opening - Policy
• 7 openings - Product
• 8 openings - Sales
• 9 openings - Security
• 1 opening - Services
• 3 openings - Support
Sure, they may not be addressing your/my specific concerns, but the product is changing.