GitHub's language detection is broken

(github.com)

79 points | Allan_Smithee | 12y ago | 88 comments

DannoHung 12y ago

Why don't they just let project maintainers say, "This project contains x, y, and z" or something? That'd at least let them get a leg up on doing the categorization right and I don't think many people would mind having that capability.

mindcrime 12y ago

+1000. Github routinely detects the wrong language for my projects, and there is no way to manually override it. My take is this: If you want to auto-detect the language, fine... but let the owner of the repo override your detection when it's wrong.

It's probably also a bug to even have the notion of "a language" for a repo given the burgeoning polyglot programming trend. So many repos these days contain multiple languages, especially when you consider javascript, that I question if it even makes sense to say 'This project is in language X' at all.

Like you say, the best option really would be to let the repo owners / maintainers just specify this stuff. They are, after all, the ones who know.

itcmcgrath 12y ago

I wish I had more up buttons. Sometimes you can be too smart for your own good, and the good old-fashioned way is superior...

Note: I'm not saying they shouldn't have the auto-detection, because it definitely helps if the maintainer doesn't do it, but for those that want to help classify things - let them!

ancarda 12y ago

Actually, a way to turn off that feature would be nice. It adds very little value at the cost of it taking days to update. It also marks my dotfiles repo as "VimL" which means any auto-resume tool will assume I know VimL, when I don't. Funny thing is it marks my .vimrc as Perl, not VimL.

elwell 12y ago

I disagree; I think the process should be as streamlined as possible. However, I could see auto-detection balanced with a confidence threshold which, when not met, would ask the user:

"Sorry, I couldn't determine if you had C code in your repo or is that Limbo code?"

Allan_Smithee (OP) 12y ago

You mean like the other source code repository hosts do?

pedalpete 12y ago

I think the idea of automated language detection is pretty cool, but why doesn't github just give you the option of correcting it, or labelling it with the language you prefer?

For example, I've got JavaScript modules in repositories. For each module, I make a demo version to show what the module does, and that demo includes a bunch of css. Apparently, there is more css than there is Javascript, so GitHub labels the module as css, but the important part isn't css, the important part is the javascript. In order to resolve this, I've had to move the css into a different repository, and ignore it in the javascript repository. Seems like a long way around, when all I want to do is correct them and say that the module is actually a javascript module.

ZoFreX 12y ago

I actually much prefer BitBucket's way of doing things, for exactly this reason. It doesn't even try to detect - it just asks me. Sometimes the simplest solutions are the best.

michaelmior 12y ago

Language detection as discussed in the link is per-file. I don't think overriding individual files makes sense since it's likely to be more trouble than it's worth. But I can understand the desire to change the detected language of the project.

Allan_Smithee (OP) 12y ago

How 'bout a project-specific property list that looks something like this:

  .rb=RealBasic
  .m=Mercury
  .pl=Prolog
  .js=SomeCrapOrOther
  …
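For illustration, here is a rough sketch of how a per-project property list like the one above might be consulted before falling back to a site-wide guess. It is written in Ruby only because Linguist is; the `OVERRIDES` and `DEFAULTS` tables and the `detect_language` helper are hypothetical names made up for this example, not anything Linguist provides.

```ruby
# Hypothetical per-project overrides, keyed by file extension.
OVERRIDES = {
  ".rb" => "RealBasic",
  ".m"  => "Mercury",
  ".pl" => "Prolog"
}.freeze

# Hypothetical site-wide defaults used when the project says nothing.
DEFAULTS = {
  ".rb" => "Ruby",
  ".m"  => "Objective-C",
  ".pl" => "Perl",
  ".js" => "JavaScript"
}.freeze

# Return the project's override for a file if one exists,
# otherwise the default guess (nil when neither table knows the extension).
def detect_language(path, overrides = OVERRIDES, defaults = DEFAULTS)
  ext = File.extname(path)
  overrides.fetch(ext) { defaults[ext] }
end
```

With these tables, `detect_language("solver.m")` yields the project's choice (Mercury) while `detect_language("app.js")` still falls through to the default, which is the "auto-detect, but let the owner override" behaviour several commenters ask for.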

013 12y ago

This 'lewellyn' person seems to be complaining about the lack of support for Limbo, a language for the Inferno OS. Both seem quite outdated and out of use. He also complains about how Github is focusing on 'cool' kid languages, which I am guessing refers to modern, popular languages (if this is the definition of cool, then yes, they are). If I were Github, I would do the same. It's called priorities. I kind of get the vibe that lewellyn is some kind of 'hipster': his obscure language is better than the 'cool' kids' simply because he's using it. I also wouldn't phrase it as "GitHub's language detection is broken"; it's merely missing a feature/language.

choult 12y ago

I suspect his - rather labored - point is that there are multiple use cases to show that the design of Linguist's configuration is flawed as a rule and not an exception, and the lack of attention paid to this particular issue is perhaps indicative of a more general Github attitude towards the less trendy languages and technologies out there.

scott_s 12y ago

Which I think is an uncharitable way of saying "Github prioritizes working on things that will impact the most people."

hackcasual 12y ago

From my reading of the issue it seems that he's complaining about a limitation of the tool Linguist. There's a suggested fix that doesn't, as far as I can tell, involve changing how Inferno code is written. My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language. In this case if the primary extension was .inferno and .m was in other extensions it seems that the sample code would be used by the classifier to distinguish between inferno, MATLAB, and obj-c.

To me this comes off as assuming the worst intentions on behalf of the github developers.

nox_ 12y ago

> My understanding is that primary_extension is simply used to short circuit analysis when it's unique to a language.

No, the primary_extension is only used in a gists_helper.rb file outside the Linguist repos. Note that the feature is deprecated anyway.

https://github.com/github/linguist/blob/master/lib/linguist/...

DCKing 12y ago

Nobody's arguing that GitHub shouldn't prioritize popular languages. I don't think that the responses to this pull request show 'prioritization' however, they show incompetence and close-mindedness instead.

skywhopper 12y ago

More likely a lack of time. If you'd like to see it fixed, start contributing to the fix.

apetresc 12y ago

Yeah, kind of a self-important hipster too.

> Basically, Github needs to be accepting of programmers of all stripes, or they are destined to be irrelevant (or at least doing lots of scrambling) once the trendy kids move on from the trendy things they're doing and the currently-popular languages start falling out of style with a reversion to a previous status quo. Github needs to accept that there is a vast wealth of code out there which predates it and which will easily postdate it.

Okay there, buddy. I don't think lack of Limbo support is going to be GitHub's eventual downfall.

jperkin 12y ago

Their language detection is indeed terrible. I have a repository (https://github.com/jperkin/pilights) which is entirely composed of shell scripts and a single markdown README. GitHub's analysis?

  Perl 83.5%	  Shell 16.5%

There is not a single .pl or .pm file, nor a single mention of 'perl' anywhere in the repository, and all scripts begin with #!/bin/sh.

A number of my other repositories have similar problems, but this one is by far the worst.

LeonidasXIV 12y ago

Sorry, I see a huge blue bar saying 100% shell.

theOnliest 12y ago

When I looked earlier this morning, it looked the way OP said... it looks like something has changed in the intervening three hours or so.

bru 12y ago

Incriminated comment: https://github.com/github/linguist/pull/748#issuecomment-374...

> if you'd like Mercury language detection on GitHub then with the current implementation of Linguist you need to pick a different (unique as Objective-C already defines this) primary_extension and add .m to the extensions array which will force Linguist into using the other detection methods mentioned above.

moron4hire 12y ago

What, then, is the point of the primary_extension field?

EDIT: or as I like to yell at Github for Windows when it can't revert out of a merge conflict "WHAT IS EVEN THE POINT OF YOU?!"

skywhopper 12y ago

It's just a design error in the original implementation where linguist assumes that the "primary_extension" for any particular language will be unique among all primary_extensions. Obviously that was a mistake, but that's where we are. The comment that set people off was perhaps poorly worded, but it was an honest suggestion to work around the design bug.
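To make the uniqueness assumption concrete, here is a toy Ruby sketch. The records and both helper functions are hypothetical; the field names only mirror the ones mentioned in this thread, and nothing here is taken from Linguist's source. The first index keeps one language per primary extension, so a second language claiming `.m` silently displaces the first. The second index keeps a list of candidates per extension and leaves the choice to a later step.

```ruby
# Hypothetical language records; the field names mirror the discussion,
# not Linguist's real data file.
LANGUAGES = [
  { name: "Objective-C", primary_extension: ".m",      extensions: [".h"] },
  { name: "Mercury",     primary_extension: ".m",      extensions: [] },
  { name: "MATLAB",      primary_extension: ".matlab", extensions: [".m"] }
].freeze

# One language per primary extension: a later entry with the same
# primary extension silently replaces the earlier one.
def index_by_primary(languages)
  languages.each_with_object({}) do |lang, index|
    index[lang[:primary_extension]] = lang[:name]
  end
end

# Alternative: every extension maps to a list of candidate languages,
# leaving the final choice to a later disambiguation step.
def index_by_any_extension(languages)
  languages.each_with_object(Hash.new { |h, k| h[k] = [] }) do |lang, index|
    ([lang[:primary_extension]] + lang[:extensions]).each do |ext|
      index[ext] << lang[:name]
    end
  end
end
```

The MATLAB record shows the suggested workaround: an arbitrary unique primary extension, with the real `.m` pushed into the secondary list.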

RyanZAG 12y ago

Probably if the classifier can't determine the language, it will fall back to the primary_extension field - as it should, if the classifier can't determine if a .m file is objc or mercury, it should and will default to objc.

Classification is never 100% accurate.

EDIT: The exact way it is used is reported here: https://github.com/github/linguist/pull/748#issuecomment-374...

eCa 12y ago

I have a few Perl projects on Github that use Bootstrap. Main language (according to Github): Javascript.

I expect that Javascript's github popularity ranking is (a little bit) inflated due to such issues.

mindcrime 12y ago

I expect it's a lot inflated. I also have repos that are primarily Groovy, but show up as "Javascript" due to the presence of JQuery, Bootstrap, etc.

Allan_Smithee (OP) 12y ago

Unfortunately, until this core issue is fixed, users can't really submit further pull requests to fix the other issues which would correct the "inflation" we all know and hate.

natebrennand 12y ago

They actually have a set of files and libraries that they ignore (.DS_Store/jquery/bootstrap/etc.)

https://github.com/github/linguist/blob/master/lib/linguist/...

cl8ton 12y ago

I've been in the programming trenches since the early '90s, fluent in 5 languages at the production level, and I have to say I have never heard of the language 'Limbo'. I don't fault GitHub one bit.

I suppose I could Google it and act like I know… naw

Allan_Smithee (OP) 12y ago

I don't think the point was to complain about Limbo being missing. I think the point was to show that saying "Objective-C is the only language which can ever use .m as its primary extension" affects more than just the two languages listed earlier in the PR. The PR itself is about Mercury, after all.

Are people even reading the context of the rest of the PR?

rkangel 12y ago

I'm pretty sure there's no one with an encyclopaedic knowledge of programming languages. The industry is enormous, and just because you haven't heard of it doesn't mean that it's irrelevant. "Niche" is not the same as "irrelevant".

And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.

nahname 12y ago

>And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.

I don't follow. That sounds like the exact criteria for something to be ignored.

akerl_ 12y ago

Given infinite time and developer bandwidth, sure. But we don't live in that world, so "do the work that gives the most benefit to the most users" remains the preferable real-world strategy.

kalleboo 12y ago

From the thread it looks like there are over 6 languages that use .m as the filename extension (including both MATLAB and Mathematica which you may have heard of), meaning the whole concept of a unique "primary_extension" is kind of ludicrous.

fastball 12y ago

I thought that Mathematica uses the .nb extension?

KingMob 12y ago

True. But I'll bet you've heard of Matlab, which also uses .m, and is just as old as Obj-C. Matlab is everywhere in scientific computing.

Allan_Smithee (OP) 12y ago

Five languages across two decades?

Stand back, gents! This one is a champion!

Aqwis 12y ago

I don't really care about Limbo, but GitHub seems to think my .m files are all M(UMPS) files, and not Matlab files, the most obvious choice. Highly annoying.

RyanZAG 12y ago

I don't get it - why is a bunch of people trolling the github project with fairly irrelevant arguments interesting? Could someone who upvoted this explain the logic?

girvo 12y ago

How is having a differing opinion "trolling"? Seriously, this word has lost all meaning at this point.

akerl_ 12y ago

Ignoring the suggested workarounds (setting a unique primary extension and then having the correct extension in the array, for instance) and continuing to rampage in the comments in an attempt to stir up the masses seems like the canonical example of trolling an online community.

KingMob 12y ago

Seriously? I have an open-source Matlab project from my time in academia that's been misclassified as an Obj-C project in the past. Less popular languages are used all over, especially for more niche industries.

jbranchaud 12y ago

While it is unfortunate that a pull request on this project has been around for 5 months without much progress, I think the commenter is being a bit dramatic. He is acting as if GitHub is blocking all commits with Limbo code. The language can still be under version control, it just might not have syntax highlighting and its own color in the repo stats bar.

GitHub isn't discriminating against certain programmers. Stay calm and keep coding!

bjz_ 12y ago

> it just might not have syntax highlighting and its own color in the repo stats bar.

It is discriminating, and harmful to all programmers. We need to be able to easily search for these lesser known languages – they are important cultural works. The commenter points out: "Limbo ... seems to have heavily inspired Go (which is currently extremely fashionable)". We are worse off for not having our history readily accessible.

mehwoot 12y ago

All they are asking is to arbitrarily specify some other extension as the "primary" extension and have ".m" as another extension. Users will still see the same end result.

Allan_Smithee (OP) 12y ago

Unless they use gist.

kalleboo 12y ago

I miss Mac OS Classic Filetype/Creator codes... Filename extensions are such an ugly hack.

hyperpape 12y ago

Great example of worse is better in action.

deutronium 12y ago

Could they use Bayesian classifiers, trained on a corpus of different languages and concentrating primarily on the symbols used in each language?
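As a rough idea of what that could look like, here is a toy naive Bayes classifier in Ruby that looks only at punctuation. Everything in it is an assumption made for illustration: the `SymbolClassifier` class, the choice of "any non-word, non-space character" as a symbol, the uniform prior, and the two tiny training snippets. It is not how Linguist's own classifier works, and a usable version would need far more training data.

```ruby
# Toy naive Bayes classifier over punctuation symbols only.
class SymbolClassifier
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) } # language => symbol => count
    @totals = Hash.new(0)                            # language => total symbols seen
    @vocab  = {}                                     # every symbol seen in training
  end

  # Non-word, non-space characters stand in for "the symbols used in the language".
  def symbols(source)
    source.scan(/[^\w\s]/)
  end

  def train(language, source)
    symbols(source).each do |sym|
      @counts[language][sym] += 1
      @totals[language] += 1
      @vocab[sym] = true
    end
  end

  # Log-likelihood with add-one smoothing; uniform prior across languages.
  def score(language, source)
    denom = @totals[language] + @vocab.size
    symbols(source).sum do |sym|
      Math.log((@counts[language][sym] + 1).to_f / denom)
    end
  end

  # Pick the trained language with the highest score.
  def classify(source)
    @counts.keys.max_by { |language| score(language, source) }
  end
end

classifier = SymbolClassifier.new
classifier.train("Objective-C", "@interface Foo : NSObject\n- (void)bar;\n[self bar];\n@end")
classifier.train("MATLAB", "function y = f(x)\n  y = x .^ 2; % square\nend")
```

After training, `classifier.classify("[obj doThing];\n@end")` leans towards Objective-C because of the brackets and the `@`, while `classifier.classify("z = a .^ b;")` leans towards MATLAB because of `=`, `.` and `^`.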

dclowd9901 12y ago

I think if I was writing a language detector, it would have these features:

- learning heuristics based on user suggestion.

- extension filtering to differentiate similar languages.

- the algo would use prominence and placement of white space and non-word characters to create the DNA of a language. If the language scores below a threshold against the DNA, it doesn't presume, it asks the user. If a language scores high against this DNA, it still allows user override. Whenever a user would submit their indicator, its file source would be used to train the heuristic.

Allan_Smithee (OP) 12y ago

This is because you likely think before you code.

awalton 12y ago

> My esoteric programming language isn't properly supported by the popular kids' web tool that I'm likely not even paying to use in the first place. I'm OUTRAGED!

Yep, seems about right.

hk__2 12y ago

Also, this is untrue. Omgrofl is supported on GitHub, even if nobody uses it.

johnduhart 12y ago

And posting it to HN of all places is hilarious.

joeblau 12y ago

What's interesting about this PR is that this case was actually one of the reasons that I created http://www.gitignore.io. GitHub's original repo for .gitignore templates had nearly 1000 open PRs until around Oct 2013 so I built my own repo that would actually accept PRs. Since then, a few employees have worked on accepting PRs, but I had a similar feeling of frustration. Unfortunately, the OP can't just fork this repo because its features are integral to how GitHub works, whereas I was able to hack around the system and create a separate product.

skywhopper 12y ago

The rant linked to appears to misunderstand the problem and the workaround. @arfon admits there's what amounts to a design bug in Linguist, and so to identify ".m" files, you have to identify a different extension as the "primary" and put the real extension into the "alternate" list. That's a hacky workaround, but it would make the pull request work.

The alternative is to fix the design issue. But that's going to be a lot harder and require more than a few days.

nbouscal 12y ago

@arfon doesn't admit that there's a bug, rather he says that "requiring a unique primary_extension isn't really a 'bug', rather it's a consequence of how language detection works in Linguist."

The work to fix the design issue was already done by @nox, who submitted a pull request which is still open: https://github.com/github/linguist/pull/985

hyperpape 12y ago

I guess the charitable reading is "this isn't a bug, it's more of a bad design choice, and we can't just fix it overnight".

But I honestly can't tell if that's what he meant, or if it was more of a "not my problem" type of response.

eXpl0it3r 12y ago

I can agree that the detection is broken. C++ gets often recognized as C. PHP with some CSS file gets recognized as mostly CSS, etc.

Personally I'd like to have a fixed language that I can set and that the search will use. Next to that, it would be fine for me to show statistically what the repository contains, but please use better language detection; just going by extensions is quite naive.

nox_ 12y ago

> C++ gets often recognized as C.

The disambiguation test for C++ headers is ridiculous:

      matches << Language["C++"] if data.include?("#include <cstdint>")

Allan_Smithee (OP) 12y ago

Well, I expect that's why so much C++ is misrecognized. Not enough people write valid C++, in Github's narrow world view. :)

mcovey 12y ago

I wish I could pick the language so I could upload shell scripts without extensions, but it doesn't even read the shebang line.
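Reading the shebang line is not much code. A minimal Ruby sketch, assuming a hypothetical `INTERPRETERS` table and a made-up `language_from_shebang` helper (neither is part of Linguist), might look like this:

```ruby
# Hypothetical interpreter-to-language table; extend as needed.
INTERPRETERS = {
  "sh" => "Shell", "bash" => "Shell", "zsh" => "Shell",
  "perl" => "Perl", "python" => "Python", "ruby" => "Ruby"
}.freeze

# Guess a language from the first line of a script, or return nil
# when there is no shebang or the interpreter is not in the table.
def language_from_shebang(source)
  first_line = source.lines.first.to_s
  return nil unless first_line.start_with?("#!")

  parts = first_line[2..-1].split
  return nil if parts.empty?

  # "#!/usr/bin/env bash" names the interpreter in the second token.
  program = File.basename(parts[0])
  program = parts[1].to_s if program == "env"
  INTERPRETERS[File.basename(program)]
end
```

An extensionless script starting with `#!/bin/sh` would come back as Shell, and a file without a shebang returns nil so that other detection methods can take over.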

johnduhart 12y ago

Sorry, but was there actually something useful in that comment? I couldn't tell over the 6 paragraphs of childish moaning.

moron4hire 12y ago

Okay, but the automatically updating comments view is pretty cool. I didn't know Github did that. That is pretty awesome.

Allan_Smithee (OP) 12y ago

And that would be part of the problem with Github. Emphasis on "pretty cool" visual flair while letting fundamental architecture fly out that is flatly, and very obviously, just plain broken.

akerl_ 12y ago

Considering that the comment you're replying to said "that feature is pretty cool", and didn't even need to address the actual linked rant, it seems that not everyone agrees with your "this is just plain broken" viewpoint.

I use Github for the visual flair and cool features. If I wanted to run my own fundamental architecture, I'd be doing that.
