One of the biggest challenges I've found in implementing ML projects is that I don't have a great sense of when I've really gotten the most info out of the data. I'm not particularly competitive, but the contest format is great for this. When you see that a solution you'd normally be happy with ranks in the lower half of the rankings, you're really pushed to improve your solution.
This leads you to learn your tools and algorithms better. For a couple of contests I took seriously I ended up learning tons about R, spent most of my nights reading academic papers on various newer techniques, and also read through a few books. On top of all that, you really should spend time reading up on how past winners have won, which gives a bunch of practical insight into approaching different ML problems.
In the one contest I tried the hardest in, I actually placed terribly after the final results were calculated, but looking over what went wrong I was amazed to see that I actually did progress really far with my understanding of ML. I'd say a month of seriously competing is easily worth a semester-long grad class.
Would I be wasting my time attempting these with such a basic level of knowledge?
When I started I was in a similar position to you and just wanted to see if I could even tread water with some of the really knowledgeable members of the community. I ended up placing in the top 5 for one of the contests I was in (with btw a really simple model).
They usually give you some starter code in either R or Python which will give you the results for a benchmark. Start there, then use cross-validation to see if you can beat that benchmark, and if you do, submit. It's very addictive and you'll come away knowing a lot more than you started with.
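To make the "beat the benchmark under cross-validation, then submit" loop concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data, the "predict the mean" benchmark, and the one-variable linear fit stand in for whatever starter code and model a real contest would give you.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred)) ** 0.5


def fit_mean(xs, ys):
    """Hypothetical benchmark: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Hypothetical candidate: ordinary least squares on one feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cv_score(fit, xs, ys, k=5):
    """Average held-out RMSE over k folds (lower is better)."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return mean(scores)


# Toy data (made up): a noisy line.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

benchmark = cv_score(fit_mean, xs, ys)
candidate = cv_score(fit_line, xs, ys)
should_submit = candidate < benchmark  # only submit if you beat the benchmark
```

Both models are scored on the same folds, so the comparison isn't skewed by a lucky split.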
So let me ask you, where do you read up on how past winners have won?
How did you decide on algorithms to try out on a contest? How did you find promising academic papers?
For algorithms, just try whatever you know best/is fastest to implement. If you're using R I highly recommend the Caret package.
For papers: the best place to get started is to begin browsing the forums or any similar contests. The community there is actually pretty awesome and will frequently post papers. After that, Google Scholar (or even just Google) for particular problems will yield nice results.
Also check out the wiki: http://www.kaggle.com/wiki/Home
Additionally for contests like Heritage Health [0], I believe the necessary goal of RMSLE of less than 0.4 is not considered possible (I came across this in the forums but never verified), so even if the contestants just inch past 0.4 it would still be something impressive.
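For anyone unfamiliar with the metric, here is a small sketch of RMSLE as it is commonly defined (root mean squared difference of log(1 + value)). I'm assuming the contest uses this standard form; the function name and the sample values are mine, not from the contest.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example values, just to show the call.
score = rmsle([0, 1, 2], [0, 1, 3])
```

Because it works on logs, the metric penalizes relative error rather than absolute error, so being off by 1 on a small value costs more than being off by 1 on a large one.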
It's like the opposite of a professional organization. I suppose the libertarians approve. It drives down the cost of labor and therefore might make the market more efficient. Yet I'm suspicious.
I'd like to propose a counter-organization. Analysts can band together and offer a contest. We collaborate to create a tool that gives your company an X% increase in value. Companies bid for the rights to that tool. I'd expect that the value to the laborer would be greater than $Y/n. I guess that just described a consulting company.
Perhaps the situation is not so unique. Art also provides much value in the act of production and many organizations hold art contests similar in design to Kaggle competitions. Open-source software often doesn't even have a competition sponsor.
It'd be ludicrous to imagine holding a contest to offer the best legal advice or diagnosis. I'm not saying that I agree with the restrictions that the American Medical Association has placed over the ability to attend medical school, but the free market is harsh enough competition.
Kaggle does promote the value of the field as a whole. I worry that it commoditizes rather than professionalizes.
The moderator situation on Stackoverflow is getting out of control. I see a Q&A site as having three main groups:
1. People who ask questions;
2. People who answer questions; and
3. People who edit/moderate questions.
Even 2+ years ago there was a lot of lip service paid to the value of (3). I disagreed then and it's only been reaffirmed by subsequent events. To be clear: it's not that I think these functions have no value, it's that they are, at best, secondary to content creation.
The problem is that these roles, without diligent oversight, attract the wrong kinds of people (eg [1] [2] and a scandal a few years ago about an admin black list that I can't seem to find right now).
Take this question from Stackoverflow: Database development mistakes made by application developers [3], a question I spent some time answering and that people seemed to appreciate the answer to (based on comments and 1000+ upvotes). It is closed as "not constructive". This is hardly a unique phenomenon. We've all seen many interesting questions posted here that are now closed or locked and who knows how many have been deleted.
The kind of person you end up with is overly pedantic and a real stickler for an arbitrary set of rules.
Editors/moderators are the bureaucrats of the Internet.
As Oscar Wilde said, “The bureaucracy is expanding to meet the needs of the expanding bureaucracy.” [4]. These sorts of people just invent work for themselves in the absence of anything to do.
Joel needs to make some changes to Stackoverflow. It's rapidly going the way of the old Usenet days when anything interesting gets shot down and anything else gets closed and the OP lambasted for not having found the 17 previous duplicates. Not good.
The biggest problem I see is an extreme interpretation of what is "subjective". "What language should I learn?" is an obviously subjective question. In the absence of any concrete criteria, it's hard to give a useful answer.
But consider a question like "What are the pros and cons of Sinatra vs Rails?" This sort of question (IMHO) absolutely has value as someone experienced with both could enumerate the relative merits of each in a pretty objective fashion without making an absolute determination. This is something that absolutely could have value to anyone evaluating Ruby Web frameworks.
So, back to this post, what are the odds of any particular question being closed? It seems to be positively correlated with how much time has passed (since SO's inception) and how interesting the question is.
[1]: http://www.nbcnews.com/technology/technolog/wikipedia-admins...
[2]: http://www.searchenginepeople.com/blog/most-notorious-wikipe...
[3]: http://stackoverflow.com/questions/621884/database-developme...
[4]: http://www.goodreads.com/quotes/130452-the-bureaucracy-is-ex...
I guess, but Zookeepers could also potentially talk about "What are the pros and cons of Gorillas vs Sharks?"
http://blog.stackoverflow.com/2011/08/gorilla-vs-shark/
> Database development mistakes made by application developers
This is a discussion, not a question. The entire text of said "question" is, quite literally, "What are common database development mistakes made by application developers?" If it can have infinite answers, is it really a question?
http://stackoverflow.com/questions/621884/database-developme...
Great post, indeed, but it belongs on your blog.
One of the biggest misconceptions about Stack Exchange is this idea that discussion is, in and of itself, a net good to the world -- and therefore we are monsters for not allowing discussion. I do not believe this to be true. There is, and will always be, an infinity of discussion. Like Jay Leno once said about Doritos, "type all you want, we'll make more". If something can be had in infinite amounts, what is its value?
Stack Exchange supports only the minimal subset of discussion necessary to get practical, useful answers to specific questions. The goal is not discussion, but science-in-the-small. Back up your claims. Show us references. Show us data. Share your specific experiences.
Otherwise you end up with Quora, a system where everything is a discussion, and all answers are opinions. Thus they can only be evaluated based on how famous the poster is, or how compelling a yarn they can spin.
Nothing against Robert Scoble and great storytelling (I used to work with Joel Spolsky, after all), but I've seen where that system leads. Given a choice, I'll always take tiny science. You should too.
It is a question, an open question intended to provoke debate and teach about a subject, in effect it's a request for an FAQ. Now perhaps SO is not intended to be for that sort of question, and that is of course for SO to decide.
I suppose the reason many people come to SO to read questions is that they'd like to learn about a subject area, and the reason many people come to write answers is that they'd like to teach a little about a subject, and this sort of open-ended question offers the opportunity for someone to answer questions the asker didn't even know they had - like should I add an index to my db, if so when? Should I use natural keys? etc. To in effect tell them to unask all those questions they would otherwise have asked in groping their way to familiarity with the subject. It functions as an FAQ for that particular subject, to prevent beginners from making the same mistakes/asking the same questions over and over.
So that sort of question can be very useful for someone starting out, for the kind of person your site targets. Maybe that sort of question belongs on some other site though, a sort of training site rather than a question/answer site, or maybe SO should just expand to encompass that sort of FAQ function?
I'm not so convinced that for this category of question there is a clear line between 'db x breaks when I do y, what should I do?', 'Do I need to use db transactions in db x?', and 'What are the common db mistakes?', and that one sort of question/answer is rational, and another narrative - is the division really that clear? Are there not many many borderline questions which solicit opinion (of someone who knows more about the subject), and yet are useful for others too? Are not many of these smaller 'science-in-the-small' questions actually answerable in many different ways, each of which may be somewhat valid and none of which is actually 'right' in some categorical way?
For example this question, which is equally open-ended, remains open (rightly so I think as it could be a useful discussion :)
http://stackoverflow.com/questions/327199/what-will-we-do-af...
And on an appropriate forum with as many readers as SO/SE you'd likely get well-thought out responses going far into detail regarding purchase cost, habitat maintenance, prevalence of skilled keepers, etc. Someone who actually was deciding whether to add a Gorilla enclosure or a shark aquarium would find it a very enlightening post.
I understand you created SO/SE and want to see it move in a certain direction, but closing and deleting questions you don't like because it threatens your "science-in-the-small" goal is, I think, a terrible way to go about it.
(FTR, I understand Jeff is not personally going around deleting stuff on the site)
> This is a discussion, not a question. The entire text of said "question" is, quite literally, "What are common database development mistakes made by application developers?" If it can have infinite answers, is it really a question?
> http://stackoverflow.com/questions/621884/database-developme....
Yes, it really is a question. I don't mean to come off as sarcastic, but it's got a question mark at the end of it - one that you, yourself, put there.
You're the one that's imposing esoteric semantics and restrictions on this.
> Great post, indeed, but it belongs on your blog.
Except that:
1) no blogs have the visibility and user base that SO has - not even yours or Joel's.
2) a blog post isn't crowdsourced - at least not to the extent that SO is
> One of the biggest misconceptions about Stack Exchange is this idea that discussion is, in and of itself, a net good to the world -- and therefore we are monsters for not allowing discussion. I do not believe this to be true. There is, and will always be, an infinity of discussion. Like Jay Leno once said about Doritos, "type all you want, we'll make more". If something can be had in infinite amounts, what is its value?
Nutpicking and a false dichotomy. Ease up on the defensiveness and try to see it from the point of view of the many people that want/need to know the answer to that question.
For Pete's sake, at least 568 people upvoted the question, and at least 1004 people upvoted the first answer alone.
The community has spoken - they see this as valuable content.
> Stack Exchange supports only the minimal subset of discussion necessary to get practical, useful answers to specific questions. The goal is not discussion, but science-in-the-small. Back up your claims. Show us references. Show us data. Share your specific experiences.
This is an overly narrow, baffling and frustrating definition of "question".
(And the goal may not be discussion, but discussion is a characteristic of most answers. It's a community, after all.)
This is a rant that's needed and one that SO needs to open themselves to receiving. Their stance on this is just wrong.
I'd actually bookmarked several of your posts (among others) because they were so valuable. So it enrages me to no end to click those bookmarks and find that the entire discussion is simply gone.
To recap - a high quality contribution whose value was validated by dozens of individuals (or perhaps even more) was simply deleted.
The problem starts at the top. As you can see from Jeff Atwood's post below (codinghorror), even the founders of Stack Overflow don't understand the value of their own platform to their customers. They had a preconceived notion of what StackOverflow Is and they are going to stick to it, users be damned.
I don't mean to bash them, but if nerds want to see a prime example of why they take orders from non-nerds (whom we like to think of as "less intelligent" than us), this is exhibit "A". And if you want to know why StackExchange will fail everywhere else, this is exhibit "A". No other group of users will put up with that crap.
Which is why the site is hugely successful. If you want discussion then do it here or on Reddit. I don't want the questions I ask about why something doesn't work in jQuery or C# or C++ or whatever drowned out by Ruby Vs Python posts or stuff like the question cletus mentioned above, or "what have I got in my pocket" mysteries.
The brilliance of Stack Overflow is how quickly one can get answers to "specific programming problems" because everyone on the site is focused on answering these types of questions, not participating in discussion and navel gazing.
> an algorithm that predicts whether (and for what reason) a question will be closed.
which raises all sorts of questions as to what the reasons are for which questions [should][1] be closed - this is a grey area, and there is much argument on SE itself over what sort of questions are considered acceptable, to the extent that they have started retroactively disabling lots of content which doesn't fit with an arbitrary set of rules about what a Q&A site should have on it. Without some moderation obviously the site would descend into chaos, but with very heavy handed and arbitrary moderation it will atrophy and the very people creating the quality content they want will leave, leaving them with moderators (robot or not), new users and trolls, and not much else.
Personally I think it's a huge mistake for SE to start banning questions on the basis of them being not constructive. All their other categories of problem questions make sense, but trying to ban questions that are too controversial or involve opinions is IMHO unwise - that's exactly the sort of question which leads to engaging content on SO, even if some of it verges on a troll. This has been the problematic area for them and has led to them marking lots of useful posts as not constructive even though they clearly are constructive and informative, just because they fall on the wrong side of a line decided retrospectively to declare certain questions unquestions, and others valuable.
It will be very interesting to see if any of their robot moderators are useful in delineating this more problematic area of questions which are controversial or involve opinions - potentially that is every question/answer set more complex than 'what is 2 + 2?', and that line can vary dramatically depending on the moderator, and their opinions.
So I think this raises an interesting (though perennial) question about how heavy handed community moderation should be; in some ways related to those questions raised recently about quality and moderation on HN.
But seriously, Stack Exchange thrives because of its focus (which is maintained in large part by its culture of moderation, though only a small part of that is done by actual "moderators"; the community self-polices pretty well).
This contest is aimed at catching posts that would be closed before they're posted. Ideally we'll get a classifier that lets us give guidance to the askers. Better education of new/confused/troublesome users, leading to better posts, and fewer closed questions overall.
(Disclaimer: SE Inc. employee, yadda yadda yadda)
But it's OK because they say "we're making the Internet a better place," :) That's kind of an annoying trope at this point too but that's for another discussion.
  probabilityOfClosing = (question) ->
    text = question.text.toLowerCase()
    return (text.length / (text.indexOf('jquery') + 2)) / 100

Also, the result is not bound to 0..1
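If you wanted the joke heuristic to behave like an actual probability, one option is to clamp it. Here is a rough Python port that does this; the clamping and the function name are my own additions, not part of the snippet above. Python's `str.find` returns -1 when the substring is missing, same as `indexOf`.

```python
def probability_of_closing(text):
    """Joke heuristic: longer posts that mention jQuery early score higher."""
    lowered = text.lower()
    # find() gives -1 when 'jquery' is absent, so the divisor is at least 1.
    raw = (len(lowered) / (lowered.find("jquery") + 2)) / 100
    # Clamp so the result stays within 0..1 (assumed fix, not in the original).
    return min(1.0, max(0.0, raw))
```

Without the clamp, any post longer than 100 characters that never mentions jQuery would score above 1.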
I just hope the winning entry will prompt the developers to remove that stupid filter[1] that prevents you from referring to the Halting Problem in question titles.
[1]: http://meta.stackoverflow.com/questions/107989/using-the-wor...