Analyzing Hacker News Users’ Join Dates, Karma, and Profiles

breckyunits.com

36 points | kradic | 18y ago | 19 comments

byrneseyeview 18y ago

Linear fits? Why? Why?

Karma is a function of time since joining, participation, and quality of contributions, and it starts at 1. 'Participation' can be measured as contributions per unit of time. Quality is the average score of each submission; separating it from participation makes it easy to extend this to a more complicated model that accounts for people who stop using HN. So the line of best fit should be something closer to 1 + t * q * p, or, to describe folks who are off-and-on contributors, the sum 1 + (t0 * q * p0) + ... + (tn * q * pn).
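The model in this comment can be written out as a small function. This is only a sketch of the formula as stated: the function name, the parameter names, and the two example users are hypothetical, and nothing here is fitted to the actual dataset.

```python
def predicted_karma(quality, periods, base=1.0):
    """Return base + sum(t_i * quality * p_i) over a user's activity periods.

    quality: average score per contribution (q in the comment)
    periods: iterable of (duration, contributions_per_unit_time) pairs (t_i, p_i)
    base:    starting karma, 1 according to the comment
    """
    return base + sum(t * quality * p for t, p in periods)


# Two made-up users with the same average quality.
steady = predicted_karma(2.0, [(10, 3.0)])               # active for all 10 time units
off_and_on = predicted_karma(2.0, [(4, 3.0), (6, 0.0)])  # active for 4, silent for 6

print(steady, off_and_on)  # 61.0 25.0
```

Both users joined 10 time units ago, yet their predicted karma differs by more than a factor of two, which is the reason a single straight line through join date alone fits poorly.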

timr 18y ago

If nothing else, it's legitimate to filter non-participants before doing a regression analysis on the rest. It's technically true that you can't predict karma over time, but that's not really an interesting statement until you eliminate the large number of people who sign up, then never post or comment.

My instinct is that once you filter these people, you'll see a much stronger linear relationship between time and karma, since karma isn't normalized by the number of contributions, and the number of contributions is probably a Poisson process.
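A rough way to check this instinct is to compare the correlation between account age and karma before and after dropping non-participants. The sketch below assumes that "karma still at 1" is a usable proxy for "never posted or commented", and the `users` list is invented for illustration; it is not the dataset from the article.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up (days_since_joining, karma) pairs: four active users and
# four accounts that signed up and never earned any karma.
users = [(10, 21), (20, 41), (30, 61), (40, 81),
         (15, 1), (25, 1), (35, 1), (45, 1)]

# Assumption: karma of 1 means the account never participated.
active = [(d, k) for d, k in users if k > 1]

r_all = pearson_r(*zip(*users))
r_active = pearson_r(*zip(*active))
print(f"all users: r = {r_all:.3f}, active only: r = {r_active:.3f}")
```

In this toy data the active users lie exactly on a line, so filtering pushes the correlation to 1; real data would be noisier.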

breck 18y ago

Removing all 1's and 2's improves the relationship somewhat (more so with log(k)) but still not a whole lot.
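The comparison described here (drop karma values of 1 and 2, then fit against both k and log(k)) could look roughly like the following. The `fit_line` helper and the sample data are assumptions made for illustration; the author used JMP, and the numbers below say nothing about the real dataset.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    m = sxy / sxx
    c = my - m * mx
    return m, c, (sxy * sxy) / (sxx * syy)


# Made-up (days_since_joining, karma) pairs.
users = [(5, 1), (12, 2), (20, 4), (40, 9), (60, 30),
         (80, 75), (100, 260), (30, 1), (90, 2)]

# Remove all 1's and 2's, as in the comment.
kept = [(d, k) for d, k in users if k > 2]
days = [d for d, _ in kept]
karma = [k for _, k in kept]

_, _, r2_raw = fit_line(days, karma)
_, _, r2_log = fit_line(days, [math.log(k) for k in karma])
print(f"R^2 with k: {r2_raw:.3f}, R^2 with log(k): {r2_log:.3f}")
```

The toy karma values grow roughly exponentially, which is why the log(k) fit looks better here. How much it helps on real data depends on how skewed the karma distribution is.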

DougBTX 18y ago

> And it starts at 1.

That's just a constant offset; it matters more whether the next karma level is 2 or 10. When fitting trend lines, a "linear fit" would normally satisfy y = mx + c, without limiting yourself to c = 0. Note that the posted linear fit has c = -1... apparently everyone starts with -1 karma. Lies, damned lies, and statistics, I say!
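The difference between a fit with a free intercept (y = mx + c) and one forced through the origin (c = 0) can be seen on a tiny example. Both helper functions and the data are hypothetical, chosen so that the points lie exactly on y = 2x + 1.

```python
def fit_with_intercept(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx


def fit_through_origin(xs, ys):
    """Least-squares fit of y = m*x with the intercept fixed at 0; returns m."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # exactly y = 2x + 1, i.e. everyone "starts at 1"

m, c = fit_with_intercept(xs, ys)  # recovers m = 2.0, c = 1.0
m0 = fit_through_origin(xs, ys)    # 34 / 14, about 2.43: slope absorbs the offset
print(m, c, m0)
```

With a free intercept the starting value of 1 shows up as c and leaves the slope alone. Forcing c = 0 distorts the slope instead.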

breck 18y ago

Agreed. If I had data for the quality of contributions I could have done that.

DougBTX 18y ago

Could you post "General Composition of the Dataset" with a logarithmic vertical scale please? Should help compensate for the outliers so we can see more detail at the bottom of the graph.

breck 18y ago

Done.
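Why a logarithmic vertical scale helps with outliers can be shown with a text-only chart. The bucket labels and counts below are invented, not taken from the "General Composition of the Dataset" plot, and `bar_lengths` is a hypothetical helper.

```python
import math


def bar_lengths(count):
    """Return (linear_length, log_length) for one bar of a text chart.

    Linear scale: one mark per 100 users.
    Log scale: ten marks per factor of 10.
    """
    return round(count / 100), round(10 * math.log10(count))


# Made-up number of users in each karma bucket.
buckets = {"1": 5000, "2-10": 900, "11-100": 200, "101-1000": 40, "1000+": 3}

for label, count in buckets.items():
    linear, log = bar_lengths(count)
    print(f"{label:>8} | linear: {'#' * linear:<50} | log: {'#' * log}")
```

On the linear scale the two smallest buckets round to nothing next to the 5000-user bucket, while on the log scale every bucket stays visible.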

edw519 18y ago

"I didn’t really expect to find a whole lot of interesting things, and found what I expected."

Which is a great way to conduct research! Nice work.

This reminded me of my senior project in number theory, when I manipulated a large data set, wondering what I'd find. Eventually, I found quite a bit.

Also reminded me of this quote by Wernher von Braun:

"Basic research is what I am doing when I don't know what I'm doing."

xirium 18y ago

You may want to take into account differences in file timestamps because the data was collated over many days.

breck 18y ago

Ahh, I didn't notice that. That would affect things. The timestamps and counts are: ('04/15', 630), ('04/16', 1993), ('04/17', 1994), ('04/18', 491), ('04/20', 270), ('04/21', 1049), ('04/22', 59), ('04/24', 33), ('04/25', 342), ('04/26', 29), ('05/03', 165), ('05/06', 86), ('05/07', 23). So most of the members have older join dates than I figured.
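One way to account for the staggered collection dates is to compute how many days each batch of profiles lags the most recent snapshot, so that karma and account age can be compared as of a common date. The counts below are the ones quoted in the comment. `SNAPSHOT_YEAR` is a placeholder, because the thread does not state the year, and any year gives the same April-to-May differences.

```python
from datetime import date

SNAPSHOT_YEAR = 2008  # assumption: placeholder year, not given in the thread

# (file timestamp as MM/DD, number of profiles), as listed in the comment.
snapshots = [('04/15', 630), ('04/16', 1993), ('04/17', 1994), ('04/18', 491),
             ('04/20', 270), ('04/21', 1049), ('04/22', 59), ('04/24', 33),
             ('04/25', 342), ('04/26', 29), ('05/03', 165), ('05/06', 86),
             ('05/07', 23)]


def parse(mmdd, year=SNAPSHOT_YEAR):
    """Turn an 'MM/DD' string into a date in the assumed year."""
    month, day = map(int, mmdd.split('/'))
    return date(year, month, day)


total = sum(n for _, n in snapshots)
latest = max(parse(d) for d, _ in snapshots)

# Days by which each batch is older than the most recent snapshot.
lag_days = {d: (latest - parse(d)).days for d, _ in snapshots}

# Average lag, weighted by how many profiles each batch contains.
mean_lag = sum(lag_days[d] * n for d, n in snapshots) / total

print(f"{total} profiles, mean lag {mean_lag:.1f} days behind {latest}")
```

Profiles fetched on 04/15 are 22 days staler than those fetched on 05/07, so active users in the early batches would have had their karma undercounted relative to the late ones.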

wallflower 18y ago

The interesting thing about karma... I find that I can't keep posting insightful comments day after day, or I get tired of it, so I take breaks and lurk. edw519, I don't know how you do it. I probably won't make the "leaders" (but I don't think it's important).

thaumaturgy 18y ago

> (but I don't think it's important)

I think that's the crux of it. Somebody could monitor all the various news sites and spend an hour a day here posting comments and stories and so forth, but I suspect most folks would rather spend that time doing something else.

That said, edw519 is a pretty cool guy.

nostrademons 18y ago

I tend to post in bursts, so most of my karma comes over periods of a week or two when I'm posting several times a day. It's actually a bad sign, because it means I'm not working on my startup. ;-) Then there are periods where I'll post like twice a week and my RescueTime log'll show that I'm spending like 80-90% of my computer time coding. So yeah, it's a tradeoff, and ultimately the code is more important, but I find that I burn out if I spend too much time coding.

pierrefar 18y ago

Nice work. I love manipulating large data sets :)

Which program did you use to produce the plots?

breck 18y ago

JMP

mooneater 18y ago

Scatterplots are too dense!

omfut 18y ago

Just curious, how do karma points work?
