> Even if the model gets extremely good at predicting final_score_if_it_hits_front_page, there’s still the inherent randomness of probability_of_hitting_front_page that is fundamentally unpredictable.
In addition to date, you might want to include three fields:
- day of week (categorical)
- is weekend/holiday (boolean)
- hour or time of the day (categorical, you can have 24 of them or morning/afternoon/etc.).
The probability of a post hitting the front page is usually affected by these things so it can really help the model.
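A minimal sketch of deriving those fields from a submission timestamp. The function name, the feature names, and the part-of-day boundaries are my own assumptions, and holidays are left out because they would need a separate calendar:

```python
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_features(ts: datetime) -> dict:
    """Derive categorical time features from a submission timestamp."""
    weekday = ts.weekday()  # Monday=0 ... Sunday=6
    hour = ts.hour
    # Part-of-day boundaries are arbitrary illustrative choices.
    if 5 <= hour < 12:
        part = "morning"
    elif 12 <= hour < 17:
        part = "afternoon"
    elif 17 <= hour < 22:
        part = "evening"
    else:
        part = "night"
    return {
        "day_of_week": DAY_NAMES[weekday],
        "is_weekend": weekday >= 5,  # holidays are not handled here
        "hour": hour,
        "part_of_day": part,
    }

example = time_features(datetime(2024, 11, 2, 14, 30, tzinfo=timezone.utc))
```

Whether you feed the model the raw hour (24 categories) or the coarser part of day is a judgment call; the coarser buckets need less data per category.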
It's counterintuitive, but if you post at a really popular time, you're competing with a lot of other submissions. If you post at a really slow time, you'll get fewer votes, but it will take fewer to reach the front page and you'll have less competition.
In the end, it kinda evens out. The number of votes it takes to get to the front page and the number of competing submissions are both correlated to your fields above.
Somehow this reminded me of someone datamining spiegel.de (German news site) and using the timestamps of the posted articles to extrapolate the writers' religion (holidays) and relationships (shared vacations), among dozens of other data points, from several years of publicly available data. I think no AI was involved back then.
https://media.ccc.de/v/33c3-7912-spiegelmining_reverse_engin...
This has been studied multiple times on HN posts, but most of the analyses seem to have link-rotted. Look them up on the Web Archive if you want the insights - https://hn.algolia.com/?q=best+time+to+post
I generally find these posts pretty boring, and most comments on them are people recounting their own stories about how that (or a similar) service screwed them over. I suppose they can be a decent way to warn people off of a particular product (scammy, terrible customer support, whatever), but that's not what I come to HN for.
Model correlation is decent here but there's certainly more to do to use its outputs predictively.
My point is I don't think people seek out outrage. Social media's algorithms may not explicitly reward it as transparently as `if (post.outrage > 100) post.boost()`, but outrage isn't some default rule of interaction.
Give people the way to repost / retweet / boost, and your feed suddenly turns into mostly negativity, even if your algorithm is "show posts from my followers only, newest to oldest"
See also https://en.wikipedia.org/wiki/Negativity_bias
We're just built like that.
Regarding text platforms suffering more than non-text platforms, I think it's because of the lack of social cues that are otherwise there. You can infer a lot from the way someone talks, or from their body language. You can't infer much from text, which is partly why Poe's law exists -- sarcasm doesn't translate well.
I'm sure it's just human psyche but I'm trying to overcome it and make my life more positive again
https://scikit-learn.org/dev/modules/generated/sklearn.isoto...
I also agree with your intuition that if your output is censored at 0, with a large mass there, it's good to create two models, one for likelihood of zero karma, and another expected karma, conditional on it being non-zero.
> it's good to create two models, one for likelihood of zero karma, and another expected karma, conditional on it being non-zero.
Another way to do this is to keep a single model but have it predict two outputs: (1) likelihood of zero karma, and (2) expected karma if non-zero. This would require writing a custom loss function which sounds intimidating but actually isn't too bad.
If I were actually putting a model like this into production at HN I'd likely try modeling the problem in that way.
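For what it's worth, here is a rough sketch of what such a custom loss could look like for a single example. Everything in it is an assumption on my part: the function names, the choice of binary cross-entropy for the zero/non-zero part, and squared error on log karma for the non-zero part.

```python
import math

def hurdle_loss(p_zero: float, log_karma_pred: float, karma: float) -> float:
    """Loss for one example given the model's two outputs.

    p_zero: predicted probability that the post gets zero karma.
    log_karma_pred: predicted log(karma); only scored when karma > 0.
    """
    eps = 1e-7
    p = min(max(p_zero, eps), 1 - eps)  # clamp to avoid log(0)
    if karma == 0:
        # Only the classification output is penalized for zero-karma posts.
        return -math.log(p)
    # Non-zero karma: classification term plus regression term on the log scale.
    return -math.log(1 - p) + (log_karma_pred - math.log(karma)) ** 2

def mean_hurdle_loss(preds, karmas):
    """Average the per-example loss over (p_zero, log_karma_pred) pairs."""
    losses = [hurdle_loss(p, lk, k) for (p, lk), k in zip(preds, karmas)]
    return sum(losses) / len(losses)
```

The key design point is that the regression output receives no gradient signal from zero-karma examples, which is what makes this equivalent in spirit to the two-model setup.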
The reason I think of this as censoring is that there are some classical statistical models that model a distribution with a large mass at a minimum threshold, e.g. "tobit" censored regression.
(Fully dictated, no edits except for this)
* 1 had a score that was reasonably close (8.4%) to what the model predicted
* 4 had scores wildly lower than the model predicted
* 2 had scores wildly higher than the model predicted
* the remaining 3 were not wildly off, but weren't really that close either (25%-42% off)
Then there's a list of 10 submissions that the model predicted would have scores ranging from 33 to 135, but they all only received a score of 1 in reality.
The graph shown paints a bit of a better picture, I guess, but it's still not all that compelling to me.
Broadly, the main use case for this model (in the RL context) will be to take two different versions of the same post, and predict which of the two is more likely to be upvoted. So what matters isn't that it gets the exact number of upvotes correctly, but that it correctly predicts the relative difference in likely upvote count between two variants.
Now it still doesn't do a great job at that (the correlation is only 0.53 after all) but it still does a good enough job to provide some useful signal.
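One way to measure that "which of the two is better" ability directly is pairwise ordering accuracy. This is a sketch with made-up numbers, not how the post evaluated the model; `pairwise_accuracy` is a name I picked for illustration:

```python
from itertools import combinations

def pairwise_accuracy(predicted, actual):
    """Fraction of pairs with different actual scores that the model orders correctly."""
    correct = total = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # no ground-truth ordering for tied pairs
        total += 1
        # Same sign of the differences means the pair is ranked correctly.
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")

# Made-up scores, purely illustrative.
predicted = [40, 12, 90, 5]
actual = [10, 3, 200, 8]
accuracy = pairwise_accuracy(predicted, actual)  # 5 of 6 pairs ordered correctly
```

A model can be far off on absolute scores, as in this toy data, and still rank most pairs correctly.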
But the number of comments depends on the time posted more than the story itself and that information isn't in the model.
Did you ever figure out what happened in 2016?
It’s still outside the HN mainstream to use both in the same submission, so that might be biasing the model in strange ways.
> The correlation is actually not bad (0.53), but our model is very consistently over-estimating the score at the low end, and underestimating it at the high end. This is surprising; some variation on any given data point is expected, but such a consistent mis-estimation trend isn’t what we’d expect.
This is a consequence of the model objective. If you don't know what is really happening, a good way of reducing the overall error is to do exactly that: pull your predictions toward the middle. If you instead try to predict the very highs and very lows exactly, you will get very large errors on the ones you miss, resulting in a bigger overall error.
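A small simulation shows the effect. The setup is entirely assumed: the "signal" is whatever the model can see, the noise is luck it cannot see, and the noise level is chosen so that the correlation between prediction and outcome comes out around 0.55.

```python
import random
import statistics

random.seed(0)
n = 20000
signal = [random.gauss(0, 1) for _ in range(n)]      # part of the score the model can see
actual = [s + random.gauss(0, 1.5) for s in signal]  # plus luck the model cannot see
predicted = signal  # the squared-error-optimal prediction here is E[actual | signal] = signal

# Look at the bottom and top 10% of posts by actual outcome.
order = sorted(range(n), key=lambda i: actual[i])
bottom, top = order[: n // 10], order[-(n // 10):]

def mean_gap(indices):
    """Mean prediction minus mean actual outcome over the given posts."""
    return (statistics.fmean([predicted[i] for i in indices])
            - statistics.fmean([actual[i] for i in indices]))

low_gap = mean_gap(bottom)   # positive: over-estimates at the low end
high_gap = mean_gap(top)     # negative: under-estimates at the high end
```

Even though the predictor here is the best possible one for this toy setup, it looks "too high" on the worst posts and "too low" on the best ones, simply because it cannot see the luck component.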
Apart from that, I want to comment on AI alignment here. For me, the objective of "most upvotes" is not fully correlated with where I get the most value on HN. Most of the time, I would have found the most upvoted stories anyway on other platforms. It's the middle range that I really like. So be careful implementing this algorithm at scale; it could turn the website into another platform with shitty AI recommendations.
Yes, this is a fantastic point. I'm curious if there's some other measurable proxy metric for "things I get the most value out of on HN"? Upvotes seems like the most natural but optimizing for it too strongly would definitely take HN down a dark path.
In supervised learning, you train on pairs of (x, y) where x is your input (title/post text/metadata) and y is the output score.
Naively, it's a linear regression model: Y = b0 + b1x1 + b2x2 + b3x3, where b0 is your bias ("a floor for score points"), and b1, b2, and b3 are the weights (coefficients) on the features of the post. You can solve this in closed form and find the b0/b1/b2/b3 that minimize the squared error of fitting to Y.
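To make the closed-form part concrete, here is a sketch that solves the normal equations (X^T X) b = X^T y in plain Python. The helper names and the toy data are assumptions for illustration; in practice you would use a linear algebra library instead of hand-rolled elimination.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Return [b0, b1, ..., bk] minimizing squared error via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept b0
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Toy data generated from y = 2 + 1*x1 + 3*x2 - 1*x3 with no noise.
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (2, 1, 3)]
y = [2 + x1 + 3 * x2 - x3 for x1, x2, x3 in rows]
coeffs = fit_ols(rows, y)  # recovers b0, b1, b2, b3 up to rounding
```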
How do these equations change with RL? I always assumed RL was a multi-step process where actions are taken to get to a reward. If there is only 1 step/decision, to produce a "random" score, it feels much like supervised learning.
Such a model can be used as the "reward model" for the "reinforcement learning from human feedback" (RLHF) method.
If the reward model is indeed smart enough to be able to take that into account you could actually use it to plan the optimal time of day to post a specific story! You could just use the reward model to compute a predicted score for 8 different versions of your content, holding the post title/text constant across them all and just changing the date. Based on the differences in scores, you can determine which posting time the RM thinks is most likely to make your post successful!
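In code, that search is just a loop over candidate times. This is a sketch under assumptions: `predict_score` stands in for whatever interface the reward model exposes, and `fake_reward_model` is a toy placeholder so the example runs, not the actual model.

```python
from datetime import datetime, timedelta

def best_posting_time(predict_score, title, text, start, slots=8, step_hours=3):
    """Score the same post at several candidate times; return (best_score, best_time)."""
    candidates = [start + timedelta(hours=step_hours * i) for i in range(slots)]
    scored = [(predict_score(title, text, t), t) for t in candidates]
    return max(scored, key=lambda pair: pair[0])

def fake_reward_model(title, text, posted_at):
    """Hypothetical stand-in for the reward model: peaks at 15:00, purely illustrative."""
    return 10.0 - abs(posted_at.hour - 15)

best_score, best_time = best_posting_time(
    fake_reward_model,
    title="Show HN: my project",
    text="",
    start=datetime(2024, 1, 1, 0, 0),
)
```

Title and text are held constant, so any difference in predicted score comes only from the timestamp.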
You see this on Reddit pretty commonly.
Someone posts original content at an off time and gets a small/moderate number of upvotes. Then some time later (could be hours, days, or weeks) a bot/karma account will post the content at an optimal time to farm upvotes.
Everything else in the model before that final layer is exactly identical, architecture-wise.
In the case of a reward model, are you streaming in the list of tokens; if so, what is the output after each token? Or are you feeding in all of the tokens in one shot, with the predicted reward as the output?
What is your take on this?
And all the graphs for the blog are from this notebook: https://github.com/OpenPipe/best-hn/blob/main/blog-figures.i...
Lots of other good stuff in that repo, although it's only organized to a "working researcher" standard I'm afraid.
Based on the later analysis in the post (which I agree with), the total score of a story is disproportionately tied to whether it hits the front page, and of course how long it stays there. Regardless of the quality of the average post starting in 2015, the sheer quantity would make it impossible for all but a few to stay on the front page for very long. Hacker News got more popular, so each story got less prime time.
You would do better to leave out dates and authors.
Do you really want the model to home in on dates & authors? If you just trained on those would it create anything useful?
It can’t for dates, since it isn’t getting any future date examples to prepare for future dates. I suppose you could argue that month & day matter. But surely that would be a much lower quality discriminator than forcing the model to stay focused on title & content.
Similarly with author. You can find out which authors produce content with the most upvotes with a simple calculation.
But again, is that the discriminator you want the model to use? Or the title & content? Because it will use the easiest discriminator it can.
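The "simple calculation" for authors could be as small as a group-by on mean score. A sketch with made-up author names and scores:

```python
from collections import defaultdict

def mean_score_by_author(posts):
    """Return (author, mean score) pairs sorted from highest to lowest mean."""
    totals = defaultdict(lambda: [0, 0])  # author -> [score sum, post count]
    for author, score in posts:
        totals[author][0] += score
        totals[author][1] += 1
    means = {a: s / n for a, (s, n) in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Made-up data, purely illustrative.
posts = [("alice", 120), ("bob", 3), ("alice", 40), ("carol", 15), ("bob", 5)]
ranking = mean_score_by_author(posts)
```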
Maybe the reputation of the poster is also a factor?
Well, thanks HN, you were good while it lasted...
this is dangerous talk.
it doesn't understand anything at all.
Reminder: We are more prone to anthropomorphizing LLMs than to humanizing suffering humans.