Foursquare uses it and I've found their scores to be way more useful than Yelp's.
The biggest problem with star ratings is that they're so arbitrary. What is the difference between a 3 and a 3.5? A 1 vs. a 2? 3/5 is 60%, which is almost failing on a grading scale; if I scored something as a 3/5, I would never use that product or service again. Yet many of the best restaurants are rated 3/5 on Yelp.
Unless the user has some scoring system in place for different qualities of the product or service, there is no way you can get anything resembling an accurate score.
I would never trust a user to accurately assess a score given 10 different options (0.5-5), but I would be way more likely to trust a user to say either "I like this product" or "I do not like this product."
Yes, the Wirecutter approach works great, but it just doesn't scale.
If a place has more 5-star ratings than 4-star ratings, it's generally amazing. If it has more 4-star ratings than 5-star ratings, it's generally fine but not something particularly special.
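That heuristic is simple enough to sketch (hypothetical data, not anything Yelp actually exposes):

    from collections import Counter

    def verdict(ratings):
        # Compare the 5-star and 4-star counts, per the heuristic above.
        counts = Counter(ratings)
        return "amazing" if counts[5] > counts[4] else "fine, not special"

    print(verdict([5, 5, 5, 4, 4, 3]))  # -> amazing
    print(verdict([4, 4, 5, 4, 5, 3]))  # -> fine, not special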
Just thumbs up/down would eliminate what is, to me, the single most useful aspect of Yelp.
It doesn't matter that star ratings are arbitrary -- when you average enough of them, a clear signal emerges from the noise. You can distrust any given user while still trusting the aggregate.
(Curiously enough, I don't find any equivalent value on Amazon. On Yelp, you're really evaluating an overall experience along a whole set of dimensions, so there's a lot more to discriminate on. On Amazon, it does seem to be more of a binary evaluation -- does the product work reliably or not?)
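To make the averaging point concrete, here's a quick hypothetical simulation: each individual rating is noisy, but the mean settles near the underlying quality as the count grows.

    import random

    random.seed(0)

    def noisy_rating(true_quality):
        # An arbitrary rater: true quality plus heavy personal noise,
        # clamped to the 1-5 star range.
        return min(5, max(1, round(true_quality + random.gauss(0, 1.5))))

    true_quality = 4.2
    for n in (5, 50, 5000):
        avg = sum(noisy_rating(true_quality) for _ in range(n)) / n
        print(n, round(avg, 2))
    # A handful of ratings bounces around; thousands settle near 4.2
    # (modulo a small bias from clamping to the 1-5 range).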
It ensures votes hold equal weight and that "extreme polar" voters don't skew things. It also avoids the opposite problem, where everything gets a neutral vote unless it's horrible or incredible.
RT also handles highbrow and lowbrow well. You get less voting of "eh, I didn't love it, but it's sophisticated so I'll give it an extra star."
I'm sold on simple up/down.
So RT is good at predicting "should I watch this movie I haven't seen before," but bad at predicting more sophisticated habits or preferences. I wouldn't buy the Blu-ray off an RT prediction, but I would rent.
So it becomes a question of what you're trying to accomplish. For some problems, up/down is a good solution; for others it isn't.
Really, this comes down to how limited one-dimensional comparisons are: they only measure popularity, which is a terrible filter for quality.
Recently, however, it seems like more (imo undeserving) movies that are "just ok" - like decent, but nothing special, romantic comedies and big blockbusters - are scoring above 90%. I might be being curmudgeonly about it, but I've nearly stopped checking it because it feels like there's no information there. My theory is that this started happening once Roger Ebert died... without such a leader in the field, no one is willing to say they didn't like a film unless it's obviously very bad.
Amassing a bunch of 4- and 5-star ratings is easy, but leaving nothing for even the most habitual of complainers to complain about? That's a monumental achievement.
A humorous take: https://xkcd.com/1098/
I've been really interested in the idea of emotive reviews as an alternative to single-dimensional scores. The best idea I have at the moment is something akin to emoji reactions like you see on GitHub issues; finding a way to encode feelings relevant to product reviews in a mechanism like that seems really intriguing to me.
(thumbs up) I liked this
(heart) I loved this
(thumbs down) I didn’t like this
(smiling face) This made me happy or satisfied
(frowning face) This made me sad or disappointed
(surprised face) This made me surprised or impressed
(angry face) This made me angry or frustrated
Of course, it gets complicated. Did Sam U. Zerr give that product an (angry face) because they used it and didn't like it, or because they're offended that you would recommend it, or what?

If you're only using icons to make recommendations to an individual user based on their own history, maybe you don't need to infer the actual meanings; you can add all sorts of icons without any particular meaning and just make recommendations by correlation:
(thinking face) I’m considering this / I’m confused by or dubious of this
(gear) This was useful / this made me think
(fire) This album was great / this sauce was spicy
(heart eyes) I really want this / this is adorable
...
E.g. a recommendation for me might be "(thumbs up)(gear)(heart eyes)" because some product or content is similar, by some hidden metrics, to other things that I've reacted to in those ways.

Just brainstorming here. There are obviously many possible approaches in this space.
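A minimal sketch of that correlation idea, with entirely hypothetical items and reaction counts: treat each item's reactions as a profile vector and rank candidates by cosine similarity to something the user reacted to, without ever assigning the icons a meaning.

    from math import sqrt

    # Hypothetical reaction counts per item: icon -> count
    items = {
        "sauce_a": {"fire": 40, "thumbs_up": 25, "frown": 2},
        "sauce_b": {"fire": 35, "thumbs_up": 30, "frown": 5},
        "album_c": {"heart_eyes": 50, "thumbs_up": 10, "frown": 1},
    }

    def cosine(a, b):
        dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    liked = "sauce_a"  # something the user reacted to with (fire)
    ranked = sorted(
        (name for name in items if name != liked),
        key=lambda name: cosine(items[liked], items[name]),
        reverse=True,
    )
    print(ranked)  # ['sauce_b', 'album_c']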
Or, one could just let users tag the subject and the interface would display the "weights" of the tags.
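That could be as simple as counting tags and normalizing (a hypothetical sketch):

    from collections import Counter

    # Hypothetical free-form tags users left on one restaurant
    tags = ["cozy", "spicy", "cozy", "slow service", "cozy", "spicy"]

    counts = Counter(tags)
    total = sum(counts.values())
    for tag, n in counts.most_common():
        print(f"{tag}: {n / total:.0%}")
    # cozy: 50%, spicy: 33%, slow service: 17%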
Strangely, it gets even harder with the thumbs down: there are vanishingly few things I actively wish didn't exist. Why downvote at all?
School grading systems serve a completely different purpose and are a terrible comparison.
If any given option only gets a small handful of votes, then you might see a strong bias (favourable or otherwise) where neutral would be appropriate.
In Likert scale design (where there are more than two favourability options), there's a strong debate over even versus odd numbers of choices -- should someone be able to give a "meh" rating, or do you want to force a positive or negative lean, even if slight?
Hence the (typically) 3-, 4-, 5-, 6-, and 7-point scales.
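For illustration (hypothetical labels, not from any survey standard): odd scales keep a neutral midpoint, while even scales remove it and force a lean.

    ODD_5 = ["strongly dislike", "dislike", "neutral", "like", "strongly like"]
    EVEN_4 = ["strongly dislike", "dislike", "like", "strongly like"]

    def has_neutral(scale):
        # Odd-length scales have a true midpoint; even-length ones don't.
        return len(scale) % 2 == 1

    print(has_neutral(ODD_5))   # True: a "meh" answer is allowed
    print(has_neutral(EVEN_4))  # False: respondents must lean one way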