Foursquare uses it and I've found their scores to be way more useful than Yelp's.
The biggest problem with star ratings is that they're so arbitrary. What is the difference between a 3 and a 3.5? A 1 vs. a 2? 3/5 is 60%, which is almost failing on a grading scale; if I scored something as a 3/5, I would never use that product or service again. Yet many of the best restaurants are rated 3/5 on Yelp.
Unless the user has some scoring system in place for different qualities of the product or service, there is no way you can get anything resembling an accurate score.
I would never trust a user to accurately assess a score given 10 different options (0.5-5), but I would be way more likely to trust a user to say either "I like this product" or "I do not like this product."
Yes, the Wirecutter approach works great, but it just doesn't scale.
If a place has more 5-star ratings than 4-star ratings, it's generally amazing. If it has more 4-star ratings than 5-star ratings, it's generally fine but not something particularly special.
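That heuristic is simple enough to sketch (hypothetical data, not anything Yelp actually exposes):

    from collections import Counter

    def verdict(ratings):
        # Compare the 5-star and 4-star counts, per the heuristic above.
        counts = Counter(ratings)
        return "amazing" if counts[5] > counts[4] else "fine, not special"

    print(verdict([5, 5, 5, 4, 4, 3]))  # -> amazing
    print(verdict([4, 4, 5, 4, 5, 3]))  # -> fine, not special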
Just thumbs up/down would eliminate what is, to me, the single most useful aspect of Yelp.
It doesn't matter that star ratings are arbitrary -- when you average enough of them, a clear signal emerges from the noise. You can distrust any given user while still trusting the aggregate.
(Curiously enough, I don't find any equivalent value on Amazon. On Yelp, you're really evaluating an overall experience along a whole set of dimensions, so there's a lot more to discriminate on. On Amazon, it does seem to be more of a binary evaluation -- does the product work reliably or not?)
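To make the averaging point concrete, here's a quick hypothetical simulation: each individual rating is noisy, but the mean settles near the underlying quality as the count grows.

    import random

    random.seed(0)

    def noisy_rating(true_quality):
        # An arbitrary rater: true quality plus heavy personal noise,
        # clamped to the 1-5 star range.
        return min(5, max(1, round(true_quality + random.gauss(0, 1.5))))

    true_quality = 4.2
    for n in (5, 50, 5000):
        avg = sum(noisy_rating(true_quality) for _ in range(n)) / n
        print(n, round(avg, 2))
    # A handful of ratings bounces around; thousands settle near 4.2
    # (modulo a small bias from clamping to the 1-5 range).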
It ensures votes hold equal weight and that "extreme polar" voters don't skew things. It also avoids the opposite problem, where everything gets a neutral vote unless it's horrible or incredible.
RT also handles highbrow and lowbrow well. You get less voting of "eh, I didn't love it, but it's sophisticated so I'll give it an extra star."
I'm sold on simple up/down.
So RT is good at predicting "should I watch this movie I haven't seen before," but bad at predicting more sophisticated habits or preferences. I wouldn't buy the Blu-ray off an RT prediction, but I would rent.
So it becomes a question of what you're trying to accomplish. For some problems, up/down is a good solution; for others it isn't.
Really, this comes down to how limited one-dimensional comparisons are: they only measure popularity, which is a terrible filter for quality.
Recently, however, it seems like more (imo undeserving) movies that are "just ok" - like decent, but nothing special, romantic comedies and big blockbusters - are scoring above 90%. I might be being curmudgeonly about it, but I've nearly stopped checking it because it feels like there's no information there. My theory is that this started happening once Roger Ebert died... without such a leader in the field, no one is willing to say they didn't like a film unless it's obviously very bad.
Amassing a bunch of 4- and 5-star ratings is easy, but leaving nothing for even the most habitual of complainers to complain about? That's a monumental achievement.
A humorous take: https://xkcd.com/1098/
I've been really interested in the idea of emotive reviews as an alternative to single-dimensional scores. The best idea I have at the moment is something akin to emoji reactions like you see on GitHub issues; finding a way to encode feelings relevant to product reviews in a mechanism like that seems really intriguing to me.
(thumbs up) I liked this
(heart) I loved this
(thumbs down) I didn’t like this
(smiling face) This made me happy or satisfied
(frowning face) This made me sad or disappointed
(surprised face) This made me surprised or impressed
(angry face) This made me angry or frustrated
Of course, it gets complicated. Did Sam U. Zerr give that product an (angry face) because they used it and didn't like it, or because they're offended that you would recommend it, or what?

If you're only using icons to make recommendations to an individual user based on their own history, maybe you don't need to infer the actual meanings; you can add all sorts of icons without any particular meaning and just make recommendations by correlation:
(thinking face) I’m considering this / I’m confused by or dubious of this
(gear) This was useful / this made me think
(fire) This album was great / this sauce was spicy
(heart eyes) I really want this / this is adorable
...
E.g. a recommendation for me might be "(thumbs up)(gear)(heart eyes)" because some product or content is similar, by some hidden metrics, to other things that I've reacted to in those ways.

Just brainstorming here. There are obviously many possible approaches in this space.
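A minimal sketch of that correlation idea, with entirely hypothetical items and reaction counts: treat each item's reactions as a profile vector and rank candidates by cosine similarity to something the user reacted to, without ever assigning the icons a meaning.

    from math import sqrt

    # Hypothetical reaction counts per item: icon -> count
    items = {
        "sauce_a": {"fire": 40, "thumbs_up": 25, "frown": 2},
        "sauce_b": {"fire": 35, "thumbs_up": 30, "frown": 5},
        "album_c": {"heart_eyes": 50, "thumbs_up": 10, "frown": 1},
    }

    def cosine(a, b):
        dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    liked = "sauce_a"  # something the user reacted to with (fire)
    ranked = sorted(
        (name for name in items if name != liked),
        key=lambda name: cosine(items[liked], items[name]),
        reverse=True,
    )
    print(ranked)  # ['sauce_b', 'album_c']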
Or, one could just let users tag the subject and the interface would display the "weights" of the tags.
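That could be as simple as counting tags and normalizing (a hypothetical sketch):

    from collections import Counter

    # Hypothetical free-form tags users left on one restaurant
    tags = ["cozy", "spicy", "cozy", "slow service", "cozy", "spicy"]

    counts = Counter(tags)
    total = sum(counts.values())
    for tag, n in counts.most_common():
        print(f"{tag}: {n / total:.0%}")
    # cozy: 50%, spicy: 33%, slow service: 17%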
Strangely, it gets even harder with the thumbs down: there are vanishingly few things I actively wish didn't exist. Why downvote at all?
School grading systems serve a completely different purpose and are a terrible comparison.
If any given option only gets a small handful of votes, then you might see a strong bias (favourable or otherwise) where neutral would be appropriate.
In Likert scale design (where there are more than two favourability options), there's a strong debate over even versus odd numbers of choices -- should someone be able to give a "meh" rating, or do you want to force a positive or negative lean, even if slight?
Hence the (typically) 3-, 4-, 5-, 6-, and 7-point scales.
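For illustration (hypothetical labels, not from any survey standard): odd scales keep a neutral midpoint, while even scales remove it and force a lean.

    ODD_5 = ["strongly dislike", "dislike", "neutral", "like", "strongly like"]
    EVEN_4 = ["strongly dislike", "dislike", "like", "strongly like"]

    def has_neutral(scale):
        # Odd-length scales have a true midpoint; even-length ones don't.
        return len(scale) % 2 == 1

    print(has_neutral(ODD_5))   # True: a "meh" answer is allowed
    print(has_neutral(EVEN_4))  # False: respondents must lean one way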