Elo sucks – better multiplayer rating systems for smaller games (2019)

(medium.com)

161 points · brownbat · 5y ago · 155 comments

103 comments · 25 top-level

dvt · 5y ago · 33 in thread

Elo is great for what it was built for: ranking chess players. Chess is (1) extremely low-variance, (2) has an extremely high skill ceiling, and (3) is 1-on-1. Elo works great for chess, but it would never work for something like Poker. Let's briefly go over these three points.

Most games aren't chess -- where the only variance is picking who's black and who's white -- in fact, they might include dozens of RNG mechanics (from critical strikes to ability rolls, to spawn points). These mechanics (while fun and well-designed) might pollute your "idealized" model. There's also the problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics which will also heavily skew win rates. For instance, given a slow combo Magic deck, you will most likely auto-concede to mono red aggro (regardless of skill level). If you're using Elo, this will pollute your model. (Hint: you shouldn't be using Elo.)

Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss-ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.

TrueSkill attempts to fix (3) by using clever Bayesian updating on a player-by-player basis[1] but in reality, it's a shit-show. Using Elo (or variants thereof) for team-based games where the team isn't really a team (more like 3-5 random people plopped together for one match) is incredibly misguided, but continues to be implemented in just about every modern multiplayer game (to the players' frustration). Of course, mixing and matching pre-made groups with non pre-made groups creates as many issues as you might imagine.

In short, why so many game devs are enamored with Elo when it comes to ranking is a bit bizarre.

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...

CWuestefeld · 5y ago

My wife was a champion table tennis player. This sport uses Elo as well, and I know from watching the sport over time that the rating system has real problems. It doesn't suffer from the weaknesses that you cite, but even so, the problem of "rating inflation" is widely discussed.

It seems that much of the problem comes from rating points brought in by newbie players (and note that, contra TFA, the problem isn't with experienced players losing to newbies, but the opposite).

A newbie is started off with some nominal rating; I forget the number, but let's say it's 800. Most likely that newbie is going to lose his first matches, and some proportion of those newbies will get frustrated and quit. For the ones that stay in the game, things probably work out in the long run. But for those that got discouraged and quit, in the course of their loss they caused a few points (not many, because they're likely way overmatched, but definitely more than 0) to be credited to their opponents. When they quit the sport, they're never going to reclaim any of the rating points that they lost initially. But those points are still in the system, having been added to their winning opponents.
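
The leak described above can be sketched in a few lines. This is only an illustration under assumed numbers: the logistic curve with a 400-point scale, K = 32, and the 1200-rated veterans are not from the comment; only the 800 starting rating is.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the usual logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def play(ratings, winner, loser, k=32.0):
    """Zero-sum update: the winner gains exactly what the loser drops."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta
    return delta


# Hypothetical pool: two established players and one newcomer at 800.
ratings = {"veteran_1": 1200.0, "veteran_2": 1200.0, "newbie": 800.0}
active_before = ratings["veteran_1"] + ratings["veteran_2"]

# The newcomer loses twice, then quits and never plays again.
play(ratings, "veteran_1", "newbie")
play(ratings, "veteran_2", "newbie")

# Points held by the players who are still active.
active_after = ratings["veteran_1"] + ratings["veteran_2"]
leaked = active_after - active_before
print(f"points left behind by the quitter: {leaked:.2f}")
```

Each such departure leaves only a few points behind, but the total over all ratings is conserved, so the points stay with the active players.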

It's hard to quantify because the Elo system is the only objective comparison we have, but over the course of the almost 30 years I've been watching my wife play, the Elo rating enjoyed by a player of a given hypothetical skill level has increased dramatically. Many are saying that for someone of the upper echelons, their rating is maybe 200 points higher than it would have been 30 years ago.

So back in 1991, my wife was in the top 30 women in the USA with a rating in the mid-1700s. Today, someone with that rating isn't even going to be in the top brackets of a serious tournament.

Despite all that, the usefulness of the rating system keeps it in use as a valuable tool. It seems that the ability to match players who have never seen each other before, ensuring interesting matches, is part of keeping the game competitive for those in it. And table tennis is also, because of this, one of what I believe are the few sports where men and women often play head-to-head (even though men generally have much higher ratings, on account of the sport requiring far more strength than you might suspect).

lemagedurage · 5y ago

I don't think there's an expectation that a skill rating is comparable throughout 20 years, because both individual players and how the game is played (the meta) change continuously.

But if that's true, then why would rating inflation be a problem?

gizmo686 · 5y ago

The game itself has not changed, so it still makes sense to compare players across time. It would be nice if we had a quantitative way of doing this, so we can make statements like 'the average professional player today is better than 20 years ago; a typical modern pro would win 60% of the time against one from 20 years ago'.

In some sense, it is not surprising that we do not have a system that accomplishes this. Since it is impossible to see the results of a game between players living in different time periods, we cannot get any data to prevent drift. You can still try to normalize the rankings. However, unless you have some independent way of measuring skill, you would need to make an assumption about the relative strength of players. Assuming the average skill of a professional is constant across time is probably not accurate, but closer to reality than what you get with unchecked inflation.

Aerroon · 5y ago

This might not be great for a sporty-sport, but I think that for a video game this would actually be an advantage. This kind of a rating inflation would mean that long-term players would see some numerical progress without really doing much better.

chongli · 5y ago

That seems like a simple problem to fix. When somebody quits, just subtract 800 points from the remaining ranked players, scaled accordingly such that their relative win probabilities remain the same.

Of course, the other issue is if the number of active players increases over time. In that case, it's not so easy to fix unless you start scaling down the number of starting points given to new players.

Perhaps a better thing to do would be to construct a model of the rating inflation over time and use that to correct for historical comparisons. It's still not particularly meaningful though, because you have no way to measure actual skill inflation.

asdgagbiobnio · 5y ago

You don't have to formally quit the game to stop playing. I played one ranked chess tournament in high school, quit for ten years, and then picked it back up. What would you do with my points?

If you choose to delete them, that means that everyone will have constantly eroding ratings unless they keep playing.

dvt · 5y ago

> It doesn't suffer from the weaknesses that you cite, but even so, the problem of "rating inflation" is widely discussed.

Ah yes! Inflation is also a problem I've seen in competitive online games. Rating inflation was a serious issue with World of Warcraft PvP arenas circa 10 years ago (iirc Blizzard hard capped arena ratings at 3000 during WotLK). I don't follow chess much, and I'm not exactly sure how chess avoids it (or even if it does).

dmurray · 5y ago

There's also an inherent deflation effect. Players tend to get better over time. In the simplest case, if we start with a pool of players rated 800 and let them play for a year, at the end they'll be better players but still rated 800 on average.

Most chess Elo systems have an inflationary component where young or new players (who are overall faster improvers than the player pool at large) gain and lose points faster than established players (in detail, either using performance ratings or increased k-factors or both). In a balanced rating system, the sources of inflation and deflation are roughly equal. You can tweak the parameters to keep it this way, though it's not trivial to tell whether there is "real" inflation over the years or whether players are simply playing better - or indeed, what's the difference.
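
A rough sketch of how a larger k-factor for newer players injects points into the pool. The thresholds and K values below are illustrative assumptions, not the rules of any particular federation.

```python
def k_factor(games_played, provisional_games=30, k_new=40.0, k_established=20.0):
    """Hypothetical schedule: newer players move faster than established ones."""
    return k_new if games_played < provisional_games else k_established


def update(rating, opponent_rating, score, games_played):
    """Standard Elo update, but with a K that depends on the player's experience."""
    expected = 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))
    return rating + k_factor(games_played) * (score - expected)


# A newcomer (5 games) beats an established player (100 games) at equal ratings.
newcomer_after = update(1500.0, 1500.0, 1.0, games_played=5)
established_after = update(1500.0, 1500.0, 0.0, games_played=100)

# With unequal K the exchange is no longer zero-sum.
net_change = (newcomer_after + established_after) - 3000.0
print(newcomer_after, established_after, net_change)
```

When the faster-moving player wins, more points enter the pool than leave it; when they lose, the reverse happens, so the net effect depends on how quickly new players are actually improving.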

k__ · 5y ago

Why don't they increase the bar for newbies to get into such a system?

If they know that some people just play a few games and then quit, let's say they can only get an Elo rating once they have played a specific amount of time or won at least n games, etc.

BSTRhino · 5y ago

There is a minimum of 10 games before people start being ranked. People who quit early don't get ranked. People who have played 10 games gain a new long-term goal.

keeganpoppen · 5y ago

all of this, plus an additional observation that i've had about games w/ tiers/divisions: player skill is assumed to be normally distributed when that is just so demonstrably not the case-- there is a fairly high skill floor to be able to play the game at all, and the right tail (high skill) of the distribution is WAY fatter than the left.

especially with well-established, popular games-- Chess, League of Legends, Overwatch, etc. (where there is even a financial interest in being a top player to boot), the skill levels of the people at the absolute top simply profoundly dwarf players that would even superficially seem "comparable" by the standard of being in adjacent, or even within the same tier.

in League of Legends, for example, it is often claimed that the differences between players in high/low Challenger, high/low Master's, and even high/low "high diamond" (low d2 vs high d1) all constitute distinct "tiers" of player quality that are as substantial as the full-tier jumps closer to the median (e.g. silver -> gold, gold -> platinum), but because of this shoehorned prior about skill distribution it leads to this compression at the very top.

_dps · 5y ago

> player skill is assumed to be normally distributed when that is just so demonstrably not the case

This is close but not exactly right, and the small difference matters. Elo does not assume that skill is normally distributed, but rather that "quality of play" in a single game is normally distributed around some average quality level for the player. Obviously this too is an approximation but it's a much smaller one.
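
To make the distinction concrete, here is a small sketch of the per-game performance model next to the logistic curve most implementations use today. The per-player standard deviation of 200 points and the 400-point logistic scale are conventional values assumed here, not something stated in this thread.

```python
import math


def win_prob_normal(r_a, r_b, sigma=200.0):
    """P(A outperforms B) when each single-game performance is N(rating, sigma^2)."""
    # The difference of two independent normals has standard deviation sigma * sqrt(2).
    z = (r_a - r_b) / (sigma * math.sqrt(2.0))
    # Standard normal CDF written via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def win_prob_logistic(r_a, r_b, scale=400.0):
    """The logistic approximation commonly used in place of the normal model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


for gap in (0, 100, 200, 400):
    print(gap, round(win_prob_normal(1500 + gap, 1500), 3),
          round(win_prob_logistic(1500 + gap, 1500), 3))
```

Neither function says anything about how ratings are distributed across the player population; both only describe the outcome of one game given two ratings.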

keeganpoppen · 5y ago

hmm, interesting. i did mean to say that this is a problem more in the context of games that add tiers/divisions to their ranked ladders, but i hadn't really thought about elo making an assumption about the normal-distributed-ness of player deviation from their "true" skill level. does that not just fall out directly from the Central limit theorem (given the taking of large #s of samples (game W/L observations vs. predicted P(win|my elo, their elo)) of means, etc.)?

Someone · 5y ago

“player skill is assumed to be normally distributed”

I would think player skill level (at best; there easily can be cases where P typically beats Q, Q beats R and R beats P) is an ordinal (https://en.wikipedia.org/wiki/Ordinal_data), so one can’t say “player P is twice as good as player Q”, “player P is as much better than player Q as player R is better than player S”, and certainly can’t prove or disprove whether skill is being normally distributed. It is customary to assume that, though.

Also, if one assigns numbers to skill levels, those can be normally distributed. It probably is possible to design an Elo-like system that, given enough games, guarantees that the set of skill level numbers of all players approaches being normally distributed.

Aerroon · 5y ago

Another thing to consider with a lot of these games is that they're not static. The game changes and this can boost one player's rating up when their preferred champions/heroes/whatever are strong at that time. Even if the game didn't change, there are so many different characters that play differently enough that the player's results with them could end up at a rather different rating.

Godel_unicode · 5y ago

Here's season 13 of rocket league. Free red delicious apple for the first person to correctly identify the shape of the curve:

https://2p1ipt36o1g23g1tt6ba5nou-wpengine.netdna-ssl.com/wp-...

lmm · 5y ago

> Most games aren't chess -- where the only variance is picking who's black and who's white -- in fact, they might include dozens of RNG mechanics (from critical strikes to ability rolls, to spawn points). These mechanics (while fun and well-designed) might pollute your "idealized" model. There's also the problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics which will also heavily skew win rates. For instance, given a slow combo Magic deck, you will most likely auto-concede to mono red aggro (regardless of skill level). If you're using Elo, this will pollute your model. (Hint: you shouldn't be using Elo.)

None of which matters? All that means is that the results of individual games are a bit higher variance. Elo handles that by design. If you lose a certain proportion of Magic games to less-skilled players then this should be considered a reflection of your skill, because the only reasonable definition of skill at the game is the rate at which you actually win it; anything else can be gamed and so should be ignored.

> Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss-ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.

That's also something that Elo handles just fine? If every game is a coin flip then everyone will end up with the same Elo. If player A has x more Elo points than player B, then they win y% of their games. If your game has a skill ceiling where even a complete beginner always wins, say, 20% of their games, then that just means no-one will ever be able to rise above a corresponding Elo rating.
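
For concreteness, the "x more points means y% wins" relationship can be written out with the standard logistic Elo curve. The base-10, 400-point scale is the conventional chess choice and is assumed here; the function names are made up for this sketch.

```python
import math


def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def rating_gap_for_win_rate(win_rate, scale=400.0):
    """Inverse of the curve: the gap at which the stronger side scores win_rate."""
    return scale * math.log10(win_rate / (1.0 - win_rate))


# If a complete beginner still wins 20% of games against the best player,
# the best player scores at most 80%, which caps the gap between them.
max_gap = rating_gap_for_win_rate(0.80)
print(round(max_gap, 1))  # roughly 241 points on this scale
```

A game with more luck simply compresses the rating range under this model; in the pure coin-flip limit the required gap for a 50% win rate is zero.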

dvt · 5y ago

> That's also something that Elo handles just fine? If every game is a coin flip then everyone will end up with the same Elo. If player A has x more Elo points than player B, then they win y% of their games. If your game has a skill ceiling where even a complete beginner always wins, say, 20% of their games, then that just means no-one will ever be able to rise above a corresponding Elo rating.

That's not how it works. The distribution you end up with will not be uniform, it will look like this (just ran Elo with a coinflip; 11 players, 1000 matches): https://imgur.com/9O82pRj

On the long term, I think this will tend to a geometric distribution with a low p value.

lmm · 5y ago

Show your working?

If you're matchmaking players against equal-ranked players, then each match is just +/- 50 points, you'll get a binomial distribution which tends to normal as n gets large (assuming a large player pool so each player's results are independent). If players play players with different ratings then that will tend to push their rating back towards neutral. You certainly don't get a geometric distribution because the rating algorithm is completely symmetric.
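
Anyone who wants to check this can run a coin-flip league themselves. The sketch below is one possible setup, not the one used for the linked image: random pairing, K = 32, a 1500 starting rating, and a fixed seed are all assumptions.

```python
import random
import statistics


def expected_score(r_a, r_b):
    """Expected score of A against B under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def simulate_coinflip_league(n_players=11, n_matches=1000, k=32.0,
                             start=1500.0, seed=0):
    """Every match is decided by a fair coin; ratings are updated with plain Elo."""
    rng = random.Random(seed)
    ratings = [start] * n_players
    for _ in range(n_matches):
        a, b = rng.sample(range(n_players), 2)   # pick two distinct players
        score_a = 1.0 if rng.random() < 0.5 else 0.0
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta                      # zero-sum: A's gain is B's loss
        ratings[b] -= delta
    return ratings


ratings = simulate_coinflip_league()
print(sorted(round(r) for r in ratings))
print("mean:", round(statistics.mean(ratings), 3),
      "stdev:", round(statistics.stdev(ratings), 1))
```

Because the update is zero-sum, the mean stays at the starting rating whatever the coin does; raising `n_players` and `n_matches` and plotting a histogram shows the shape of the spread around it.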

Wohlf · 5y ago

>Why so many game devs are enamored with Elo when it comes to ranking is a bit bizarre.

Probably just Occam's Razor, they don't know of or care to make something better and can just pull Elo off the shelf.

tmpz22 · 5y ago

Another example would be competitive overwatch where the developer's stated goal was an equal distribution of rated players throughout the various ranks (bronze/silver/gold/platinum/diamond/masters/grandmasters). They tweaked the variables until they got their desired distribution, lumping the majority of players in gold and plat. Ranking up became an exercise in either playing hundreds of hours or starting a brand new account with fresh MMR.

Predictably this led to an explosion in boosting and win-trading services.

nl · 5y ago

> the developer's stated goal was an equal distribution of rated players throughout the various ranks

and

> They tweaked the variables until they got their desired distribution, lumping the majority of players in gold and plat

seem to be opposites. What am I missing here?

canofbars · 5y ago

It also probably doesn't even matter since the main problem is players intentionally losing to rank down and then smash lower rank players.

bsder · 5y ago

I also suspect that Chess is exponential because of the "one mistake and you die" nature when playing good players.

Ben Finegold (a Chess Grandmaster) talks about this all the time--"The reason why I'm higher rated than you is that I can play 100 moves without a major mistake and at some point you will hang a piece. The reason why Magnus Carlsen is rated higher than me is that he will play 100 moves that are slightly better than mine and I will lose."

keeganpoppen · 5y ago

also, what system(s) do you prefer / know of that handle multiplayer matchmaking well? it seems to me that a good system might be necessarily game-specific to some extent, although i'm sure the state of the art is much better than what i've experienced gaming to date xD.

dvt · 5y ago

> also, what system(s) do you prefer / know of that handle multiplayer matchmaking well?

None, and actually I don't think it's particularly healthy for the game. For example, I had plenty of fun casually pubbing Counter Strike in the early 2000s. When I wanted to take the game more seriously, I made a team and joined a league which might include group play, single/double elimination, and exhibition games. Actual competitive play (scrims, matches, tournaments) is fundamentally different than what today we call "matchmaking."

keeganpoppen · 5y ago

yeah, that strikes me as a pretty fair proscription, unfortunately-- the skill gap from coordinated team play in any team game makes it to where teams that play often together are matched against ad-hoc teams of individually more skilled players to make things "balanced", which, were it even possible to do this in the "50% win probability for each team" sense, still leads mostly to unfun matches one way or the other. and, of course, queueing with friends you don't play with often, or with high skill variation amongst them just completely screws you from a balance/rank perspective (but hey, at least you get to lose together with all your friends! :).

lemagedurage · 5y ago

Do you believe multiplayer games would be better off without matchmaking at all?

Godel_unicode · 5y ago

In your magic example you seem to be arguing that which kind of deck you pick is not part of your skill, which is of course totally incorrect. Picking "fun" decks over "obvious/OP" decks means you're worse at winning games. Or at least that you generally play with a handicap, which is easy to account for in elo.

To your coin-flip example, if you model a league in excel you'll find that elo actually results in a rank distribution very consistent with what your intuition would expect (given enough players and enough matches, of course).

dvt · 5y ago

> To your coin-flip example, if you model a league in excel you'll find that elo actually results in a rank distribution very consistent with what your intuition would expect (given enough players and enough matches, of course).

This is incorrect. If you simulate Elo with a coin-flip, you'll get something that looks like this (11 players, 1000 matches): https://imgur.com/9O82pRj -- I think this will tend to a geometric distribution (not sure what the p is though, probably depends on the constants).

Godel_unicode · 5y ago

>> given enough players and enough matches, of course

juped · 5y ago

Elo, not ELO, after Árpád Élő.

dvt · 5y ago

Correct, fixed :)

defertoreptar · 5y ago · 20 in thread

The author didn't benchmark to see if this system is actually any better at predicting outcomes than vanilla Elo. That's how you determine if your implied win probabilities are accurately being derived from rating differences. The author seems to be under the impression that there's something fixed and concrete about an 1800 rating, but when you change the system, you also change what an 1800 rating means in the first place.
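
As a rough sketch of what such a benchmark could look like: score each system's predicted win probabilities against held-out results with a proper scoring rule. The numbers below are toy values made up for illustration, not data from the article or the game.

```python
import math


def brier_score(predictions, outcomes):
    """Mean squared error between predicted win probability and the 0/1 result."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


def log_loss(predictions, outcomes, eps=1e-12):
    """Mean negative log-likelihood; punishes confident wrong predictions hard."""
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(outcomes)


# Toy held-out matches: 1 means the first-listed player won.
outcomes = [1, 1, 0, 1]
system_a = [0.8, 0.7, 0.3, 0.6]   # hypothetical predictions from one rating system
system_b = [0.5, 0.5, 0.5, 0.5]   # a system that learned nothing

print("Brier A:", brier_score(system_a, outcomes), "Brier B:", brier_score(system_b, outcomes))
print("LogLoss A:", log_loss(system_a, outcomes), "LogLoss B:", log_loss(system_b, outcomes))
```

Lower is better for both scores. The comparison is only meaningful if the scored matches were not used to fit either system.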

Some of these complaints are solved by existing systems, namely Glicko. For example, rating deviation helps with experienced players (low RD) losing points to newer players (high RD). It also has a built-in way to discourage inactivity. Players' RD increase over periods of inactivity, so they can be excluded from the leaderboard after reaching a certain point. That allows us to maintain their rating without decreasing it. After all, that's our best guess of the player's skill. It's just a less reliable guess over time.
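
A minimal sketch of the inactivity mechanism, following the Glicko-1 rule that RD grows as sqrt(RD^2 + c^2 * t) and is capped at 350. The value of `c` and the leaderboard cutoff below are illustrative assumptions that a real system would tune.

```python
import math

MAX_RD = 350.0  # RD of a completely unrated player in Glicko-1


def inflate_rd(rd, periods_inactive, c=34.6):
    """Grow the rating deviation over idle rating periods; the rating is untouched."""
    return min(math.sqrt(rd * rd + c * c * periods_inactive), MAX_RD)


def on_leaderboard(rd, rd_cutoff=100.0):
    """Hypothetical display rule: hide players whose rating is too uncertain."""
    return rd <= rd_cutoff


for idle in (0, 3, 10, 100):
    rd = inflate_rd(50.0, idle)
    print(idle, round(rd, 1), on_leaderboard(rd))
```

With `c = 34.6`, an RD of 50 takes about 100 idle periods to return to the unrated level, while the stored rating itself never changes.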

BSTRhino · 5y ago

There have been four rating systems, including Glicko and TrueSkill, and we received lots and lots of complaints about both of those systems. This new system receives few complaints. Tested across 135000 players. If the players had not complained so much, we would still be on Glicko. Those are the facts. The theories as to why that is are up to you.

jsnell · 5y ago

Optimizing a rating system for minimal complaints, maximum player engagement, or some similar metric is of course totally valid. It reminds me of Sirlin's story of being hired to design a rating system for Starcraft 2, and optimizing for totally different things than Blizzard wanted [0].

If the author (you?) had just described it in those terms, it'd be hard to object. But the article goes further, and makes claims about the system being more accurate due to a different rating curve. That's the claim that would need to be justified by actually comparing whether the predictions the new system makes really are better.

[0] http://sirlingames.squarespace.com/blog/2010/7/24/analyzing-...

an_opabinia · 5y ago

Part of a matchmaking algorithm that increases user engagement is telling a story about how it's more fair though.

Like our tolerance for losing is acquired. Most normal people losing in League for the first time stop playing, usually forever. Just randomly visit your friends' match histories in League, frequent players have many days of long losing streaks.

If you're just conditioned to play despite losing, great, in a Darwinian way (surviving, being around to be measured) you will be representative of the average player in League. And there are so many League players with such long retention you cannot possibly argue that skill-based matchmaking is the core component of user engagement.

His dataset is interesting because it will necessarily overrepresent people who kept playing despite the old system. That sort of refutes its importance - I mean sure people complain but they keep playing, so was it really that important? So what if complaints go down?

Those are important goals, and also, it's still an interesting twist in multiplayer game design. You just gotta interpret it as a commentary on a whole system even if it doesn't narrowly talk about a scientific objective like performance prediction.

lanius · 5y ago

The funny thing about SC2 is that the player's MMR (matchmaking rating) is decoupled from rank, due to design decisions such as demotions not occurring midseason. So a gold league player with a low enough MMR may get matched against bronze league players, despite ostensibly being ranked higher. Actually for the longest time after release, player MMR was not visible. It took 6 years and 2 expansions before it was finally displayed in-game.

Natsu · 5y ago

What you said makes me wonder about a totally different way of using a metric of how likely the person is to win a given match.

That is, perhaps the system could be engineered to maintain a more even win/loss ratio so that people don't go on super-long win (or loss) streaks in general by adjusting who they get matched with.

It probably wouldn't work that well towards the edges, but around the middle it might work well enough.

mcnamaratw · 5y ago

The "different rating curve" appears to be the actual historical probability data, not a formula. I think. If that's right then this estimate of the probability of winning is not a new discovery.

On the other hand I suspect the historical data really is the "best fit" to the historical data.

cortesoft · 5y ago

You have to decide what the purpose of the rating system is. Using it as a reward system for players to feel accomplishment is a different use case than trying to correctly predict the likely outcome of a game.

Personally, if I was designing a rating system, I would use two separate systems.

One would be like the one in this game; publicly viewable, pleases the players, and gives a sense of accomplishment.

Then, I would have a second, internal only rating system that players can't see but is used for matchmaking to make sure people are matched up to players with as close to equivalent skill as possible.

dmos62 · 5y ago

I find similar solutions experientially two-faced and frustrating. Imagine the matchmaking engine was a person behind a desk that you interact with. If he consistently told you one thing and then did something else you'd be displeased.

nobodyandproud · 5y ago

Some games do this already, and players are very unpleased by the results.

ajuc · 5y ago

That's what xp is in sc2. For several years xp was the only number that was visible - your true mmr was hidden.

People knew this and ignored xp altogether.

mcnamaratw · 5y ago

My understanding was that the system consists of using the historical odds of winning (given the rating difference). If you benchmark that using only past data, I think it is by definition the most accurate system. (The data is always a better fit to itself than a theoretical fit is.)

Naturally future data is much harder to deal with than past data. But even for future data it's not obvious that Elo (or any other theoretical fit to the odds of winning) will be more accurate than the historical odds.

BSTRhino · 5y ago

Yes, the best fit for the data is the data itself, it's a tautology. Nothing wrong with Elo's exponential curve, it just can't beat the actual data.

You raise a good point in that I could've created a training set and a test set, that probably would be a better validation. But I don't know, I'm not doing science, I'm making a game.

On the topic of whether the future matches the past, the predictions were based on a rolling database of the past 100000 matches, which is approximately the number of matches played per 7 days. So my theory is that the data is quite recent and up-to-date and so should match, in general.
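
One way such a rolling lookup might be built is sketched below. It is a guess at the general shape only: the class name, the 50-point bins, and the eviction logic are assumptions for illustration, not the game's actual code.

```python
from collections import defaultdict, deque


class EmpiricalWinCurve:
    """Observed win rate of the higher-rated player, bucketed by rating gap,
    over a rolling window of the most recent matches."""

    def __init__(self, window=100_000, bin_width=50):
        self.window = window
        self.bin_width = bin_width
        self.history = deque()            # (bin, higher_won) in arrival order
        self.games = defaultdict(int)     # matches seen per bin
        self.wins = defaultdict(int)      # wins by the higher-rated side per bin

    def _bin(self, rating_gap):
        return int(rating_gap // self.bin_width)

    def record(self, higher_rating, lower_rating, higher_won):
        b = self._bin(higher_rating - lower_rating)
        self.history.append((b, higher_won))
        self.games[b] += 1
        self.wins[b] += int(higher_won)
        if len(self.history) > self.window:
            # Drop the oldest match so the table tracks recent play only.
            old_bin, old_won = self.history.popleft()
            self.games[old_bin] -= 1
            self.wins[old_bin] -= int(old_won)

    def win_rate(self, rating_gap):
        """Observed win rate for this gap, or None if the bin has no data."""
        b = self._bin(rating_gap)
        if self.games[b] == 0:
            return None
        return self.wins[b] / self.games[b]


curve = EmpiricalWinCurve(window=3)
curve.record(1600, 1500, True)
curve.record(1600, 1500, False)
curve.record(1600, 1500, True)
rate_before = curve.win_rate(100)      # 2 wins out of 3
curve.record(1600, 1500, False)        # evicts the oldest (a win)
rate_after = curve.win_rate(100)       # 1 win out of 3
```

Sparse bins at large rating gaps will be noisy, which is where a smooth theoretical curve and a raw lookup table are most likely to disagree.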

Of course I never tested this. In the end, I'm not doing science, I'm making a game. If the retention goes up, complaints are down, then I can't keep working on the rating system, there are 1000 other things to do.

roenxi · 5y ago

> I think it is by definition the most accurate system

By gum, an opportunity to quibble semantics on the internet. That is true if benchmark using means 'only admit to knowing' and accuracy means 'must be numerically quantifiable given existing data'. It is false otherwise, especially if accuracy means 'conforming to truth' and we have a model for how the numbers are being generated.

Obviously if I generate a set of numbers by sampling a normal distribution then the most accurate model is a normal distribution, no matter what empirical data I use for benchmarking.

That is to say, if we know how the data was generated (sans noise) we can reject empirical distributions as the most accurate, because we can directly know the distribution of the data.

mcnamaratw · 5y ago

Ok, that is a legitimate ... quibble. Let's assume that we don't already know the correct distribution. In that case we're going to judge each theoretical fit by how close it comes to the historical data. (Or else we're going to get that wrong, which is another common approach.) Elo is much more prestigious and credible than some guy who made a game, but it is less credible than data, for some number of data points N. (Although I think a theory can be more prestigious than data almost independent of N.)

ponker · 5y ago

Well, it’s like the question of what is better: a restaurant with 4.5 stars on 4 reviews or one with 4.2 stars on 1,500 reviews?

prionassembly · 5y ago

Is the point really predicting outcomes? FIDE (chess) Elo is useful because I can compare machines to humans who have never matched each other.

Generally speaking the "rating structure" is a lattice where you can, for any two players A and B, tell whether A is a better player than B or the other way around. Elo, Glicko, etc. are embeddings of this lattice on the real line (much like the utility functions of microeconomics are real embeddings of preference lattices).

mlyle · 5y ago

> Generally speaking the "rating structure" is a lattice where you can, for any two players A and B, tell whether A is a better player than B or the other way around.

Not really. You tend to have cycles, where person A can beat person B who beats person C who beats person A.

There's a guy who was even with the guys playing Go who were 2 or 3 stones weaker than me that would tend to beat me because of some of the unorthodox things he did. (Eventually I strengthened my game against these things).

Considering ratings to be a total ordering is a useful approximation.

defertoreptar · 5y ago

> Is the point really predicting outcomes?

Others have pointed out how there is a psychological aspect of rating systems, and no developer wants to constantly field complaints. That said, I believe the answer is yes. A rating system derives meaningfulness from its predictive power. In other words, people want to know how good they actually are compared to one another.

naravara · 5y ago

For a lot of online games I think matchmaking based on pushing you towards a 50/50 win rate is kind of missing the point of games. It gives you fair odds of winning, but it doesn’t necessarily give you even odds of having a fun or competitive game.

At high skill levels players are skilled enough to where it might, but with most online multiplayers the overwhelmingly vast majority of players are lacking in basic fundamentals to varying degrees. At that level, Elo-based matchmaking mostly just results in one person getting rolled or doing the rolling. They’re not really competitive games in my experience.

Godel_unicode · 5y ago

If two players of similar skill generally roll one another, that's a game design problem, not a rankings problem.

edaemon · 5y ago · 11 in thread

The "newbie suppression" mechanic doesn't make much sense to me. If you play against someone substantially lower in rating than you and lose, shouldn't you lose a significant amount of points? After all, you lost to someone you should have easily beaten.

ganonm · 5y ago

I agree, and the proposed solution which is to limit point gains/losses to one point per game feels like throwing the baby out with the bathwater. Specifically, convergence takes a long time, the result of which is that a very good player on e.g. a new account (smurf) will end up being the cause of a lot of unbalanced games for an awful long time.

Having played a lot of ranked LoL, I saw a few recurring but irrational gripes players had with the Elo based system:

- "I get matched with bad teammates and they drag me down". On average your teammates are the same Elo as you. All players get their fair share of games where they are/aren't the underdog side. On average, it averages out. Deal with it.

- "I've been stuck at the same Elo for ages but I should be higher". Nope, Elo only cares if you win or lose. It doesn't care about kill/death ratio, creep score or how many ganks you pull off. Focus on winning more. Incidentally, focusing on winning instead of secondary metrics like kills/CS was one of the biggest mindset differences between high/low Elo players.

- "I should be higher Elo but I play support roles so can't climb". It may be true that you climb slower but here's the rub - think of your matchups as you being compared to the enemy team's support player. The other four roles on each team are actually a constant factor (by symmetry arguments you could not consistently find that your four teammates are any better/worse than the enemy support player's teammates). As a result, the only remaining factor in the statistical equation is you weighed up against the enemy support player. If you can provide even a slight statistical advantage towards winning vs them then you will climb the Elo ladder.

aaronblohowiak5y ago

> As a result, the only remaining factor in the statistical equation is you weighed up against the enemy support player. If you can provide even a slight statistical advantage towards winning vs them then you will climb the Elo ladder.

An alternative explanation is that the skill ceiling is lower for support players.

ecdavis5y ago

Being a low skill player playing exclusively with and against other low skill players sucks. Imagine playing doubles tennis where all four players hit the ball directly into the net 80% of the time. Win or lose it would be an unpleasant experience. I think that's the root of many people's frustration with ranked matchmaking in e-sports games.

I mentioned this in another comment, but I think if e-sports wants to become truly culturally significant the games will need to figure out how to more elegantly bridge the gap between ranked and unranked play, and how to make it fun to play at low skill levels. I don't think it's a coincidence that Fortnite has done both.

mlyle5y ago

It doesn't make sense as part of a rating system, but it may be good for community. Chess is pretty toxic with people refusing to play lower rated people, sometimes, because of danger to rating. Also, people with very low ratings are more likely to have their rating not represent their true ability (my son lost a game against a girl rated 300 that played a very accurate 1400-ish looking game against him... turns out she hadn't been in a rated game for 3 years despite being very active with her local chess club for that time; meanwhile it takes two tournaments or more to get those rating points back).

cortesoft5y ago

This is why you have two rating systems; one you show to people and is geared towards making players happy, and one that is used internally to create fair matches.

edaemon5y ago

It seems reasonable to cap the number of points you can lose, but it strikes me as odd that you'd lose fewer points in an upset than in an even match-up.

mlyle5y ago

Agreed on that point.

A better mechanism, IMO, even though it doesn't catch all the cases: provisional ratings, where you don't affect other people's ratings much for the first n games (and where your rating moves faster, too).

Effectively it acknowledges you have less of a prior when you've played few games.
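A minimal sketch of what provisional ratings could look like on top of plain Elo. The cutoff of 10 games, the two K values, and the damping factor are all made-up numbers for illustration, not taken from any real rating system:

```python
PROVISIONAL_GAMES = 10  # hypothetical cutoff for counting as a "new" player


def expected_score(r_a, r_b, scale=400.0):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def is_provisional(games_played):
    return games_played < PROVISIONAL_GAMES


def update(r_a, games_a, r_b, games_b, score_a,
           k_provisional=64.0, k_established=16.0, damping=0.5):
    """Return (new_r_a, new_r_b); score_a is 1 for an A win, 0 for a loss."""
    surprise = score_a - expected_score(r_a, r_b)
    # A provisional player's own rating moves faster (larger K).
    k_a = k_provisional if is_provisional(games_a) else k_established
    k_b = k_provisional if is_provisional(games_b) else k_established
    # An established player facing a provisional one is affected less.
    if is_provisional(games_b) and not is_provisional(games_a):
        k_a *= damping
    if is_provisional(games_a) and not is_provisional(games_b):
        k_b *= damping
    return r_a + k_a * surprise, r_b - k_b * surprise
```

With these numbers, a 100-game veteran who loses to a brand-new account at the same rating drops 4 points while the newcomer gains 32.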


gizmo6865y ago

There are 2 issues being solved by "newbie suppression".

1) Cases where the ranking of the opponent is not well known. Either because they are on a new account, or recently had a massive change in skill level (say, practicing on a different account; or not playing and losing ranking points due to the decay mechanic).

2) Inconsistent play. Ranking systems generally assume skill level is mostly static, with gradual changes over time. In practice, people have bad days. Limiting the influence of any single game reduces the noise introduced by inconsistent play at the expense of making convergence a slower process.

edaemon5y ago

I agree that those are real problems that need to be dealt with. I don't think suppressing the effects of a loss against a lower-rated player solves them very well, though. The suppression could just be applied in those cases instead of in every case where there's a large rating disparity.

There are plenty of cases where a highly-rated player loses to a low-rated player fair and square. In those cases there should be significant rating adjustments -- at least more significant than a normal game between similarly-rated opponents -- but this system removes those to combat edge cases. I think it would be more effective to deal with those edge cases directly.

krallja5y ago

Newbies have a low rating because they’re new, not because they definitely suck.

edaemon5y ago

Sure, but the suppression described doesn't apply only to new players, it applies to "someone substantially lower in rating than you."

HideousKojima5y ago· 6 in thread

The more obvious solution is to bring back custom lobbies and private servers and forget about ranking players at all. Gets rid of a lot of bad behavior too because servers can police their own communities and players won't get frustrated when a crappy teammate is dragging their ranking down

LoSboccacc5y ago

idk, that makes it extremely hard to find matches in games with a smaller player base

see war thunder, the simulation queue is a desert, high tier ships a wasteland, unless all the available players get forcibly lumped together matches will just not happen

compare with stormworks too, most servers are empty in my timezone and the populated ones are password protected or spawn limited, it wouldn't take much to get known and participate in their community but for working gamers the time commitment is simply impossible.

same with arma3, I'd love to get into shack tac but timezone and commitments make it unavailable to me, and since most of the good players are sucked up in teams the public servers are a mess of "what's left" of the community

HideousKojima5y ago

Matchmaking without custom servers/lobbies makes finding a match even harder, since a minimum number of users in a specific ranking/skill level/ship tier/whatever must all be online and searching for a match at the same time. Custom servers and lobbies allow just one or two players to start, and they advertise to other players that they are available to play. The initial players just need to wait until more people show up, and can play more casual game modes or with bots or whatever until more people arrive.

cortesoft5y ago

The purpose of ranking systems is to try to create fair matches.

I think ideally you wouldn't show the ranking to the player, just use it to create the match.

With a large population, everyone should end up winning about half their games. That would be the sign of a successful ranking system.

ecdavis5y ago

You're conflating ranking and matchmaking systems.

A ranking system measures relative skill/performance levels. You can have a ranking system without using it for matchmaking.[0]

I don't agree that a typical 50% win rate[1] indicates a successful matchmaking system. For one thing, creating fair matches is _a_ purpose of a matchmaking system, but not necessarily the _sole_ purpose. For another, that people win half their games on average says nothing about how fair the matches were.

I think that fairness often gets prioritized over fun. Playing sports should be fun at all levels, but it's particularly important at the lower skill levels that the participants enjoy themselves. That's how sports grow and become cultural institutions. Being a low skill player in a silo of other low skill players is a decidedly un-fun experience that drives a lot of new players away from e-sports. A ranked matchmaking system could be designed with the express purpose of helping low skill players have fun and naturally develop into average skill players.[2] I wonder what such a system would look like.

[0] See FiveThirtyEight's Elo ratings for NFL teams: https://projects.fivethirtyeight.com/2019-nfl-predictions/

[1] However that's measured.

[2] Under such a system fairness might be relegated to the seeding process for tournament play.

cortesoft5y ago

> I don't agree that a typical 50% win rate

Assuming there are no ties, and teams have an even number of players, a multiplayer competitive video game is going to be a zero sum game; for every winner, there is going to be a loser.

While I agree that it isn't the SOLE purpose of a matchmaking system, I do think a fair matchmaking system will end up with most players having a 50% win rate (with a few people at either end of the skill spectrum having lower or higher win rates). If you are winning more of your games, you should play against better players until you start losing again (and vice versa). You should eventually hit an equilibrium where you are playing people you have about a 50% chance of beating.


ColeyG5y ago

In certain communities, players will choose ranked vs. unranked almost always. I agree that a ranked + custom lobby model should exist though.

IshKebab5y ago· 4 in thread

TrueSkill definitely has a time decay term and I'm fairly sure it lets you fit the model to previous games. I wonder if the author actually tried it. (Though to be fair I'm not sure if there are open source versions of the latest version of TrueSkill.)

BSTRhino5y ago

Yes, tried Glicko then TrueSkill, both generated huge amounts of complaints. New system produced few complaints. If the community had liked it, would've stuck with TrueSkill.

IshKebab5y ago

TrueSkill 1 presumably?

oli56795y ago

https://pypi.org/project/trueskill/

karlding5y ago

The parent is talking about TrueSkill 2 [0], while the trueskill Python library you linked currently only supports the original TrueSkill algorithm [1][2]. TrueSkill 2 takes into account individual scores of players in order to weigh the contribution of each player to each team. The idea is that this allows TrueSkill 2 to converge faster for new players.

[0] https://www.microsoft.com/en-us/research/publication/trueski...

[1] https://github.com/sublee/trueskill/issues/27

[2] https://www.microsoft.com/en-us/research/project/trueskill-r...

BSTRhino5y ago· 1 in thread

Wow, I wrote this article ages ago, didn't expect to see it posted here today.

I just want to clarify the point of the article:

Why would you fit a curve to the data when you can just use the actual data?

That's the point of the article.

We're in the age of big data, we should use it to make better win rate predictions. Elo's exponential curve is fine, it's approximately right, it's just now we can have databases of millions of games and we can just do better. Elo was invented before the big data age and it is limited by that.
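To make "just use the actual data" concrete, here is a rough sketch of one way to do it: bucket past games by rating difference and read off the observed win rate per bucket. The function name, the input shape, and the 50-point bucket width are assumptions for illustration, not the game's actual implementation:

```python
from collections import defaultdict


def empirical_win_rates(games, bucket_width=50):
    """Observed win rate per rating-difference bucket.

    games: iterable of (rating_diff, won), where rating_diff is
    own rating minus opponent rating and won is a bool.
    Returns {bucket_start: win_rate}, ordered by bucket.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for diff, won in games:
        # Floor division puts e.g. -30 into the [-50, 0) bucket.
        bucket = (diff // bucket_width) * bucket_width
        total[bucket] += 1
        if won:
            wins[bucket] += 1
    return {b: wins[b] / total[b] for b in sorted(total)}
```

A lookup table like this needs enough games per bucket to be trustworthy, which is why sparse buckets at the extremes tend to look noisy.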

That's all I'm saying.

I shouldn't have included all the other stuff in the article, it just distracts from the point.

OisinMoran5y ago

Thanks for writing the article and sharing your work with the world, I really enjoyed it! I think the central point you make is very interesting.

I'd be interested to know what fit you used for the red "line of best fit", why not a straight line? My main question here is do you actually expect a player ~210 points above another to win _less_ than if they were only ~190 points above? (the first dip in the red graph)

dcl5y ago· 1 in thread

If you're interested in evaluating and rating/ranking agents, it might be worthwhile checking out DeepMind's multidimensional Elo rating system (https://arxiv.org/abs/1806.02643) which attempts to solve some of the issues with Elo and Glicko. Most notably, the ability to handle non-transitive interactions (like rock, paper, scissors) and the presence of redundant duplications of matches that might erroneously inflate ratings.

Shameless plug, I've created an R implementation of it here: https://dclaz.github.io/mELO/

sali05y ago

This is fantastic, thank you for bringing this up.

dang5y ago· 1 in thread

I recall at least one large previous thread about Elo but can't find it. Anyone?

jsnell5y ago

Maybe https://news.ycombinator.com/item?id=16255910

letmeinhere5y ago· 1 in thread

Isn't that a logarithmic curve?

CodesInChaos5y ago

It's a sigmoid, which converges to exponential far from 0 and is somewhat linear near 0.

CodesInChaos5y ago

1. The sigmoid function is the closest thing to linear that makes sense on probabilities⁺. A purely linear function would cross 0/100%, while the sigmoid flattens exponentially as it approaches the extreme values.

2. The fit isn't as bad as the author claims. It looks like the biggest difference between the graphs is that the point differences are scaled differently (400 pts for 90% in elo vs 800 pts in the second graph).

A quick and dirty overlay of the two graphs shows a reasonable fit: https://ibb.co/0YwYH9z

3. I like observations about player psychology. Satisfying the players is more important than having the mathematically best ranking system.

4. Personally I like Whole History Ranking (https://www.remi-coulom.fr/WHR/), but it's unlikely to be popular with players (the psychological criticisms the article makes apply to it as well, with some additional problems, like rank drifting without playing). KGS, which uses a ranking system similar to WHR (but more primitive), certainly draws a lot of criticism for its ranking system.

If I had to design a mathematically optimal ranking system, I'd start with WHR and make parts of it trainable/fittable.

----

⁺ Bayes' theorem turns into addition when applied to logarithmic probabilities and the sigmoid function converts from logarithmic probabilities to normal probabilities. This property is why it (or its multi category equivalent softmax) is used when predicting probabilities using logistic regression or neural networks.
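A small numeric check of that footnote, as a sketch: updating a probability with Bayes' theorem directly gives the same answer as adding the log of the likelihood ratio to the log-odds and pushing the sum through the sigmoid. The function names are just for illustration:

```python
import math


def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bayes_direct(prior, likelihood_ratio):
    """Posterior for a binary hypothesis, written as a ratio of products."""
    return prior * likelihood_ratio / (prior * likelihood_ratio + (1.0 - prior))


def bayes_log_odds(prior, likelihood_ratio):
    """Same update, done as an addition in log-odds space."""
    return sigmoid(logit(prior) + math.log(likelihood_ratio))
```

Both functions agree for any prior strictly between 0 and 1, e.g. a prior of 0.5 with evidence that is 4x as likely under the hypothesis gives 0.8 either way.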

IanGabes5y ago

Creating a custom system to suit your situation's needs sounds great and the thought process was fun to read, but some of the claims lobbed here are pretty questionable.

Specifically, the claim that Dota's matchmaking system is "probably wrong" because the model chosen doesn't match your own findings feels like a reach. Sibling commenters have pointed out how skill variance is important to allow the Elo system to function in games like chess. Additionally, someone else pointed out that the sigmoid function is similar to a linear function close to zero.

It seems at least as likely that Acolytefight doesn't have a high enough level of skill expression present in the game to see top players "curve out" weaker players, rather than exponential functions mapping player skill to be useless or wrong.

Does elo suck? Maybe, but this hasn't convinced me.

jrek5y ago

Elo might or mightn't suck (imo it's a great ranking system). But the article sucks. Vanilla Elo is built around chess, and some adjustments to the scale and/or K-factor might be necessary to fit the circumstances. A quick change of scale to Ea = 1 / (1 + 10 ^ ((Rb - Ra) / 800)) and all of a sudden Elo very accurately reflects the game's actual results: https://imgur.com/a/rFP5U0g

Meaning just that skill is a weaker factor in this game than in chess...

Edit: The 'actual' curve includes a correction for the obvious anomaly of ~55% win expectation at 0 point delta.
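For reference, a sketch of the expected-score formula with the scale as a parameter, written in the usual Elo form with the whole `1 + 10^(...)` term in the denominator. The default of 400 is the chess convention; 800 is the rescaling suggested here:

```python
def expected_score(r_a, r_b, scale=400.0):
    """Expected score of player A against player B.

    scale is the rating gap at which the stronger player is
    a 10:1 favourite (about 90.9%).
    """
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))
```

At scale 400 a 400-point lead is worth about 90.9%; at scale 800 the same lead is worth only about 76%, and it takes an 800-point lead to reach 90.9%. A larger scale therefore says rating gaps translate into wins less sharply.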

runarberg5y ago

A while back the Go server where I play most of my Go these days, [OGS](https://online-go.com), changed their ratings from Elo to Glicko-2.

You can read their rationale for it in this forum: https://forums.online-go.com/t/ogs-has-a-new-glicko-2-based-...

The key takeaway is this:

> Most of the shortcomings [of Elo] can be traced back to the fact that the system is too slow to find a player’s correct rank, and too slow to adapt when jumps in strength occur.

> The problem of slow moving ratings is a well-known problem with Elo implementations. In response to this, Prof. Mark Glickman developed the Glicko, and later Glicko-2, rating systems which address this problem very well and are fairly widely used

A few weeks ago they made an update to their implementation of Glicko-2, and in the announcement they summarized many interesting statistics on how the system has panned out for them: https://forums.online-go.com/t/2020-rating-and-rank-tweaks-a...

noctilux5y ago

I'm curious about whether the author tried to optimize Elo's K factor. It's often left at 32, which is not reasonable for all contests. It's essentially related to the standard deviation of player skills: if there is a large range of skills, it should be large, and if there is a small range, it should be small. It's easy to tune by optimisation, and it has a huge effect on predictive ability.
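A rough sketch of what tuning K by optimisation could look like: replay the game history once per candidate K, score each prediction made before the result was known, and keep the K with the lowest average log loss. The candidate grid, the starting rating of 1500, and the input format are assumptions for illustration:

```python
import math


def log_loss_for_k(games, k, start=1500.0, scale=400.0):
    """Average log loss of Elo predictions when replaying games with a given K.

    games: chronological list of (player_a, player_b, score_a),
    with score_a equal to 1 (A won) or 0 (B won).
    """
    ratings = {}
    loss = 0.0
    for a, b, score_a in games:
        r_a = ratings.get(a, start)
        r_b = ratings.get(b, start)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))
        # Score the prediction before the ratings see the result.
        p_observed = e_a if score_a == 1 else 1.0 - e_a
        loss -= math.log(p_observed)
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b - k * (score_a - e_a)
    return loss / len(games)


def best_k(games, candidates=(4, 8, 16, 24, 32, 48, 64)):
    """Grid search: the candidate K with the lowest replayed log loss."""
    return min(candidates, key=lambda k: log_loss_for_k(games, k))
```

A proper evaluation would hold out later games rather than score the same history it tunes on, but even this crude version shows whether 32 is far off for a given game.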

im3w1l5y ago

> If we take a top-level player, and make them fight a high-level, mid-level and low-level player repeatedly until we can become statistically confident of their win rates against each, there is no reason why their win rates would fit an exponential curve.

When I first read this, I thought to myself "well we get to pick the scores, so it's exponential by definition". The problem becomes more clear when you express it without any reference to the scores.

If Player A wins 80% of the time against Player B, and Player B wins 80% of the time against Player C, how often does Player A win against Player C? This is a question purely in terms of observables. Elo makes a prediction here (94.1% of the time) and it can be either right or wrong. If it's wrong, then there is no valid assignment of scores.
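The 94.1% figure can be reproduced without ever mentioning ratings. Under Elo's logistic model rating gaps add, which is the same as saying the win odds multiply: 4:1 times 4:1 is 16:1, i.e. 16/17. A short sketch (function names are just for illustration):

```python
def win_prob_to_odds(p):
    return p / (1.0 - p)


def odds_to_win_prob(odds):
    return odds / (1.0 + odds)


def chained_win_prob(p_ab, p_bc):
    """Elo's implied P(A beats C) given P(A beats B) and P(B beats C)."""
    return odds_to_win_prob(win_prob_to_odds(p_ab) * win_prob_to_odds(p_bc))
```

Comparing `chained_win_prob(0.8, 0.8)` against the observed A-vs-C win rate is a direct test of whether any score assignment can fit the game.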

gverrilla5y ago

Isn't a qualitative system possible? It would be really complex to create for a game such as dota2 or cs:go, but maybe not for a simpler game. I will give cs:go as an example only because I know it very well. It would be possible, I believe, in theory, to measure player knowledge of specific in-game skills. New cs players, for instance, wouldn't know how to control recoil effectively, and 100% of global elite/pro players would be above a certain threshold regarding recoil control. On the other hand, you could say with a lot of confidence that a player who tries to reach high ground by pressing only +jump multiple times with no success, when he would need a crouch jump instead because of the height, is a noob. Elo or something similar could then be used to measure ranks within specific clusters only. And some form of mood/form on top of this, to allow for a better experience (even though I have played cs for 20y now, it could happen that I abandon the game for a few months, or that I have really bad focus because of external events).

I'm not sure if this makes sense, but what I know for sure is that as an experienced player, I can watch a player play a single game (sometimes a few rounds), and assess his average rank/skill level with high confidence, with no need of information from his prior games whatsoever, or detailed statistics of his gameplay.

There's something else to remember for high skill-ceiling games: winrate is not what really matters. A lot of times I will play a very good, balanced and fun game and lose. Sometimes it will even happen with very uneven scores like 16-5 or something...

closed5y ago

I am pretty sure the author is describing a well understood limitation of Elo, they just need a tiny bit of connecting to models.

Elo can be thought of as an approximation to item response theory models [1]. These describe skill as normally distributed, and whether one person will win using a logistic function (not exponential).

I think what the author has keyed in on is that afaik in simple Elo there is no slope coefficient for the logistic, but in general IRT models there is (called item discrimination). So in Elo you can't learn that flatter curve they show.

[1]: http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/P...

duaoebg5y ago

Repeated Bernoulli trials give rise to Gaussian distributions which is where the e exponential comes from.

This is an assumption and an approximation and is not necessarily a good fit. Pulling from actual probabilities would generally perform better.

The rest is massaging to better fit the different objectives.

Godel_unicode5y ago

If your curve is linear, it's because your game isn't that hard (or more formally, because winning and skill are less strongly correlated). This is tough for people to hear if their game is "designed to be a high-skill game".

The curve being linear means essentially that skill in the game confers less of a relative advantage. Chess is a good counterexample here, as is Rocket League. Both are games where difference in MMR is very strongly correlated with outcome, and both are games where skill is easily measured and highly correlated with ranking.

sytelus5y ago

Take a look at TrueSkill, a much better mathematically grounded system, created at Microsoft Research and used at scale in Xbox: https://en.m.wikipedia.org/wiki/TrueSkill

neolefty5y ago

How about coop games — what would you use to rate players where the goal is to win together?

EGreg5y ago

Wait why don’t we use a deep learning thingy on this dataset and just back out a formula that predicts the wins based on just the relative numbers of the people?

musicale5y ago

Nonsense - they're in the Rock and Roll Hall of Fame after all! Jeff Lynne is a musical genius.

philliphaydon5y ago

Elo was in Age of Empires back when zone .com was a thing.

It worked and worked well. Points were calculated for each person. However, Dota 2 and LoL don’t implement Elo the same way; points are calculated for the team. So if you have a low score and you win against high-rated people in Dota and LoL, you won’t gain many points.

I believe this is done to avoid being carried, but it doesn’t work because it just results in you being stuck in a low tier for ages.

TLDR: Elo works and it’s great. No one implements it right.

Edit: In Age of Empires / Zone, if you had a 4v4, it used all 8 players to calculate the Elo change for each individual player, so say your team ranged from 1750 Elo down to 1550 Elo with anything in between. The 1750 may gain only 1 point, while the 1550 may gain 16 points (the highest gain was lowered the more people played). On the losing side, the lowest Elo will lose the fewest points and the highest will lose the most.

Dota / LoL don't do this; the winning/losing team gains/loses the same amount of points. This is wrong.

This means a high elo player has the potential to farm points from low elo players with little risk. While low elo players get stuck not playing people in their own range.
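To illustrate the difference between the two approaches, here is a hedged sketch. It is not the actual Zone, Dota, or LoL formula (none of those are given here); it just compares "everyone on the team moves by the same amount" with "each player is compared individually against the opposing team's average", using plain Elo with an assumed K of 32:

```python
def expected_score(r_a, r_b, scale=400.0):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def average(ratings):
    return sum(ratings) / len(ratings)


def team_update(winners, losers, k=32.0):
    """Every player on a team gains or loses the same amount."""
    delta = k * (1.0 - expected_score(average(winners), average(losers)))
    return [r + delta for r in winners], [r - delta for r in losers]


def individual_update(winners, losers, k=32.0):
    """Each player is scored against the opposing team's average rating."""
    avg_w, avg_l = average(winners), average(losers)
    # A winner expected to win anyway gains little; an underdog gains a lot.
    new_w = [r + k * (1.0 - expected_score(r, avg_l)) for r in winners]
    # A loser expected to win loses a lot; an underdog loses little.
    new_l = [r - k * expected_score(r, avg_w) for r in losers]
    return new_w, new_l
```

With winners rated 1750 and 1550 against two 1650s, the team version gives both winners 16 points, while the individual version gives the 1550 noticeably more than the 1750.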

afwaller5y ago

This is useful to increase plays by reducing “ladder anxiety”

Godel_unicode5y ago

If two players of similar skill general roll one another, that's a game design problem not a rankings problem.

edaemon5y ago· 11 in thread

ganonm5y ago

Having played a lot of ranked LoL, I saw a few recurring but irrational gripes players had with the Elo based system:

aaronblohowiak5y ago

An alternative explanation is that the skill ceiling is lower for support players.

ecdavis5y ago

mlyle5y ago

cortesoft5y ago

This is why you have two rating systems; one you show to people and is geared towards making players happy, and one that is used internally to create fair matches.

edaemon5y ago

It seems reasonable to cap the number of points you can lose, but it strikes me as odd that you'd lose fewer points in an upset than in an even match-up.

mlyle5y ago

Agreed on that point.

Effectively it acknowledges you have less of a prior when you've played few games.

1 more reply

gizmo6865y ago

There are 2 issues being solved by "newbie suppression".

edaemon5y ago

krallja5y ago

Newbies have a low rating because they’re new, not because they definitely suck.

edaemon5y ago

Sure, but the suppression described doesn't apply only to new players, it applies to "someone substantially lower in rating than you."

HideousKojima5y ago· 6 in thread

LoSboccacc5y ago

idk that makes extremely hard to find matches in games with a smaller player base

see war thunder, the simulation queue is a desert, high tier ships a wasteland, unless all the available player get forcibly lumped together matches will just not happen

HideousKojima5y ago

cortesoft5y ago

The purpose of ranking systems is to try to create fair matches.

I think ideally you wouldn't show the ranking to the player, just use it to create the match.

With a large population, everyone should end up winning about half their games. That would be the sign of a successful ranking system.

ecdavis5y ago

You're conflating ranking and matchmaking systems.

A ranking system measures relative skill/performance levels. You can have a ranking system without using it for matchmaking.[0]

[0] See FiveThirtyEight's Elo ratings for NFL teams: https://projects.fivethirtyeight.com/2019-nfl-predictions/

[1] However that's measured.

[2] Under such a system fairness might be relegated to the seeding process for tournament play.

cortesoft5y ago

> I don't agree that a typical 50% win rate

Assuming there are no ties, and teams have an even number of players, a multiplayer competitive video game is going to be a zero sum game; for every winner, there is going to be a loser.
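
A minimal sketch of that zero-sum argument, assuming 1-on-1 matches with no ties. The `matches` list is made-up data for illustration only. It shows that the population-wide win rate is pinned at 50% even when individual players sit far from it:

```python
from collections import Counter

def win_rates(matches):
    """matches: list of (winner, loser) pairs, one pair per tie-free match."""
    wins = Counter()
    games = Counter()
    for winner, loser in matches:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Per-player win rate can be anywhere between 0 and 1.
    per_player = {player: wins[player] / games[player] for player in games}
    # Every match adds one win and two player-games, so this is always 0.5.
    overall = sum(wins.values()) / sum(games.values())
    return per_player, overall

# Hypothetical results: "a" beats everyone, "c" loses to everyone.
matches = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
per_player, overall = win_rates(matches)
print(per_player, overall)
```

The overall figure is fixed by construction; whether each individual player also lands near 50% is what depends on the matchmaking.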


ColeyG5y ago

In certain communities, players will choose ranked vs. unranked almost always. I agree that a ranked + custom lobby model should exist though.

IshKebab5y ago· 4 in thread

BSTRhino5y ago

Yes, tried Glicko then TrueSkill, both generated huge amounts of complaints. New system produced few complaints. If the community had liked it, would've stuck with TrueSkill.

IshKebab5y ago

TrueSkill 1 presumably?

oli56795y ago

https://pypi.org/project/trueskill/

karlding5y ago

[0] https://www.microsoft.com/en-us/research/publication/trueski...

[1] https://github.com/sublee/trueskill/issues/27

[2] https://www.microsoft.com/en-us/research/project/trueskill-r...

BSTRhino5y ago· 1 in thread

Wow, I wrote this article ages ago, didn't expect to see it posted here today.

I just want to clarify the point of the article:

Why would you fit a curve to the data when you can just use the actual data?

That's the point of the article.

That's all I'm saying.

I shouldn't have included all the other stuff in the article, it just distracts from the point.
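
One way to read "just use the actual data" is an empirical lookup table: bucket past games by rating difference and record how often the higher-rated player actually won. The sketch below is only an illustration of that idea, not the article's implementation; the function name `empirical_win_table`, the `bucket_size` parameter, and the sample games are all assumptions.

```python
from collections import defaultdict

def empirical_win_table(games, bucket_size=50):
    """games: iterable of (rating_a, rating_b, a_won) tuples.

    Returns {bucket_start: observed win rate of the higher-rated player},
    where bucket_start is the absolute rating gap rounded down to bucket_size.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [higher-rated wins, games]
    for rating_a, rating_b, a_won in games:
        diff = rating_a - rating_b
        # Re-orient the result so it is always from the higher-rated side.
        higher_won = a_won if diff >= 0 else not a_won
        bucket = (abs(diff) // bucket_size) * bucket_size
        counts[bucket][0] += int(higher_won)
        counts[bucket][1] += 1
    return {bucket: won / total for bucket, (won, total) in sorted(counts.items())}

# Hypothetical game log, for illustration only.
games = [
    (1500, 1480, True),
    (1480, 1500, True),
    (1600, 1500, True),
    (1500, 1610, False),
    (1650, 1500, False),
]
table = empirical_win_table(games)
print(table)
```

With few games per bucket the table is noisy, which is the usual argument for smoothing it or fitting a curve instead.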

OisinMoran5y ago

Thanks for writing the article and sharing your work with the world, I really enjoyed it! I think the central point you make is very interesting.

dcl5y ago· 1 in thread

Shameless plug, I've created an R implementation of it here: https://dclaz.github.io/mELO/

sali05y ago

This is fantastic, thank you for bringing this up.

dang5y ago· 1 in thread

I recall at least one large previous thread about Elo but can't find it. Anyone?

jsnell5y ago

Maybe https://news.ycombinator.com/item?id=16255910

letmeinhere5y ago· 1 in thread

Isn't that a logarithmic curve?

CodesInChaos5y ago

It's a sigmoid, which converges to exponential far from 0 and is somewhat linear near 0.
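
A small sketch of that shape, using the standard Elo expected-score formula with the usual 400-point scale. The helper names are made up for this example. It compares the sigmoid with an exponential in the tail and with a straight line near a rating difference of 0:

```python
import math

def elo_expected(diff, scale=400.0):
    """Elo expected score for a player rated `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))

def exponential_tail(diff, scale=400.0):
    """Approximation for large negative diff: the sigmoid decays like 10^(diff/scale)."""
    return 10.0 ** (diff / scale)

def linear_near_zero(diff, scale=400.0):
    """First-order approximation around diff = 0 (slope ln(10) / (4 * scale))."""
    return 0.5 + diff * math.log(10.0) / (4.0 * scale)

for diff in (-1200, -20, 0, 20, 400):
    print(diff, elo_expected(diff), exponential_tail(diff), linear_near_zero(diff))
```

The exponential column is only meaningful for the large negative differences, and the linear column only for the small ones.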

CodesInChaos5y ago

A quick and dirty overlay of the two graphs shows a reasonable fit: https://ibb.co/0YwYH9z

3. I like observations about player psychology. Satisfying the players is more important than having the mathematically best ranking system.

If I had to design a mathematically optimal ranking system, I'd start with WHR and make parts of it trainable/fittable.

----

IanGabes5y ago

Creating a custom system to suit your situation's needs sounds great and the thought process was fun to read, but some of the claims lobbed here are pretty questionable.

Does elo suck? Maybe, but this hasn't convinced me.

jrek5y ago

Meaning just that skill is a weaker factor in this game than in chess...

Edit: The 'actual' curve includes a correction for the obvious anomaly of ~55% win expectation at 0 point delta.

runarberg5y ago

I remember that a while back, the Go server where I play most of my Go these days, [OGS](https://online-go.com), changed its ratings from Elo to Glicko-2.

You can read their rationale for it in this forum thread: https://forums.online-go.com/t/ogs-has-a-new-glicko-2-based-...

The key takeaway is this:

> Most of the shortcomings [of Elo] can be traced back to the fact that the system is too slow to find a player’s correct rank, and too slow to adapt when jumps in strength occur.

noctilux5y ago

im3w1l5y ago

gverrilla5y ago

closed5y ago

I am pretty sure the author is describing a well-understood limitation of Elo. They just need to connect it to the existing models.

Elo can be thought of as an approximation to item response theory models [1]. These describe skill as normally distributed, and whether one person will win using a logistic function (not exponential).

[1]: http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/P...

duaoebg5y ago

Repeated Bernoulli trials give rise to Gaussian distributions which is where the e exponential comes from.

This is an assumption and an approximation, and is not necessarily a good fit. Pulling from actual probabilities would generally perform better.

The rest is massaging to better fit the different objectives.

Godel_unicode5y ago

sytelus5y ago

Take a look at TrueSkill, a much better mathematically grounded system, created at Microsoft Research and used at scale in Xbox: https://en.m.wikipedia.org/wiki/TrueSkill

neolefty5y ago

How about coop games — what would you use to rate players where the goal is to win together?

EGreg5y ago

Wait why don’t we use a deep learning thingy on this dataset and just back out a formula that predicts the wins based on just the relative numbers of the people?

musicale5y ago

Nonsense - they're in the Rock and Roll Hall of Fame after all! Jeff Lynne is a musical genius.

philliphaydon5y ago

Elo was in Age of Empires back when zone.com was a thing.

I believe this is done to avoid being carried but it doesn’t work because it just results in you being stuck in a Low tier for ages.

TLDR: elo works and it’s great. No one implements it right.

dota / lol don't do this, the winning/losing team gains/loses the same amount of points. This is wrong.

This means a high elo player has the potential to farm points from low elo players with little risk. While low elo players get stuck not playing people in their own range.

afwaller5y ago

This is useful to increase plays by reducing “ladder anxiety”
