Most games aren't chess -- where the only variance is picking who's black and who's white -- in fact, they might include dozens of RNG mechanics (from critical strikes to ability rolls, to spawn points). These mechanics (while fun and well-designed) might pollute your "idealized" model. There's also the problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics which will also heavily skew win rates. For instance, given a slow combo Magic deck, you will most likely auto-concede to mono red aggro (regardless of skill level). If you're using Elo, this will pollute your model. (Hint: you shouldn't be using Elo.)
Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss-ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.
TrueSkill attempts to fix (3) by using clever Bayesian updating on a player-by-player basis[1] but in reality, it's a shit-show. Using Elo (or variants thereof) for team-based games where the team isn't really a team (more like 3-5 random people plopped together for one match) is incredibly misguided, but continues to be implemented in just about every modern multiplayer game (to the players' frustration). Of course, mixing and matching pre-made groups with non pre-made groups creates as many issues as you might imagine.
In short, why so many game devs are enamored with Elo when it comes to ranking is a bit bizarre.
[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...
It seems that much of the problem comes from rating points brought in by newbie players (and note that, contra TFA, the problem isn't with experienced players losing to newbies, but the opposite).
A newbie is started off with some nominal rating; I forget the number, but let's say it's 800. Most likely that newbie is going to lose his first matches, and some proportion of those newbies will get frustrated and quit. For the ones that stay in the game, things probably work out in the long run. But for those that got discouraged and quit, in the course of their loss they caused a few points (not many, because they're likely way overmatched, but definitely more than 0) to be credited to their opponents. When they quit the sport, they're never going to reclaim any of the rating points that they lost initially. But those points are still in the system, having been added to their winning opponents.
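To put rough numbers on that mechanism, here's a minimal sketch using the textbook Elo update. The starting ratings, the K-factor of 32, and the five-game losing streak are all made-up values for illustration, not anything taken from a real federation's rules.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def play(r_winner, r_loser, k=32):
    """Apply one zero-sum Elo update and return (new_winner, new_loser)."""
    gain = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + gain, r_loser - gain


# Hypothetical scenario: an 800-rated newbie loses five games to a
# 1500-rated regular and then quits for good.
veteran, newbie = 1500.0, 800.0
for _ in range(5):
    veteran, newbie = play(veteran, newbie)

# Points the newbie brought into the pool but did not take back out.
points_left_behind = 800.0 - newbie
print(f"veteran: {veteran:.2f}, newbie: {newbie:.2f}, "
      f"left behind: {points_left_behind:.2f}")
```

Each individual game moves only a fraction of a point, which matches the "not many, but definitely more than 0" description; the effect only matters when summed over a lot of departing players.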
It's hard to quantify because the Elo system is the only objective comparison we have, but over the course of the almost 30 years I've been watching my wife play, the Elo rating enjoyed by a player of a given hypothetical skill level has increased dramatically. Many are saying that for someone of the upper echelons, their rating is maybe 200 points higher than it would have been 30 years ago.
So back in 1991, my wife was in the top 30 women in the USA with a rating in the mid-1700s. Today, someone with that rating isn't even going to be in the top brackets of a serious tournament.
Despite all that, the usefulness of the rating system keeps it in use as a valuable tool. It seems that the ability to match players who have never seen each other before, ensuring interesting matches, is part of keeping the game competitive for those in it. And because of this, table tennis is also one of what I believe are few sports where men and women often play head-to-head (even though men generally have much higher ratings, on account of the sport requiring far more strength than you might suspect).
But if that's true, then why would rating inflation be a problem?
In some sense, it is not surprising that we do not have a system that accomplishes this. Since it is impossible to see the results of a game between players living in different time periods, we cannot get any data to prevent drift. You can still try to normalize the rankings. However, unless you have some independent way of measuring skill, you would need to make an assumption about the relative strength of players. Assuming the average skill of a professional is constant across time is probably not accurate, but closer to reality than what you get with unchecked inflation.
That seems like a simple problem to fix. When somebody quits, just subtract 800 points from the remaining ranked players, scaled accordingly such that their relative win probabilities remain the same.
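A quick sketch of why an even subtraction is harmless to the predictions: Elo's expected score depends only on the rating difference, so taking the same amount off everyone changes nothing about who is favored. The `drain_surplus` helper, the player names, and the 30-point surplus are hypothetical; how much a departed player actually left behind is something you'd have to track yourself.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def drain_surplus(ratings, surplus):
    """Remove `surplus` points from the pool, split evenly across players."""
    per_player = surplus / len(ratings)
    return {name: r - per_player for name, r in ratings.items()}


# Hypothetical pool and surplus, for illustration only.
ratings = {"alice": 1700.0, "bob": 1500.0, "carol": 1320.0}
before = expected_score(ratings["alice"], ratings["bob"])

adjusted = drain_surplus(ratings, surplus=30.0)
after = expected_score(adjusted["alice"], adjusted["bob"])

print(before, after)  # the two predictions are the same
```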
Of course, the other issue is if the number of active players increases over time. In that case, it's not so easy to fix unless you start scaling down the number of starting points given to new players.
Perhaps a better thing to do would be to construct a model of the rating inflation over time and use that to correct for historical comparisons. It's still not particularly meaningful though, because you have no way to measure actual skill inflation.
If you choose to delete them, that means that everyone will have constantly eroding ratings unless they keep playing.
Ah yes! Inflation is also a problem I've seen in competitive online games. Rating inflation was a serious issue with World of Warcraft PvP arenas circa 10 years ago (iirc Blizzard hard capped arena ratings at 3000 during WotLK). I don't follow chess much, and I'm not exactly sure how chess avoids it (or even if it does).
Most chess Elo systems have an inflationary component where young or new players (who are overall faster improvers than the player pool at large) gain and lose points faster than established players (in detail, either using performance ratings or increased k-factors or both). In a balanced rating system, the sources of inflation and deflation are roughly equal. You can tweak the parameters to keep it this way, though it's not trivial to tell whether there is "real" inflation over the years or whether players are simply playing better - or indeed, what's the difference.
If they know that some people just play a few games and then quit, they could, say, only award an Elo rating once a player has played for a specific amount of time or won at least n games, etc.
especially with well-established, popular games-- Chess, League of Legends, Overwatch, etc. (where there is even a financial interest in being a top player to boot), the skill levels of the people at the absolute top simply profoundly dwarf players that would even superficially seem "comparable" by the standard of being in adjacent, or even within the same tier.
in League of Legends, for example, it is often claimed that the differences between players in high/low Challenger, high/low Master's, and even high/low "high diamond" (low d2 vs high d1) all constitute distinct "tiers" of player quality that are as substantial as the full-tier jumps closer to the median (e.g. silver -> gold, gold -> platinum), but because of this shoehorned prior about skill distribution it leads to this compression at the very top.
This is close but not exactly right, and the small difference matters. Elo does not assume that skill is normally distributed, but rather that "quality of play" in a single game is normally distributed around some average quality level for the player. Obviously this too is an approximation but it's a much smaller one.
I would think player skill level (at best; there easily can be cases where P typically beats Q, Q beats R and R beats P) is an ordinal (https://en.wikipedia.org/wiki/Ordinal_data), so one can’t say “player P is twice as good as player Q”, “player P is as much better than player Q as player R is better than player S”, and certainly can’t prove or disprove whether skill is being normally distributed. It is customary to assume that, though.
Also, if one assigns numbers to skill levels, those can be normally distributed. It probably is possible to design an ELO-like system that, given enough games, guarantees that the set of skill level numbers of all players approaches being normally distributed.
https://2p1ipt36o1g23g1tt6ba5nou-wpengine.netdna-ssl.com/wp-...
None of which matters? All that means is that the results of individual games are a bit higher variance. Elo handles that by design. If you lose a certain proportion of Magic games to less-skilled players then this should be considered a reflection of your skill, because the only reasonable definition of skill at the game is the rate at which you actually win it; anything else can be gamed and so should be ignored.
> Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss-ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.
That's also something that Elo handles just fine? If every game is a coin flip then everyone will end up with the same Elo. If player A has x more Elo points than player B, then they win y% of their games. If your game has a skill ceiling where even a complete beginner always wins, say, 20% of their games, then that just means no-one will ever be able to rise above a corresponding Elo rating.
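For concreteness, here is the x-to-y mapping under the usual Elo convention (base 10, 400-point scale), plus its inverse. The 20% floor is just the example figure from above; the function names are mine.

```python
import math


def expected_score(rating_diff):
    """Win probability for the player who is `rating_diff` points ahead."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))


def rating_gap(win_prob):
    """Inverse mapping: the rating gap that corresponds to `win_prob`."""
    return 400.0 * math.log10(win_prob / (1.0 - win_prob))


for diff in (0, 100, 200, 400):
    print(diff, round(expected_score(diff), 3))

# If the weakest player still wins 20% against the strongest, the strongest
# wins 80%, so the whole ladder spans at most this many points:
print(round(rating_gap(0.80), 1))
```

So a game with that much built-in luck just produces a compressed ladder roughly 240 points wide; the model itself doesn't break.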
That's not how it works. The distribution you end up with will not be uniform, it will look like this (just ran Elo with a coinflip; 11 players, 1000 matches): https://imgur.com/9O82pRj
On the long term, I think this will tend to a geometric distribution with a low p value.
If you're matchmaking players against equal-ranked players, then each match is just +/- 50 points, you'll get a binomial distribution which tends to normal as n gets large (assuming a large player pool so each player's results are independent). If players play players with different ratings then that will tend to push their rating back towards neutral. You certainly don't get a geometric distribution because the rating algorithm is completely symmetric.
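Rather than argue from one screenshot, here's a small seeded simulation anyone can rerun. It is only a sketch: the random pairing, K=32, and the 1500 starting rating are my assumptions and not necessarily what produced the linked image.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def simulate_coinflip_league(n_players=11, n_matches=1000, k=32, seed=0):
    """Pair players at random, decide each game by a fair coin, update Elo."""
    rng = random.Random(seed)
    ratings = [1500.0] * n_players
    for _ in range(n_matches):
        a, b = rng.sample(range(n_players), 2)
        score_a = 1.0 if rng.random() < 0.5 else 0.0
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta  # zero-sum: whatever A gains, B loses
        ratings[b] -= delta
    return ratings


ratings = simulate_coinflip_league()
print(sorted(round(r) for r in ratings))
```

Change `seed` and `n_matches` to see how much of any single snapshot is just noise. The one thing that holds for every run is that the update is zero-sum, so the average rating stays at 1500.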
Probably just Occam's Razor, they don't know of or care to make something better and can just pull Elo off the shelf.
Predictably this led to an explosion in boosting and win-trading services.
and
> They tweaked the variables until they got their desired distribution, lumping the majority of players in gold and plat
seem to be opposites. What am I missing here?
Ben Finegold (a Chess Grandmaster) talks about this all the time--"The reason why I'm higher rated than you is that I can play 100 moves without a major mistake and at some point you will hang a piece. The reason why Magnus Carlsen is rated higher than me is that he will play 100 moves that are slightly better than mine and I will lose."
None, and actually I don't think it's particularly healthy for the game. For example, I had plenty of fun casually pubbing Counter Strike in the early 2000s. When I wanted to take the game more seriously, I made a team and joined a league which might include group play, single/double elimination, and exhibition games. Actual competitive play (scrims, matches, tournaments) is fundamentally different than what today we call "matchmaking."
To your coin-flip example, if you model a league in excel you'll find that elo actually results in a rank distribution very consistent with what your intuition would expect (given enough players and enough matches, of course).
This is incorrect. If you simulate Elo with a coin-flip, you'll get something that looks like this (11 players, 1000 matches): https://imgur.com/9O82pRj -- I think this will tend to a geometric distribution (not sure what the p is though, probably depends on the constants).
Some of these complaints are solved by existing systems, namely Glicko. For example, rating deviation helps with experienced players (low RD) losing points to newer players (high RD). It also has a built-in way to discourage inactivity. Players' RD increase over periods of inactivity, so they can be excluded from the leaderboard after reaching a certain point. That allows us to maintain their rating without decreasing it. After all, that's our best guess of the player's skill. It's just a less reliable guess over time.
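A minimal sketch of the inactivity part, using the RD update from the original Glicko description: RD grows as sqrt(RD² + c²·t), capped at 350. The constant `c` has to be tuned per game; here it is picked so that an RD of 50 decays back to 350 after 100 idle rating periods, and the leaderboard cutoff of 100 is a made-up threshold.

```python
import math

MAX_RD = 350.0  # RD assigned to a completely unrated player in Glicko


def inflate_rd(rd, c, periods_inactive):
    """Grow a player's rating deviation after some idle rating periods."""
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_inactive), MAX_RD)


def on_leaderboard(rd, max_rd_for_leaderboard=100.0):
    """Hypothetical rule: hide players whose rating has become too uncertain."""
    return rd <= max_rd_for_leaderboard


# Illustrative tuning: an RD of 50 returns to 350 after 100 idle periods.
c = math.sqrt((350.0 ** 2 - 50.0 ** 2) / 100)

for idle in (0, 1, 10, 100):
    rd = inflate_rd(50.0, c, idle)
    print(idle, round(rd, 1), on_leaderboard(rd))
```

Note that the rating itself never changes here, only the confidence in it, which is the point made above.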
If the author (you?) had just described it in those terms, it'd be hard to object. But the article goes further, and makes claims about the system being more accurate due to a different rating curve. That's the claim that would need to be justified by actually comparing whether the predictions the new system makes really are better.
[0] http://sirlingames.squarespace.com/blog/2010/7/24/analyzing-...
Like our tolerance for losing is acquired. Most normal people losing in League for the first time stop playing, usually forever. Just randomly visit your friends' match histories in League: frequent players have many days of long losing streaks.
If you're just conditioned to play despite losing, great, in a Darwinian way (surviving, being around to be measured) you will be representative of the average player in League. And there are so many League players with such long retention you cannot possibly argue that skill-based matchmaking is the core component of user engagement.
His dataset is interesting because it will necessarily overrepresent people who kept playing despite the old system. That sort of refutes its importance - I mean sure people complain but they keep playing, so was it really that important? So what if complaints go down?
Those are important goals, and also, it's still an interesting twist in multiplayer game design. You just gotta interpret it as a commentary on a whole system even if it doesn't narrowly talk about a scientific objective like performance prediction.
That is, perhaps the system could be engineered to maintain a more even win/loss ratio so that people don't go on super-long win (or loss) streaks in general by adjusting who they get matched with.
It probably wouldn't work that well towards the edges, but around the middle it might work well enough.
On the other hand I suspect the historical data really is the "best fit" to the historical data.
Personally, if I was designing a rating system, I would use two separate systems.
One would be like the one in this game; publicly viewable, pleases the players, and gives a sense of accomplishment.
Then, I would have a second, internal only rating system that players can't see but is used for matchmaking to make sure people are matched up to players with as close to equivalent skill as possible.
People knew this and ignored xp altogether.
Naturally future data is much harder to deal with than past data. But even for future data it's not obvious that ELO (or any other theoretical fit to the odds of winning) will be more accurate than the historical odds.
You raise a good point in that I could've created a training set and a test set, that probably would be a better validation. But I don't know, I'm not doing science, I'm making a game.
On the topic of whether the future matches the past, the predictions were based on a rolling database of the past 100000 matches, which is approximately the number of matches played per 7 days. So my theory is that the data is quite recent and up-to-date and so should match, in general.
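For readers wondering what "predict from a rolling database" could look like in practice, here's a rough sketch. This is not the game's actual code: the class name, the 50-point buckets, and the data layout are all assumptions; only the idea of keeping the most recent N matches comes from the description above.

```python
from collections import deque


class EmpiricalWinRates:
    """Win rates per rating-difference bucket over the most recent matches."""

    def __init__(self, window=100_000, bucket_size=50):
        self.matches = deque(maxlen=window)  # oldest match drops off the left
        self.bucket_size = bucket_size
        self.games = {}  # bucket -> games currently in the window
        self.wins = {}   # bucket -> wins by the player the diff is measured from

    def _bucket(self, rating_diff):
        return int(rating_diff // self.bucket_size)

    def record(self, rating_diff, player_won):
        """Add one match; forget the oldest one if the window is full."""
        if len(self.matches) == self.matches.maxlen:
            old_bucket, old_won = self.matches[0]
            self.games[old_bucket] -= 1
            self.wins[old_bucket] -= old_won
        bucket = self._bucket(rating_diff)
        self.matches.append((bucket, int(player_won)))
        self.games[bucket] = self.games.get(bucket, 0) + 1
        self.wins[bucket] = self.wins.get(bucket, 0) + int(player_won)

    def win_rate(self, rating_diff):
        """Observed win rate for this bucket, or None if there is no data."""
        games = self.games.get(self._bucket(rating_diff), 0)
        return self.wins[self._bucket(rating_diff)] / games if games else None


table = EmpiricalWinRates(window=3, bucket_size=50)
for diff, won in [(100, True), (110, False), (120, True)]:
    table.record(diff, won)
print(table.win_rate(105))  # two wins out of three in the 100-149 bucket
```

The obvious weakness is thin buckets at the extremes, where a handful of matches decide the prediction, which is where some smoothing or a fallback curve would come in.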
Of course I never tested this. In the end, I'm not doing science, I'm making a game. If the retention goes up, complaints are down, then I can't keep working on the rating system, there are 1000 other things to do.
By gum, an opportunity to quibble semantics on the internet. That is true if benchmark using means 'only admit to knowing' and accuracy means 'must be numerically quantifiable given existing data'. It is false otherwise, especially if accuracy means 'conforming to truth' and we have a model for how the numbers are being generated.
Obviously if I generate a set of numbers by sampling a normal distribution then the most accurate model is a normal distribution, no matter what empirical data I use for benchmarking.
That is to say, if we know how the data was generated (sans noise) we can reject empirical distributions as the most accurate, because we can directly know the distribution of the data.
Generally speaking the "rating structure" is a lattice where you can, for any two players A and B, tell whether A is a better player than B or the other way around. Elo, Glicko, etc. are embeddings of this lattice on the real line (much like the utility functions of microeconomics are real embeddings of preference lattices).
Not really. You tend to have cycles, where person A can beat person B who beats person C who beats person A.
There was a guy, about even with Go players 2 or 3 stones weaker than me, who would tend to beat me because of some of the unorthodox things he did. (Eventually I strengthened my game against these things.)
Considering ratings to be a total ordering is a useful approximation.
Others have pointed out how there is a psychological aspect of rating systems, and no developer wants to constantly field complaints. That said, I believe the answer is yes. A rating system derives meaningfulness from its predictive power. In other words, people want to know how good they actually are compared to one another.
At high skill levels players are skilled enough to where it might, but with most online multiplayers the overwhelmingly vast majority of players are lacking in basic fundamentals to varying degrees. At that level, ELO based matchmaking mostly just results in one person getting rolled or doing the rolling. They’re not really competitive games in my experience.
Having played a lot of ranked LoL, I saw a few recurring but irrational gripes players had with the Elo based system:
- "I get matched with bad teammates and they drag me down". On average your teammates are the same Elo as you. All players get their fair share of games where they are/aren't the underdog side. On average, it averages out. Deal with it.
- "I've been stuck at the same Elo for ages but I should be higher". Nope, Elo only cares if you win or lose. It doesn't care about kill/death ratio, creep score or how many ganks you pull off. Focus on winning more. Incidentally, focusing on winning instead of secondary metrics like kills/CS was one of the biggest mindset differences between high/low Elo players.
- "I should be higher Elo but I play support roles so can't climb". It may be true that you climb slower but here's the rub - think of your matchups as you being compared to the enemy team's support player. The other four roles on each team are actually a constant factor (by symmetry arguments you could not consistently find that your four teammates are any better/worse than the enemy support player's teammates). As a result, the only remaining factor in the statistical equation is you weighed up against the enemy support player. If you can provide even a slight statistical advantage towards winning vs them then you will climb the Elo ladder.
An alternative explanation is that the skill ceiling is lower for support players.
I mentioned this in another comment, but I think if e-sports wants to become truly culturally significant the games will need to figure out how to more elegantly bridge the gap between ranked and unranked play, and how to make it fun to play at low skill levels. I don't think it's a coincidence that Fortnite has done both.
A better mechanism, IMO, even though it doesn't catch all the cases: provisional ratings, where you don't affect other peoples' ratings much for the first n games (and where your rating moves faster, too).
Effectively it acknowledges you have less of a prior when you've played few games.
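Something like the following, as a sketch only. The 20-game threshold, the K values, and the 0.25 damping weight are placeholder numbers I picked; real systems tune these.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def updated_rating(rating, games, opp_rating, opp_games, score,
                   provisional_games=20):
    """Elo update with provisional handling (all constants are illustrative)."""
    # A provisional player's own rating moves faster.
    k = 40.0 if games < provisional_games else 20.0
    # A result against a provisional opponent counts for less.
    if opp_games < provisional_games:
        k *= 0.25
    return rating + k * (score - expected_score(rating, opp_rating))


# A brand-new account beats an established player with the same rating.
newbie = updated_rating(1500.0, 0, 1500.0, 300, score=1.0)
regular = updated_rating(1500.0, 300, 1500.0, 0, score=0.0)
print(newbie, regular)  # the newbie jumps 20, the regular drops only 2.5
```

One side effect worth knowing about: updates are no longer zero-sum, so the pool's average rating can drift.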
1) Cases where the ranking of the opponent is not well known. Either because they are on a new account, or recently had a massive change in skill level (say, practicing on a different account; or not playing and losing ranking points due to the decay mechanic).
2) Inconsistent play. Ranking systems generally assume skill level is mostly static, with gradual changes over time. In practice, people have bad days. Limiting the influence of any single game reduces the noise introduced by inconsistent play at the expense of making convergence a slower process.
There are plenty of cases where a highly-rated player loses to a low-rated player fair and square. In those cases there should be significant rating adjustments -- at least more significant than a normal game between similarly-rated opponents -- but this system removes those to combat edge cases. I think it would be more effective to deal with those edge cases directly.
See War Thunder: the simulation queue is a desert and high-tier ships are a wasteland. Unless all the available players get forcibly lumped together, matches will just not happen.
Compare with Stormworks too: most servers are empty in my timezone, and the populated ones are password protected or spawn limited. It wouldn't take much to get known and participate in their community, but for working gamers the time commitment is simply impossible.
Same with Arma 3: I'd love to get into Shack Tac, but timezone and commitments make it unavailable to me, and since most of the good players are sucked up into teams, the public servers are a mess of "what's left" of the community.
I think ideally you wouldn't show the ranking to the player, just use it to create the match.
With a large population, everyone should end up winning about half their games. That would be the sign of a successful ranking system.
A ranking system measures relative skill/performance levels. You can have a ranking system without using it for matchmaking.[0]
I don't agree that a typical 50% win rate[1] indicates a successful matchmaking system. For one thing, creating fair matches is _a_ purpose of a matchmaking system, but not necessarily the _sole_ purpose. For another, that people win half their games on average says nothing about how fair the matches were.
I think that fairness often gets prioritized over fun. Playing sports should be fun at all levels, but it's particularly important at the lower skill levels that the participants enjoy themselves. That's how sports grow and become cultural institutions. Being a low skill player in a silo of other low skill players is a decidedly un-fun experience that drives a lot of new players away from e-sports. A ranked matchmaking system could be designed with the express purpose of helping low skill players have fun and naturally develop into average skill players.[2] I wonder what such a system would look like.
[0] See FiveThirtyEight's Elo ratings for NFL teams: https://projects.fivethirtyeight.com/2019-nfl-predictions/
[1] However that's measured.
[2] Under such a system fairness might be relegated to the seeding process for tournament play.
Assuming there are no ties, and teams have an even number of players, a multiplayer competitive video game is going to be a zero sum game; for every winner, there is going to be a loser.
While I agree that it isn't the SOLE purpose of a matchmaking system, I do think a fair matchmaking system will end up with most players having a 50% win rate (with a few people at either end of the skill spectrum having lower or higher win rates). If you are winning more of your games, you should be matched against better players until you start losing again (and vice versa). You should eventually hit an equilibrium where you are playing people you have about a 50% chance of beating.
[0] https://www.microsoft.com/en-us/research/publication/trueski...
[1] https://github.com/sublee/trueskill/issues/27
[2] https://www.microsoft.com/en-us/research/project/trueskill-r...
I just want to clarify the point of the article:
Why would you fit a curve to the data when you can just use the actual data?
That's the point of the article.
We're in the age of big data, we should use it to make better win rate predictions. Elo's exponential curve is fine, it's approximately right, it's just now we can have databases of millions of games and we can just do better. Elo was invented before the big data age and it is limited by that.
That's all I'm saying.
I shouldn't have included all the other stuff in the article, it just distracts from the point.
I'd be interested to know what fit you used for the red "line of best fit", why not a straight line? My main question here is do you actually expect a player ~210 points above another to win _less_ than if they were only ~190 points above? (the first dip in the red graph)
Shameless plug, I've created an R implementation of it here: https://dclaz.github.io/mELO/
2. The fit isn't as bad as the author claims. It looks like the biggest difference between the graphs is that the point differences are scaled differently (400 pts for 90% in elo vs 800 pts in the second graph).
A quick and dirty overlay of the two graphs shows a reasonable fit: https://ibb.co/0YwYH9z
3. I like observations about player psychology. Satisfying the players is more important than having the mathematically best ranking system.
4. Personally I like Whole History Ranking (https://www.remi-coulom.fr/WHR/), but it's unlikely to be popular with players (the psychological criticisms the article makes apply to it as well, with some additional problems, like rank drifting without playing). KGS, which uses a ranking system similar to WHR (but more primitive), certainly draws a lot of criticism for its ranking system.
If I had to design a mathematically optimal ranking system, I'd start with WHR and make parts of it trainable/fittable.
----
⁺ Bayes' theorem turns into addition when applied to logarithmic probabilities and the sigmoid function converts from logarithmic probabilities to normal probabilities. This property is why it (or its multi category equivalent softmax) is used when predicting probabilities using logistic regression or neural networks.
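A tiny numeric check of that footnote, with an arbitrary prior of 0.2 and likelihood ratio of 3 chosen purely for illustration: updating in probability space and updating by adding log-odds then applying the sigmoid give the same posterior.

```python
import math


def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))


prior = 0.2             # illustrative prior probability
likelihood_ratio = 3.0  # illustrative evidence: P(data | H) / P(data | not H)

# Bayes' theorem written out in ordinary probabilities.
posterior_direct = (prior * likelihood_ratio) / (
    prior * likelihood_ratio + (1.0 - prior)
)

# The same update as a plain addition in log-odds space.
posterior_logodds = sigmoid(logit(prior) + math.log(likelihood_ratio))

print(posterior_direct, posterior_logodds)
```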
Specifically, the claim that Dota's matchmaking system is "probably wrong" because the model chosen doesn't match your own findings feels like a reach. Sibling commenters have pointed out how skill variance is important to allow the ELO system to function in games like chess. Additionally, someone else pointed out that the sigmoid function is similar to a linear function close to zero.
It seems at least as likely that Acolytefight doesn't have a high enough level of skill expression present in the game to see top players "curve out" weaker players, rather than exponential functions mapping player skill to be useless or wrong.
Does elo suck? Maybe, but this hasn't convinced me.
Meaning just that skill is a weaker factor in this game than in chess...
Edit: The 'actual' curve includes a correction for the obvious anomaly of ~55% win expectation at 0 point delta.
You can read their rationale for it in this forum: https://forums.online-go.com/t/ogs-has-a-new-glicko-2-based-...
The key takeaway is this:
> Most of the shortcomings [of Elo] can be traced back to the fact that the system is too slow to find a player’s correct rank, and too slow to adapt when jumps in strength occur.
> The problem of slow moving ratings is a well-known problem with Elo implementations. In response to this, Prof. Mark Glickman developed the Glicko, and later Glicko-2, rating systems which address this problem very well and are fairly widely used
A few weeks ago they then made an update to their implementation of Glicko-2, and in the announcement they summarized many interesting statistics on how the system has panned out for them: https://forums.online-go.com/t/2020-rating-and-rank-tweaks-a...
When I first read this, I thought to myself "well we get to pick the scores, so it's exponential by definition". The problem becomes more clear when you express it without any reference to the scores.
If Player A wins 80% of the time against Player B, and Player B wins 80% of the time against Player C, how often does Player A win against Player C? This is a question purely in terms of observables. Elo makes a prediction here (94.1% of the time) and it can be either right or wrong. If it's wrong, then there is no valid assignment of scores.
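The 94.1% figure falls out of the fact that, under Elo's logistic curve, rating gaps add and therefore odds multiply. A short check (the function name is mine):

```python
def implied_win_prob(p_ab, p_bc):
    """Elo's implied P(A beats C) given P(A beats B) and P(B beats C)."""
    odds_ab = p_ab / (1.0 - p_ab)
    odds_bc = p_bc / (1.0 - p_bc)
    odds_ac = odds_ab * odds_bc  # rating gaps add, so odds multiply
    return odds_ac / (1.0 + odds_ac)


# 4:1 odds twice gives 16:1 odds, i.e. 16/17.
print(round(implied_win_prob(0.8, 0.8), 3))
```

If your game's observed A-vs-C rate is far from what this function returns, that is direct evidence against the logistic assumption, independent of how ratings are scaled.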
I'm not sure if this makes sense, but what I know for sure is that as an experienced player, I can watch a player play a single game (sometimes a few rounds), and assess his average rank/skill level with high confidence, with no need of information from his prior games whatsoever, or detailed statistics of his gameplay.
There's something else to remember for high skill-ceiling games: winrate is not what really matters. A lot of times I will play a very good, balanced and fun game and lose. Sometimes it will even happen with very uneven scores like 16-5 or something...
Elo can be thought of as an approximation to item response theory models [1]. These describe skill as normally distributed, and whether one person will win using a logistic function (not exponential).
I think what the author has keyed in on is that afaik in simple Elo there is no slope coefficient for the logistic, but in general IRT models there is (called item discrimination). So in Elo you can't learn that flatter curve they show.
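To make the "slope" idea concrete, here's a sketch where the usual hard-coded 400 becomes a parameter. Reading a scale off a single observed point, as `scale_for` does, is a crude stand-in for properly fitting the curve; the function names are mine.

```python
import math


def expected_score(rating_diff, scale=400.0):
    """Logistic win probability; a larger `scale` gives a flatter curve."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / scale))


def scale_for(rating_diff, observed_win_prob):
    """Scale at which `rating_diff` points yields `observed_win_prob`."""
    return rating_diff / math.log10(observed_win_prob / (1.0 - observed_win_prob))


# Classic Elo: a 400-point gap means 10:1 odds (about 90.9%).
print(round(expected_score(400), 3))

# A game where it takes 800 points to reach the same odds needs scale=800,
# and under that scale a 400-point gap is worth much less.
print(scale_for(800, 10.0 / 11.0), round(expected_score(400, scale=800.0), 3))
```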
[1]: http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/P...
This is an assumption and an approximation and is not necessarily a good fit. Pulling from actual probabilities would generally perform better.
The rest is massaging to better fit the different objectives.
The curve being linear means essentially that skill in the game confers less of a relative advantage. Chess is a good counterexample here, also rocket league. Both are games where difference in MMR is very strongly correlated with outcome, and both are games where skill is easily measured and highly correlated with ranking.
It worked and worked well. Points were calculated for each person. However, Dota 2 and LoL don’t implement Elo the same way; points are calculated for the team. So if you have a low score and you win against high-rated people, in Dota and LoL you won’t gain many points.
I believe this is done to avoid being carried but it doesn’t work because it just results in you being stuck in a Low tier for ages.
TLDR: elo works and it’s great. No one implements it right.
Edit: In Age of Empires / Zone, if you had a 4v4, it used all 8 players to calculate the Elo change for each individual player. So if your team ranged from 1750 Elo down to 1550 Elo, the 1750 might gain only 1 point while the 1550 might gain 16 points (the highest gain was lowered the more people played). On the losing side, the lowest Elo would lose the fewest points and the highest would lose the most.
dota / lol don't do this, the winning/losing team gains/loses the same amount of points. This is wrong.
This means a high elo player has the potential to farm points from low elo players with little risk. While low elo players get stuck not playing people in their own range.
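Here's a sketch contrasting the two schemes. I don't know the exact Zone formula, so `per_player_deltas` is a guess at the spirit of it (each player scored against the opposing team's average), and K=32 plus the example ratings are made up.

```python
from statistics import mean


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def per_player_deltas(winners, losers, k=32):
    """Hypothetical individual scheme: each player vs. the other team's mean."""
    avg_w, avg_l = mean(winners), mean(losers)
    gains = [k * (1.0 - expected_score(r, avg_l)) for r in winners]
    losses = [k * (0.0 - expected_score(r, avg_w)) for r in losers]
    return gains, losses


def team_deltas(winners, losers, k=32):
    """Team scheme: one delta from the team averages, applied to everyone."""
    d = k * (1.0 - expected_score(mean(winners), mean(losers)))
    return [d] * len(winners), [-d] * len(losers)


winners, losers = [1750, 1550], [1700, 1600]
ind_gains, ind_losses = per_player_deltas(winners, losers)
team_gains, team_losses = team_deltas(winners, losers)
print([round(x, 1) for x in ind_gains], [round(x, 1) for x in ind_losses])
print([round(x, 1) for x in team_gains], [round(x, 1) for x in team_losses])
```

In the individual version the 1550 gains more than the 1750 for the same win, and the higher-rated loser drops more than the lower-rated one. In the team version everyone moves by the same amount. Note that the individual version as written is not guaranteed to be zero-sum.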