MCC (Master Chief Collection) Server Incident Summary

(halowaypoint.com)

120 points · DivisionSol · 4y ago · 32 comments

ozarker · 4y ago

Really cool to see this level of transparency on an issue from a multiplayer game dev. Really cool write up

stryan · 4y ago

Both Bungie and 343 have done an admirable job (well, compared to other devs) of explaining their network infrastructure etc. Back in the day they did a big talk about how their matchmaking in Halo 2/3 worked that I think is still, to this day, one of the best ways to learn when you're not in the industry yet. I can't recall what it was called though: it might be the "Chris Butcher - Recreating the LAN Party Online: The Networking and Social Infrastructure of Halo 2" GS talk, but I can't listen right now to check.

Darkphibre · 4y ago

Along these lines is the venerable TRIBES Engine Networking Model whitepaper. It was so good, it was shipped as part of the XDK for a decade. I believe Bungie even leaned on it quite a bit when creating their networking stack. https://www.gamedevs.org/uploads/tribes-networking-model.pdf

Disclaimer: I work at Microsoft Game Studios, but this comment reflects my own opinions.

coldpie · 4y ago

Only semi-related, but there was a recent excellent, short podcast interview series with one of Halo 2's multiplayer designers: https://smarturl.it/H2Pod

maicro · 4y ago

And similar to my sibling comment, mostly for anyone else browsing through - the comments on this recent HN post have a lot of resources about game dev stuff in general, which might be of interest (though I also cannot promise directly related X) ): https://news.ycombinator.com/item?id=31084779

Jasper_ · 4y ago

This talk from the Halo: Reach team might be one of my favorite game networking talks I've ever seen: https://www.youtube.com/watch?v=h47zZrqjgLc

verall · 4y ago

Anyone else having issues viewing this on Firefox? I can see the whole webpage for an instant, then everything disappears, then it is "slowing down my browser".

abbeyj · 4y ago

I am but only with uBlock Origin turned on. uBlock Origin blocks something and then the page goes into an infinite loop requesting https://wpcontent.svc.halowaypoint.com/purchase-content/game... and https://wpcontent.svc.halowaypoint.com/content-ratings/esrb/... over and over again. It is also rapidly using up memory so eventually you run low and the whole browser starts performing poorly. Whitelisting the site in uBlock Origin is a workaround.

zymhan · 4y ago

Odd, maybe another extension in combination is causing that issue? I'm running FF 100.0 (wow, I remember FF 3.5) with UBO and Privacy Badger.

spartanatreyu · 4y ago

I am running FF 99 with only uBlock Origin and I had no slowdowns

weberer · 4y ago

https://archive.ph/tJcbd

tjpnz · 4y ago

Loaded fine in Firefox for Android. Shame the text size makes the post unreadable.

mrguyorama · 4y ago

The most interesting part of this to me is that they have a path, both in code and in process, to fall back to peer-to-peer if stuff breaks. That's pretty impressive.

thatguy0900 · 4y ago

Now that big game companies are really starting to shut down old servers en masse it should be the default, really. https://www.gamespot.com/articles/ubisoft-shuts-down-online-...

jasomill · 4y ago

Also interesting and impressive: Halo multiplayer fans have modded the original Xbox version of Halo to act as a dedicated server for Xbox LAN multiplayer[1] that serves the same architectural role as 343's UHS does for MCC online play.

[1] http://halo1nhe.com

zymhan · 4y ago

Now I'm very curious how the P2P matchmaking is bootstrapped.

AgentME · 4y ago

The clients still talk to the official servers which run matchmaking and group up players together. The difference is whether the matchmaking servers tell all the players to connect to a dedicated gameserver or to connect to one of the players.
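A minimal sketch of that decision, assuming the matchmaking service has already grouped the players. The names `Assignment` and `assign_match` are hypothetical, and picking the first player as host is a placeholder; nothing here is taken from 343's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    mode: str  # "dedicated" or "p2p"
    host: str  # the address every client in the group is told to connect to


def assign_match(players: list[str], dedicated_server: Optional[str]) -> Assignment:
    """Decide where an already-formed group of players should connect."""
    if not players:
        raise ValueError("need at least one player")
    if dedicated_server is not None:
        # Normal path: everyone connects to the dedicated game server.
        return Assignment("dedicated", dedicated_server)
    # Fallback path: one of the players hosts. A real service would likely
    # weigh NAT type, bandwidth and latency; this sketch takes the first player.
    return Assignment("p2p", players[0])
```

Either way the clients only ever ask the matchmaking service where to go, which is why the switch between the two modes can happen server-side.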

xmodem · 4y ago

I know we normally never hear about this stuff from game companies at all, but it would be nice to have a little bit more detail.

> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.

What updates? How were they tested (or not)?

Darkphibre · 4y ago

The thing with using cloud services is that sometimes the hosting provider can make changes to configurations that have impacts (such as monitoring software that steals precious CPU at seemingly random moments, or network topology hardware that impacts connectivity)... without it necessarily being under your own control. I've had a few high-visibility incidents where, after investigation, the buck stopped with us even though technically the studio could have shifted blame to an unanticipated change from the hosting provider.

Disclaimer: I currently work at Microsoft Game Studios, though this comment reflects my own opinions based on experiences unrelated to this incident.

xmodem · 4y ago

Of course, that's totally understandable - I recently root caused an incident to "AWS likely changed how this component behaves at some point in a particular 2 month timespan."

But I think there's a lot more to be learned by sharing how STUN changed, what the new behaviour is, what the intent of the change was, how it was tested, etc.

For a counter-example of the level of detail I'd like to see, I saw this [1] DataDog incident report go by on Twitter this morning. This is straight up awesome and more detailed than most of our internal incident reports. I definitely learned a lot from reading it.

1: https://www.datadoghq.com/blog/engineering/grpc-dns-and-load...

darknavi · 4y ago

Interesting that they call their service UDS. I was under the impression that they used PlayFab.

tehbeard · 4y ago

MCC probably has an "interesting" architecture, as it's a combination of Halo games spanning a few generations, which makes something like PlayFab a bit too restrictive to work in.

sgtfrankieboy · 4y ago

Halo: The Master Chief Collection had already been out for 4 years (released 2014) when MSFT bought PlayFab (2018).

auto · 4y ago

Given that the error resulted from the STUN and ICE servers, which from my understanding exist solely to play a part in the NAT punching process, would this entire situation have been mitigated if things were end-to-end IPv6?
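For anyone unfamiliar with what a STUN exchange actually carries, here is a rough sketch of the wire format from RFC 5389: a client sends a 20-byte Binding Request, and the server replies with the client's public address in an XOR-MAPPED-ADDRESS attribute. The function names are made up for illustration, only IPv4 is handled, and this says nothing about how MCC's relay servers were configured.

```python
import os
import socket
import struct
from typing import Optional, Tuple

MAGIC_COOKIE = 0x2112A442  # fixed value present in every STUN header (RFC 5389)
BINDING_REQUEST = 0x0001   # STUN message type for a Binding Request


def build_binding_request(txn_id: Optional[bytes] = None) -> bytes:
    """Build a Binding Request with no attributes: type, length 0, cookie, txn ID."""
    if txn_id is None:
        txn_id = os.urandom(12)
    if len(txn_id) != 12:
        raise ValueError("transaction ID must be exactly 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txn_id


def decode_xor_mapped_ipv4(value: bytes) -> Tuple[str, int]:
    """Decode the 8-byte value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    if len(value) != 8:
        raise ValueError("IPv4 XOR-MAPPED-ADDRESS value must be 8 bytes")
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value)
    if family != 0x01:
        raise ValueError("this sketch only handles the IPv4 family (0x01)")
    port = xport ^ (MAGIC_COOKIE >> 16)  # port is XORed with the cookie's top 16 bits
    addr = xaddr ^ MAGIC_COOKIE          # address is XORed with the whole cookie
    return socket.inet_ntoa(struct.pack("!I", addr)), port
```

The point of the exchange is just "tell me what address you see me as"; ICE then uses those discovered addresses as candidates when trying to connect two peers.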

pilif · 4y ago

Only in theory. In practice even when IPv6 is in use, people have stateful firewalls that will drop unsolicited connection attempts.

Compared to IPv4, where UPnP and NAT-PMP have widespread support in routers, IPv6 does have protocols that let clients reconfigure the router with ephemeral firewall rules, but they are not widespread and support is very spotty.

So in practice, users with just IPv6 would have the exact same problems and would be even more likely to depend on STUN and ICE, because their firewalls likely won't support client-side hole-punching correctly.
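A toy model of the stateful-firewall behaviour described above, under heavy simplification: real firewalls also track protocol, timeouts and connection state, and the class, the `punch` helper, the addresses (from the 2001:db8::/32 documentation range) and the port are all invented for illustration.

```python
class StatefulFirewall:
    """Inbound packets pass only if they match an earlier outbound flow."""

    def __init__(self) -> None:
        self._flows = set()  # (local, remote) pairs that have sent outbound traffic

    def send(self, local: str, remote: str) -> None:
        # Outbound traffic is always allowed and opens a pinhole for replies.
        self._flows.add((local, remote))

    def accepts(self, remote: str, local: str) -> bool:
        # Inbound traffic is dropped unless the pinhole already exists.
        return (local, remote) in self._flows


ALICE = "[2001:db8::a]:3074"  # hypothetical endpoints, no NAT involved
BOB = "[2001:db8::b]:3074"


def punch(alice_fw: StatefulFirewall, bob_fw: StatefulFirewall) -> None:
    """Both peers send first, having learned each other's address out of band."""
    alice_fw.send(ALICE, BOB)
    bob_fw.send(BOB, ALICE)
```

Before `punch` runs, `alice_fw.accepts(BOB, ALICE)` is false even though both addresses are globally routable; afterwards traffic flows in both directions. The peers still need some rendezvous service to exchange addresses and coordinate the timing, which is the role STUN and ICE play.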

gundmc · 4y ago

I don't recall seeing this sort of post-mortem from a gaming provider before. Really cool to see! Kudos, Halo team and Microsoft!

srmn · 4y ago

Roblox did a very interesting and similar write up at the beginning of this year regarding a large outage they experienced at the end of 2021: https://blog.roblox.com/2022/01/roblox-return-to-service-10-...

wyldfire · 4y ago

So if I wanted to refer to a group of them, would they be called 'Masters Chief'?

seizethegdgap · 4y ago

I don't believe so. The full E-9 "title" in the US Navy is Master Chief Petty Officer, so the plural would be Master Chief Petty Officers.

xeromal · 4y ago

Wow, that was a fun read! I don't envy the people who had to stare at wireshark logs for 3 days though. Oof.

BaconPackets · 4y ago

The root cause is REALLY surprising. If it's really an unrelated change to the NAT/STUN relay server, it means that there was a pretty broad lack of a change management framework.
