Roblox October Outage Postmortem

(blog.roblox.com)

687 points · kbuck · 4y ago · 310 comments

151 comments · 35 top-level

erwincoumans · 4y ago · 19 in thread

>> We are working to move to multiple availability zones and data centers.

Surprised it was a single availability zone, without redundancy. Having multiple fully independent zones seems more reliable and failsafe.

abarringer · 4y ago

Was on a call with a bank VP that had moved to AWS. Asked how it was going. Said it was going great after six months but just learning about availability zones so they were going to have to rework a bunch of things.

Astonishing how our important infrastructure is moved to AWS with zero knowledge of how AWS works.

foobarian · 4y ago

> Surprised it was a single availability zone, without redundancy. Having multiple fully independent zones seems more reliable and failsafe.

It's also a lot more expensive. Probably order of magnitude more expensive than the cost of a 1 day outage

sam0x17 · 4y ago

Most startups I've worked at literally have a script to deploy their whole setup to a new region when desired. Then you just need latency-based routing running on top of it to ensure people are processed in the closest region to them. Really not expensive. You can do this with under $200/month in terms of complexity and the bandwidth + database costs are going to be roughly the same as they normally are because you're splitting your load between regions. Now if you stupidly just duplicate your current infrastructure entirely, yes it would be expensive because you'd be massively overpaying on DB.

In theory the only additional cost should be the latency-based routing itself, which is $50/month. Other than that, you'll probably save money if you choose the right regions.

outworlder · 4y ago

> It's also a lot more expensive. Probably order of magnitude more expensive than the cost of a 1 day outage

Not sure I agree. Yes, network costs are higher, but your overall costs may not be depending on how you architect. Independent services across AZs? Sure. You'll have multiples of your current costs. Deploying your clusters spanning AZs? Not that much - you'll pay for AZ traffic though.

bradly · 4y ago

Yes. If you are running in two zones in the hopes that you will be up if one goes down, you need to be handling less than 50% load in each zone. If you can scale up fast enough for your use case, great. But when a zone goes down and everyone is trying to launch in the zone still up, there may not be instances available for you at that time. Our site had a billion in revenue or something based on a single day, so for us it was worth the cost, but it's not easy (or at least it wasn't at the time).
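
The headroom argument generalizes beyond two zones. A minimal sketch, assuming load is spread evenly across zones and that a failed zone's traffic is redistributed evenly among the survivors (both simplifications; the function name is illustrative):

```python
def max_safe_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest steady-state utilization per zone that still lets the
    surviving zones absorb the full load, assuming even distribution."""
    if zones <= zones_lost:
        raise ValueError("need at least one surviving zone")
    return (zones - zones_lost) / zones


for n in (2, 3, 4):
    print(f"{n} zones: run each at <= {max_safe_utilization(n):.0%}")
```

With two zones the ceiling is 50%, which is the figure above; adding zones raises the ceiling, so less capacity sits idle per zone.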

Hamuko · 4y ago

How expensive? Remember that the Roblox Corporation does about a billion dollars in revenue per year and takes about 50% of all revenue developers generate on their platform.

johnmarcus · 4y ago

Multi-AZ is free at Amazon. Having things split among 3 AZs costs no more than having them in a single AZ.

Multi-Region is a different story.

kreeben · 4y ago

>> Having multiple fully independent zones seems more reliable

I don't think these independent zones exist. See AWS's recent outages, where east cripples west and vice versa.

Karrot_Kream · 4y ago

Availability Zones aren't the same thing as regions. AWS regions have multiple Availability Zones. Individual availability zones publish lower reliability SLAs, so you need to load balance across multiple independent availability zones in a region to reach higher reliability. Per-AZ SLAs are discussed in more detail here [1].

(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)

[1]: https://aws.amazon.com/compute/sla/
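
A rough illustration of why spreading across AZs raises the achievable availability. It assumes AZ failures are statistically independent (an idealization, since correlated failures do happen) and uses a made-up per-AZ figure, not a published SLA number:

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    return 1.0 - (1.0 - per_zone) ** zones


per_zone = 0.99  # hypothetical per-AZ availability, not an AWS figure
for n in (1, 2, 3):
    print(n, f"{combined_availability(per_zone, n):.6f}")
```

Under these assumptions each extra zone multiplies the chance of total unavailability by the per-zone failure probability, which is why the regional figure can be much higher than the per-zone one.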

count · 4y ago

That's not how they work. They exist, and work extremely well within their defined engineering / design goals. It's much more nuanced than 'everything works independently'.

Bluecobra · 4y ago

> I don't think these independent zones exist.

Wouldn't it be possible to create fully independent zones with multiple cloud providers, like AWS, GCP, Azure? This is assuming that your workloads don't rely on proprietary services from a given provider.

mbesto · 4y ago

There have been multiple discussions on HN about cloud vs not cloud and there is an endless number of opinions of "cloud is a waste blah blah".

This is exactly one of the reasons people go cloud. Introducing an additional AZ is a click of a button and some relatively trivial infrastructure as code scripting, even at this scale.

Running your own data center and AZ on the other hand requires a very tight relationship with your data center provider at global scale.

For a platform like Roblox where downtime equals money loss (i.e. every hour of the day people make purchases), then there is a real tangible benefit to using something like AWS. 72 hours downtime is A LOT, and we're talking potentially millions of dollars of real value lost and millions of potential in brand value lost. I'm not saying definitively they would save money (in this case profit impact) by going to AWS, but there is definitely a story to be had here.

treis · 4y ago

But it wasn't a hardware issue. It was a software one and that would have crossed AZ boundaries.

vorpalhex · 4y ago

I'm more impressed that it hasn't been an issue until now.

bob1029 · 4y ago

> Having multiple fully independent zones seems more reliable and failsafe.

This also introduces new modes of failure which did not exist before. There are no silver bullets for this problem.

rhizome · 4y ago

There are no silver bullets to any problem, but there are other ways of implementing services and architecture that can sidestep these things.

maxclark · 4y ago

Not surprised at all. Multi-AZ is a PITA. You'd be surprised how much 7fig+/month infra is single region/AZ.

mhitza · 4y ago

For example, parts of AWS itself. us-east-1 having issues? Looks like the AWS console has issues all over the world.

You constantly hear about multi zone, region, cloud. But in practice when things break you hear all these stories of them running in a single region+zone

hedwall · 4y ago

A guess would be that game servers are distributed across the globe but backend services are in one place. A common pattern in game companies.

regnull · 4y ago · 15 in thread

It's weird it took them so long to disable streaming. One of the first things you do in this case is roll back the last software and config updates, even innocent looking ones.

yashap · 4y ago

That’s what stood out to me too. Although they’d been slowly rolling it out for awhile, their last major rollout was quite close to the outage start:

> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%

Consul was clearly the culprit early on, and you just made a significant Consul-related infrastructure change, you’d think rolling that back would be one of the first things you’d try. One of the absolute first steps in any outage is “is there any recent change we could possibly see causing this? If so, try rolling it back.”

They’ve obviously got a lot of strong engineers there, and it’s easy to critique from the outside, but this certainly struck me as odd. Sounds like they never even tried “let’s try rolling back Consul-related changes”; it was more that, 50+ hours into a full outage, they’d done some deep profiling and discovered the streaming issue. But IMO root cause analysis is for later, “resolve ASAP” is the first response, and that often involves rollbacks.

I wonder if this actually hindered their response:

> Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.

i.e. earlier on, were there HashiCorp peeps saying “naw, we tested streaming very thoroughly, can’t be that”?

otterley · 4y ago

When you're at Roblox's scale, it is often difficult to know in advance whether you will have a lower MTTR by rolling back or fixing forward. If it takes you longer to resolve a problem by rolling back a significant change than by tweaking a configuration file, then rolling back is not the best action to take.

Also, multiple changes may have confounded the analysis. Adjusting the Consul configuration may have been one of many changes that happened in the recent past, and certainly changes in client load could have been a possible culprit.

notacoward · 4y ago

In a not-too-distant alternate universe, they made the rookie assumption that every change to every system is trivially reversible, only to find that it's not always true (especially for storage or storage-adjacent systems), and ended up making things worse. Naturally, people in alternate-universe HN bashed them for that too.

hughrr · 4y ago

As a fairly regular consul cluster admin for the last 6 years or so, though not at that scale, I can safely say that you generally have no idea if rolling back will work. I’ve experienced everything up to complete cluster collapses before. I spent an entire night blasting and reseeding a 200 node cluster once after a well tested forward migration went into a leadership battle it never resolved. Even if you test it beforehand, that’s no guarantee it’ll be alright on the night.

Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.

That applies to vault as well.

throwdbaaway · 4y ago

At first I thought it is a well-written post-mortem with proper root cause analysis. After reading it for the second time though, it doesn't sound like the root cause has been identified? At one point, they disabled streaming across the board, and the consul cluster started to become sort of stable. Is streaming to be blamed here? Why would streaming, an enhancement over the existing blocking query, which is read-only, end up causing "elevated write latency"? Why did some voter nodes encounter the boltdb freelist issue, while some other voter nodes didn't?

And there is still no satisfying explanation for this:

> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.

But I totally agree with you that the first thing they should have looked into was rolling back the 2 changes made to the traffic routing service the day before, as soon as they discovered that the consul cluster had become unhealthy.

londons_explore · 4y ago

"just roll back" gets risky when you roll back more than a few hours in many cases.

Frequently the feature you want to roll back now has other services depending on it, has already written data into the datastore that the old version of the code won't be able to parse, has already been released to customers in a way that will be a big PR disaster if it vanishes, etc.

Many teams only require developers to maintain rollback ability for a single release. Everything beyond that is just luck, and there's a good chance you're going to be manually cherry picking patches and having to understand the effects and side effects of tens of conflicting commits to get something that works.

Twirrim · 4y ago

The post indicates they'd been rolling it out for months, and that the feature went live "several months ago".

With the behaviour matching other types of degradation (hardware), it's entirely reasonable that it could have taken quite a while to recognise that software and configurations that had proven stable for several months, and that were still there working, weren't quite as stable as they seemed.

nightpool · 4y ago

Right, but it only went live on the DB that failed the day before. Obviously, hindsight is 20/20, but it's strange that the oversight didn't rate a mention in the postmortem.

atmosx · 4y ago

Some comments:

- The write up is amazing. There is a great level of detail.
- When they had the first indication of a problem, instead of checking whether the problem was the hardware (disk I/O, etc.) the team went full cattle/cloud: bring down the node, launch a new one. Apparently that cost them a few hours. We would probably have done the same but I wonder if there's a lesson there.
- The obvious thing to do was revert configs. It is very strange that it took so long to revert. After being down for hours and having no idea what gives, it's the reasonable thing to try.
- The problem was consul. But consul is a key component and Roblox seems to be running a fairly large infrastructure. The company's valuation is sky-high, so I assume the infra team is quite large. Consul is an open source project. Wouldn't it make sense, instead of relying on HashiCorp so heavily, to bring in or train ppl around consul internals at this point? (maybe not possible/feasible/optimal, just wondering)

Would be a nice touch to check if bbolt has the bug and possibly push a fix. That said, the post-mortem is state-of-the-art. Way better than anything we've seen from much much bigger companies.

fullsend · 4y ago

Honestly I would guess part of it is that streaming is supposed to be a performance increase. So during a performance related outage, it might be easy to overlook. Am I really going to turn off a feature that I think is actually helping the problem?

sidlls · 4y ago

If that feature was the one most recently deployed or updated? Yes, if possible. That could be a big if, though, right? Maybe rolling back such a change isn't trivial, or imposes other costs to returning to service that are more expensive than simply working through the problem.

captaincaveman · 4y ago

Well, it's a feature that affects the specific area where you're having issues (performance), so yes, it would be the right thing to start with.

brobinson · 4y ago

The htop screenshot was an immediate, appropriately-colored red flag for me: that much red (kernel time) on the CPU utilization bars for a system running etcd/consul is not right in my experience.

geoelkh · 4y ago

The post mortem is really well written but I had the same thoughts. They upgraded the machines' hardware before rolling back the latest config updates.

rkuykendall-com · 4y ago

Hindsight is 20-20

I shouldn't have drank that many

Hindsight is 20-20

Stop.

–

Little elevators are far too small for me

So I ride the big ones

It's not so fun unless you're OCD

And you like buttons

conorh · 4y ago · 14 in thread

Excellent write up. Reading a thorough, detailed and open postmortem like this makes me respect the company. They may have issues but it sounds like the type of company that (hopefully) does not blame, has open processes, and looks to improve - the type of company I'd want to work for!

ehsankia · 4y ago

> the type of company I'd want to work for!

I recommend watching the following:

https://www.youtube.com/watch?v=_gXlauRB1EQ

https://www.youtube.com/watch?v=vTMF6xEiAaY

ineedasername · 4y ago

The first video reveals a more general issue that is not specific to Roblox: child labor in the marketplace of monetized user generated content. There are plenty of under-18 YouTubers. It's not even just online content: these questions came up in the entertainment industry a long time ago, but in that industry at least some safeguards were put in place.

chrislusf · 4y ago

This youtuber needs to make a living.

Roblox is great in that it built up an ecosystem where people can contribute and get rewarded. It is a positive feedback loop.

Not like open source software, where the financial loop is broken. I am pretty sure the Bolt creator did not get anything from HashiCorp for his work.

sam0x17 · 4y ago

Too bad they exploit young game developers by taking a 75.5% cut of their earnings. Big yikes of a red flag for me. https://www.nme.com/news/gaming-news/roblox-is-allegedly-exp...

DerArzt · 4y ago

To add, there is a nice documentary here [1] which also has a followup [2] that shows even more of the issue at hand. Kids making games and only getting 24.5% of the profit is one thing, but everything else that Roblox does is much worse.

[1] https://youtu.be/_gXlauRB1EQ

[2] https://youtu.be/vTMF6xEiAaY

badcc · 4y ago

This % includes the cost of all game server hosting, databases, memory stores, etc., even with millions of concurrents, plus app store fees, etc. All included in that number. The developer gets effectively pure profit for the sole cost of programming/designing a great game. Taught me how to program, & changed my entire future. Disclosure: My game is one of the most popular on the platform.

tptacek · 4y ago

Again, as across-thread: this is a tangent unrelated to the actual story, which is interesting for reasons having nothing at all to do with Roblox (I'll never use it, but operating HashiStack at this scale is intensely relevant to me). We need to be careful with tangents like this, because they're easier for people to comment on than the vagaries of Raft and Go synchronization primitives, and --- as you can see here --- they quickly drown out the on-topic comments.

nostrebored · 4y ago

The idea that these children would otherwise be making their own games is knowingly, generally wrong.

munk-a · 4y ago

No matter what the cut is, I think there are some legitimate social questions to ask about whether we want young people to be potentially exposed to economic pressure to earn or whether we'd rather push back aggressively against youth monetization to preserve a childhood where, ideally, children get to play.

I know there are lots of child actors and plenty of household situations that make enjoying childhood difficult for many youths - but just because we're already bad at a thing doesn't mean we should let it get worse. Child labour laws were some of the first steps of regulation in the industrial revolution because inflation works in such a way where opening the door up to child labour can put significant financial pressure on families that choose not to participate when demand adjusts to that participation being normal.

perihelions · 4y ago

More egregiously, they're (per your article) manipulating kids into buying real ads for their creations, with the false promise that "you could get rich -- if you pay us".

>"As there are no discoverability tools, users are only able to see a tiny selection of the millions of experiences available. One of the ways to boost discoverability is to pay to advertise on the platform using its virtual currency, Robux."

(Note that this "virtual" currency is real money, bidirectionally exchangeable with USD).

The sales pitch is "get rich fast":

>"Under the platform’s ‘Create’ tab, it sells the idea that users can “make anything”, “reach millions of players”, and “earn serious cash”, while its official tutorials and support website both “assume” they are looking for help with monetisation."

I agree that this doesn't really look like a labor issue. That's a distracting and contentious tangent; it's easier to just label this as a kind of consumer exploitation. (Most of the people involved aren't earning money, but they are all paying money.) It's a scam either way.

flippinburgers · 4y ago

I am naive about the reality on the ground when it comes to this issue, but doesn't this hinge on transparency? If they can show they are covering costs + the going market rate, which seems to be 30% (at best), then wouldn't it be reasonable? So is a 45% cut for infra ok or not seems to be the question.

Aunche · 4y ago

By that logic, Dreams is "exploiting" developers by taking a 100% cut of their earnings. Making money isn't the point of either of these platforms.

breakfastduck · 4y ago

Or how about giving a free platform to get into games development for young people that otherwise wouldn't have become interested.

loceng · 4y ago

The solution is creating a competing platform and offering a better cut. You up for the task?

Edit to add: lazy people downvote.

AaronFriel · 4y ago · 9 in thread

This outage has it all: distributed systems, non-uniform memory access contention (aka "you wanted scale up? how about instead we make your CPU a distributed system that you have to reason about?"), a defect in a B+tree-based data store, malfunctioning heartbeats affecting scheduling, wow wow wow.

Big props to the on-calls during this.

tacLog · 4y ago

> Big props to the on-calls during this.

Kind of curious about this. I know this is probably company specific but how do outages get handled at large orgs? Would the on-calls have been called in first then called in the rest of the relevant team?

Is there a leadership structure that takes command of the incident to make big coordinated decisions to manage the risk of different approaches?

Would this have represented crunch time to all the relevant people or would this be a core team with other people helping as needed?

yazaddaruvala · 4y ago

Typically:

Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it. Typically, at any reasonable team, everyone who chipped in nights gets to take off equivalent days and sprint tasks are all punted.

Yes. Not just to manage risks, but also to get quick prioritization from all teams at the company. "You need legal? Ok, meet ..." "You need string translations? Ok escalated to ..." "You need financial approval? Ok, looped in ..."

Kinda. Definitely would have represented crunch time, but a very very demoralizing crunch time. Managers also try to insulate most of their teams from it, but everyone pays attention anyways. Keep in mind these typically only last an hour or 3, at most they last a few days, so there is no "core team" other than the leadership structure from your question 2. Otherwise, it is very much "people/teams helping as needed".

quirino · 4y ago

Google has its Site Reliability Engineering book, which might answer some of your questions:

https://sre.google/sre-book/table-of-contents/

WaxProlix · 4y ago

Oncalls get paged first and then escalate. As they assess impact to other teams and orgs, they usually post their tickets to a shared space. Once multiple team/org impact is determined, leadership and relevant ops groups (networking, eg) get pulled in to a call. A single ticket gets designated the Master Ticket for the Event, and oncalls dump diagnostic info there. Root cause is found (hopefully), affected teams work to mitigate while RC team rushes to fix.

The largest of these calls I've seen was well into the hundreds of sw engineers, managers, network engineers, etc.

NordSteve · 4y ago

For the on call system that I ran until recently, there are about a dozen on call teams responsible for parts of the service. Each team has a primary and backup engineer, generally on a 7x24 shift that lasts a week. Most weeks it's not very busy.

Working with them during an incident is an on call comms lead, who handles outside-of-team comms (protecting the engineers), and an engineering lead (who is a consultant, advisor, and can approve certain actions).

For big incidents, an exec incident manager is involved. They primarily help with getting resources from other teams.

hamburglar · 4y ago

Where I work there is an incident team that handles things like creating a master ticket, starting a call bridge, getting the on-calls into the bridge, keeping track of what teams (and who from those teams) have been brought in, manages the call (keeping chatter down and focused when there are 100 people in a call is important), periodically comments on the master ticket with status and a list of impacted teams, marks down milestone times like when the impact started, when it was detected, mitigated, root cause found, etc. This person is also responsible for stuff like when they hear you want to engage team X, they'll go track down an on-call for you, or summarizing known impact for the outward-facing status pages, etc. They also create the postmortem template and follow up with all involved teams to get them to contribute their detailed impact statement there.

Edit: sometimes when it's a really gnarly problem and there are huge numbers of people on the call, the set of people who are actively trying to come up with mitigations and need to just be able to talk freely at each other will break off into a less noisy call and leave a representative to relay status to the main call.

sciurus · 4y ago

Approaches vary company-to-company, but https://response.pagerduty.com/ is a good resource for understanding how it often looks.

thethethethe · 4y ago

At Google an oncaller typically gets paged, triages the incident and, if it's bad, pages other oncallers and/or team members for help. For more serious incidents, people take on different roles like communications lead, incident commander, etc.

During the worst outage I was involved in, basically the entire org, including all of the most senior engineers, worked around the clock for two weeks to fix everything.

bezospen15 · 4y ago

The on calls ARE the relevant team lol. You're doing it wrong otherwise

jandrese · 4y ago · 9 in thread

The BoltDB issue seems like straight up bad design. Needing a freelist is fine, needing to sync the entire freelist to disk after every append is pants on head.

benbjohnson · 4y ago

BoltDB author here. Yes, it is a bad design. The project was never intended to go to production but rather it was a port of LMDB so I could understand the internals. I simplified the freelist handling since it was a toy project. At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months so we swapped out for Bolt. And alas, my poor design stuck around.

LMDB uses a regular bucket for the freelist whereas Bolt simply saved the list as an array. It simplified the logic quite a bit and generally didn't cause a problem for most use cases. It only became an issue when someone wrote a ton of data and then deleted it and never used it again. Roblox reported having 4GB of free pages which translated into a giant array of 4-byte page numbers.
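
To put rough numbers on that, a back-of-the-envelope sketch. The 4 KiB page size is an assumption (a common default, not stated above), and the width of each entry is left as a parameter:

```python
def freelist_bytes(free_bytes: int, page_size: int = 4096, id_bytes: int = 4) -> int:
    """Size of a flat array holding one page number per free page."""
    free_pages = free_bytes // page_size
    return free_pages * id_bytes


four_gib = 4 * 1024**3
size = freelist_bytes(four_gib)
print(f"{four_gib // 4096:,} free pages -> {size / 1024**2:.0f} MiB array")
```

If the whole array is synced to disk on every append, as the parent comment describes, then under these assumptions each small write carries megabytes of extra I/O.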

otterley · 4y ago

I, for one, appreciate you owning this. It takes humility and strength of character to admit one's errors. And Heaven knows we all make them, large and small.

tacLog · 4y ago

> BoltDB author here.

How does this happen so often? It's awesome to get the author's take on things. Also thank you for explaining and owning it. Were you part of this incident response?

erthink · 4y ago

These issues are partially solved in libmdbx (a deeply revised and extended descendant of LMDB).

So users affected by BoltDB and LMDB may switch to libmdbx, as Erigon (an Ethereum implementation) did a year ago: https://github.com/ledgerwatch/erigon/wiki/Criteria-for-tran...

For now this is (relatively) easy, since bindings for Go, Rust, NodeJS/Deno, etc. are available and the API is mostly the same in general.

---

The ideas that MDBX uses to solve these issues are simple: zero-cost micro-defragmentation, coalescing short GC/freelist records, chunking too long GC/freelist records, LIFO for GC/freelist reclaiming, etc.

Many of the ideas mentioned seem simple to implement in BoltDB. However, the complete solution is not documented and too complicated (in accordance with the traditions inherited from LMDB ;)

sjg007 · 4y ago

> The project was never intended to go to production

coldcode · 4y ago

Having written a commercial memory allocator a quarter century ago, I remember dealing with freelists, and decided they were too much of a pain to manage if fragmentation got out of control. I chose a different architecture that was less fragile under load. Interesting that this can still be an issue even on today's hardware.

It's also interesting how much a tiny detail can derail a huge organization. My former employer lost all services worldwide because of a single incorrect routing in a DNS server.

buchanmilne · 4y ago

> At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months

Is there an issue/bug for this somewhere?

dottedmag · 4y ago

By any chance could you (or bbolt folks?) update README to include this information?

chrislusf · 4y ago

Your answer should be voted to the top! :)

OSS contributors are rarely noticed or appreciated. Did HashiCorp ever sponsor you or share any revenue with you? The OSS ecosystem is broken.

Twirrim · 4y ago · 7 in thread

"We enjoyed seeing some of our most dedicated players figure out our DNS steering scheme and start exchanging this information on Twitter so that they could get “early” access as we brought the service back up."

Why do I have a feeling "enjoyed" wasn't really enjoyed so much as "WTF", followed by "oh shit..." at the thought that their main way to balance load may have gone out the window.

deathanatos · 4y ago

At their scale, it was probably an insignificant minority. I read that as nothing more than a wink and nod of "we see what you did ;)" ; which I appreciate. Some companies would have a fit and go nuclear on people for that, for no particular reason. As long as it is an insignificant minority, it doesn't matter, and ideally it's teenagers learning how something works on the side, and that helped grow some future hacker (in the HN sense) somewhere.

DaiPlusPlus · 4y ago

> Some companies would have a fit and go nuclear on people for that, for no particular reason

Sometimes it's even the Missouri state governor doing that too.

Symbiote · 4y ago

It's difficult to know how quickly word could have spread, but I enjoy knowing a few 11 year olds learned something about the Internet in order to play a game an hour early.

Twirrim · 4y ago

With social media etc, I can see it spreading really fast. That would be my bigger fear trying to get a service back up from a very long outage like that.

fragmede · 4y ago

The intentionally slow bringup is to handle the thundering herd of having the system come back online to 100% at once. If a couple hundred users (a small percentage of the userbase) here or there are able to jump the queue, it's no real big deal.

As far as players figuring out the DNS steering scheme: the company has no responsibility to keep a non-advertised backend up. If it was a problem, disallow new connections to it and remove it from the main pool.
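
A slow bring-up of this kind can be sketched as a percentage gate that is raised in steps. This is a hypothetical illustration, not Roblox's mechanism (the postmortem describes DNS steering); the function name and the hashing scheme are assumptions:

```python
import hashlib


def admitted(user_id: str, percent_open: int) -> bool:
    """Deterministically admit roughly `percent_open`% of users.

    Hashing the user ID gives each user a stable bucket in 0..99, so a
    user admitted at 10% stays admitted as the gate widens.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent_open


users = [f"user-{i}" for i in range(10_000)]
for pct in (0, 10, 50, 100):
    count = sum(admitted(u, pct) for u in users)
    print(f"gate at {pct:3d}%: {count} of {len(users)} admitted")
```

Because the gate is enforced on the server side in this sketch, a few users finding a side door does not change how much load is let in overall.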

selectodude · 4y ago

I think it mostly consisted of "KEEP PRESSING REFRESH AND YOU'LL GET LET IN AT SOME POINT" so there wasn't any additional unplanned load for Roblox.

buryat · 4y ago

"Enjoyed" as in having dedicated fans who would jump through hoops to get access.

kjw · 4y ago · 5 in thread

I would not have guessed Roblox was on-prem with such little redundancy. Later in the post, they address the obvious “why not public cloud?” question. They argue that running their own hardware gives them advantages in cost and performance. But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up. It will be interesting to see how well this architectural decision ages if they keep scaling to their ambitions. I wonder about their ability to recruit the level of talent required to run a service at this scale.

otterley4y ago

Since the issue's root cause was a pathological database software issue, Roblox would have suffered the same issue in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with other distributed databases than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties.

Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns encounter previously unforeseen problems once you get into that far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecking on critical sections in software. This shit is hard, man.

baskethead4y ago

This is not true, if they handled the rollout properly. Companies like Uber have two entirely different data centers and during outages they fail over to either datacenter.

Everything is duplicated, which is potentially wasteful but ensures complete redundancy, and it’s an insurance policy. If you roll out, you roll out to each datacenter separately. So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.
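
A rough sketch of that datacenter-by-datacenter rollout, assuming hypothetical `deploy`, `healthy`, and `rollback` callables supplied by whatever tooling is in use. It says nothing about how Uber or Roblox actually deploy.

```python
from typing import Callable, List


def staged_rollout(
    datacenters: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy to one datacenter at a time and stop at the first unhealthy one.

    Returns the datacenters left running the new version.
    """
    done: List[str] = []
    for dc in datacenters:
        deploy(dc)
        # In practice this check would follow a bake period (the comment suggests a day).
        if not healthy(dc):
            rollback(dc)
            break
        done.append(dc)
    return done


if __name__ == "__main__":
    log = []
    result = staged_rollout(
        ["dc-a", "dc-b"],
        deploy=lambda dc: log.append(("deploy", dc)),
        healthy=lambda dc: dc != "dc-a",  # pretend the change misbehaves in dc-a
        rollback=lambda dc: log.append(("rollback", dc)),
    )
    print(result, log)
```

The second datacenter is never touched when the first one fails its check, which is the whole point of the insurance policy.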


noahtallen4y ago

I think the public cloud is a good choice for startups, teams, and projects which don't have infrastructure experience. Plenty of companies still have their own infrastructure expertise and roll their own CDNs, as an example.

Not only can one save a significant amount of money, it can also be simpler to troubleshoot and resolve issues when you have a simpler backend tech stack. Perhaps that doesn’t apply in this case, but there are plenty of use cases which don’t need a hundred micro services on AWS, none of which anyone fully understands.

nomel4y ago

> But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up

You're assuming the average profits lost are more than the average cost of doing things differently, which, according to their statement, is not the case.

dylan6044y ago

>I wonder about their ability to recruit the level of talent required to run a service at this scale.

According to this user's comments, it doesn't look like it'll be that tough for them:

https://news.ycombinator.com/item?id=30014748

NightMKoder4y ago· 4 in thread

Admittedly this is armchair architecture talk, but it seems like either Consul or Roblox's use of Consul is falling into a CAP-trap: they are using a CP system when what they need is an eventually-consistent AP system. Granted, the use of Consul seems heterogeneous, but it seems like the main root cause was service discovery. And service discovery loves stale data.

Service discovery largely doesn't change that often. Especially in an outage where a lot of things that churn service discovery are disabled (e.g. deploys), returning stale responses should work fine. There's a reason DNS works this way - it prioritizes having any response, even if stale, since most DNS entries don't change that frequently. That said, DNS is not a great service discovery mechanism for other reasons. Not sure if there's an off-the-shelf solution that relies more on fast invalidation rather than distributed consistent stores.

throwdbaaway4y ago

Good catch. If Roblox only uses consul for service discovery, things should continue to work, just slowly degrade over the hours/days. There should be at least one consul agent running on each physical host, and this consul agent has a cache and can continue to provide service discovery functionality with stale data.

Dissecting this paragraph from the post-mortem...

> When a Roblox service wants to talk to another service, it relies on Consul to have up-to-date knowledge of the location of the service it wants to talk to.

OK.

> However, if Consul is unhealthy, servers struggle to connect.

Why? The local "client-side" consul agents running on each host should be the authoritative source for service discovery, not the "server-side" consul agents running on the 5 voter nodes.

> Furthermore, Nomad and Vault rely on Consul, so when Consul is unhealthy, the system cannot schedule new containers or retrieve production secrets used for authentication.

Now that's one very bad setup, similar to deploying all services in a single k8s cluster.

NightMKoder4y ago

Didn’t realize consul had that. Seems like the right approach - though I wonder why Roblox wasn’t using it.

Fwiw I believe kubernetes did this right - if you shoot the entire set of leaders, nothing really happens. Yes if containers die they aren’t restarted and things that create new pods (eg cron jobs) won’t run, but you don’t immediately lose cluster connectivity or the (built-in) service discovery. Not to say you can survive az failures or the like - or that kubernetes upgrades are easy/fun.

And don’t run dev stuff in your prod kube cluster. Just…don’t.

tptacek4y ago

Can you say more about service discovery "loving stale data"? Loves in the sense of "generates a lot of it; is constantly plagued by it"?

boulos4y ago

Their comment implies "are totally fine with stale data".

Their argument is that the membership set for a service (especially on-prem) doesn't change all that frequently, and even if it's out of date, it's likely that most of the endpoints are still actually servicing the thing you were looking for. That plus client retries and you're often pretty good.
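
As a rough illustration of the argument, here is a sketch of a resolver that keeps serving its last known endpoint list when the discovery backend is down, paired with client retries. `lookup` and `send` are hypothetical stand-ins, not Consul APIs.

```python
import random
from typing import Callable, Dict, List


class StaleTolerantResolver:
    """Caches the last known endpoints and serves them when lookups fail."""

    def __init__(self, lookup: Callable[[str], List[str]]):
        self._lookup = lookup  # hypothetical call to the discovery backend
        self._cache: Dict[str, List[str]] = {}

    def resolve(self, service: str) -> List[str]:
        try:
            endpoints = self._lookup(service)
            self._cache[service] = endpoints
            return endpoints
        except Exception:
            # Backend unhealthy: fall back to stale data if we have any.
            if service in self._cache:
                return self._cache[service]
            raise


def call_with_retries(
    endpoints: List[str], send: Callable[[str], str], attempts: int = 3
) -> str:
    """Try up to `attempts` distinct endpoints; stale entries may point at dead hosts."""
    last_error = None
    for endpoint in random.sample(endpoints, min(attempts, len(endpoints))):
        try:
            return send(endpoint)
        except ConnectionError as err:
            last_error = err
    raise ConnectionError("all attempted endpoints failed") from last_error
```

If most of the cached endpoints are still serving, a retry or two hides the stale ones from the caller.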


stuff4ben4y ago· 4 in thread

Sounds like they need to switch to Kubernetes?

I kid of course. One of the best post-mortems I've seen. I'm sure there are K8s horror stories out there of etcd giving up the ghost in a similar fashion.

schoolornot4y ago

The one thing you can say about Nomad is that it's generally incredibly scalable compared to Kubernetes. At 1000+ nodes over multiple datacenters, things in Kube seem to break down.

tapoxi4y ago

Do they still? GKE supports 15,000 nodes per cluster.

spydum4y ago

you joke, but it's precisely this:

>Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.

which gives me goosebumps whenever I hear people proselytizing everything run on Kubernetes. At some point, it makes good sense to keep capabilities isolated from each other, especially when those functions are key to keeping the lights on. Mapping out system dependencies (either systems, software components, etc) is really the soft underbelly of most tech stacks.

YATA04y ago

>Sounds like they need to switch to Kubernetes?

Hah! Good one!

ctvo4y ago· 3 in thread

It's a spicy read. Really could have happened to anyone. All very reasonable assumptions and steps taken. You could argue they could have more thoroughly load tested Consul, but doubtful any of us would have done more due diligence than they did with the slow rollout of streaming support.

(Ignoring the points around observability dependencies on the system that went down causing the failure to be extended)

yashap4y ago

The main mistake IMO is that, the day before the outage, they made a significant Consul-related infra change. Then they have this massive outage, where Consul is clearly the root cause, but nobody ever tries rolling that recent change back? That’s weird.

I went into more detail here: https://news.ycombinator.com/item?id=30015826

The outage occurring could certainly happen to anyone, but it taking 72 hours to resolve seems like a pretty fundamental SRE mistake. It’s also strange that “try rollbacks of changes related to the affected system” isn’t even acknowledged as a learning in their learnings/action items section.

faitswulff4y ago

It's possible they deal with so much load that they considered a day's worth of traffic to be sufficient load testing:

> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.

And a short note later on how much load their caching system sees:

> These databases were unaffected by the outage, but the caching system, which regularly handles 1B requests-per-second across its multiple layers during regular system operation, was unhealthy.

tptacek4y ago

That doesn't sound accurate. Wasn't the major change they ended up rolling back Consul streaming, which they'd enabled months before, and had been slowly rolling out?


wizwit9994y ago· 3 in thread

> On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%.

Seems like the smoking gun, this should have been identified and rolled back much earlier.

Karrot_Kream4y ago

If reading a postmortem makes the smoking gun obvious, then the postmortem is doing its job. Don't mistake the amount of investigation that goes into a postmortem for the available information and mental headspace during an outage.

wizwit9994y ago

I've been in my fair share of incidents so I'm aware of how they work. But they knew it was an issue related to Consul within hours. It shouldn't take more than two days before they check for recent deployments made to Consul.


yuliyp4y ago

It's obvious when it's pointed out in an article like this. It's less clear when it's one of many changes that could have been happening in a day, and it was an operation that was considered "safe" given that it had been done multiple times for other services in the preceding months.

willcipriano4y ago· 3 in thread

I have this little idea I think about called the "status update chain". When I worked in small organizations and we had issues, the status update chain looked like this: ceo-->me. As the organizations got larger, the chain got longer: first it was ceo-->manager-->me, then ceo-->director-->manager-->me, and so on. I wonder how long the status update chains are at companies like this? How long does a status update take to make it end to end?

Rygian4y ago

If the situation is serious enough, you'll have several layers sitting together at the status update meetings to hear it straight from the dog's mouth.

tacLog4y ago

I am sorry, I didn't have enough context to understand what you're saying.

When you say: status update chain: ceo --> me. What information is flowing from the CEO to you? or is it the other way around?

willcipriano4y ago

Both directions, he is asking "What is going on" and I am telling him. As the org gets larger the request to know what is going on passes down the chain and the reply passes back up.


johnmarcus4y ago· 2 in thread

aaaalllllllll the way down at the bottom is this gem: >Some core Roblox services are using Consul’s KV store directly as a convenient place to store data, even though we have other storage systems that are likely more appropriate.

Yeah, don't use consul as redis, they are not the same.

stuff4ben4y ago

But you can... which is what some engineers were thinking. In my experience they do this because:

A) they're afraid to ask for permission and would rather ask for forgiveness

B) management refused to provision extra infra to support the engineers' need, but they needed to do this "one thing" anyways

C) security was lax and permissions were wide open so people just decided to take advantage of it to test a thing that then became a feature and so they kept it but "put it on the backlog" to refactor to something better later

aprdm4y ago

Yes, this and having such a big consul cluster where the recommendation is to have more smaller clusters.

That said, could've happened to anyone and it was a great write up.

ryanworl4y ago· 2 in thread

It seems that Consul does not have the ability to use the newer hashmap implementation of freelist that Alibaba implemented for etcd. I cannot find any reference to setting this option in Consul's configuration.

Unfortunate, given it has been around for a while.

https://www.alibabacloud.com/blog/594750

throwdbaaway4y ago

I think they just made the switch to the fork that does contain the freelist improvement in https://github.com/hashicorp/consul/pull/11720

Took a major incident to swallow your pride? (consul, powered by go.etcd.io/bbolt)

ryanworl4y ago

Is this option enabled by default? I don't think it is and I don't think they actually set it manually anywhere.

EDIT: I think we're talking about two different options. I meant the ability to leave sync turned on but change the data structure.


kalev4y ago· 2 in thread

Slightly offtopic; “the team decided to replace all the nodes in the Consul cluster with new, more powerful machines”. How do teams usually do this quickly? Is it a combination of Terraform to provision servers and something like Ansible to install and configure software on it?

stingraycharles4y ago

Totally depends on how “disciplined” the team’s DevOps practices are. In theory it should be as easy as updating a config parameter as you say, but my experience tells me that it’s sometimes not the case.

Especially with these kind of fundamental, core services such as Consul provides, it’s not unheard of to have templates with static machine allocations (as opposed to everything in a single auto-scaling group). It’s a bit of a shortcut, but it’s often a bit hairy to implement these services using true auto-scaling.

Having said all this, doing these types of migrations when things are already completely broken / on fire makes things a lot easier: you don’t care about downtime. So then it can be as simple as restarting all instances using a new instance type, downtime be damned.

oars4y ago

If I wasn't using AWS I would have no idea how to do this.

Quantumhunk4y ago· 2 in thread

What I learned from this is that the issue was partly caused by improper use of Go channels and of the open source product BoltDB.

ekimekim4y ago

IMO looking at the root causes here isn't that helpful. Software is complicated and there will always be some unknown bottleneck or bug lurking to knock you over on a bad day. The important lessons here are about:

* How their system architecture made them particularly vulnerable to this kind of issue

* Their actions to diagnose and attempt to mitigate the issue

* The whole later part about effectively cold-starting their entire infrastructure, all while millions of users were banging on their metaphorical door to start using the service again.

CyanLite24y ago

That and going all-in on Hashicorp.

londons_explore4y ago· 2 in thread

I think this outage was made worse by them not being properly in a big cloud provider.

In a cloud provider, having a few people working simultaneously on spinning up instances with different potential fixes, running different tests, and then directing all traffic to the first one that works properly is a viable path to a solution.

When you have your own hardware, you can really only try one thing at a time.

KronisLV4y ago

> When you have your own hardware, you can really only try one thing at a time.

How so? What would prevent you from hiring 5-10 people for Ops heavy stuff and getting a bit more hardware resources and doing those things in staging environments with load tests and whatnot? I mean, isn't that how you should do things, regardless of where your infra and software is?

londons_explore4y ago

If you own your own hardware, for a given service you probably have enough hardware for the production workload, plus maybe 50% more for dev, test, staging, experiments, etc. All those other environments will probably be scaled down versions. Sure, they can be used in an emergency situation, but they can't withstand the full production load, and anyway they're likely on a separate physical hardware and network (usually you want good isolation between production and test environments).

If you use AWS, then you probably on average use the same day to day, but in an emergency you can spin up 5 versions of full production scale to test 5 things at once, and just edit a configfile to direct production traffic to any.

k8sToGo4y ago· 2 in thread

Does anyone know what tool this one is?

https://blog.roblox.com/wp-content/uploads/2021/11/3-perf-re...

Is it really perf?

ketanhwr4y ago

It's perf-report[0] which reads the output of a perf data file and displays the profile.

[0]: https://man7.org/linux/man-pages/man1/perf-report.1.html

doublerabbit4y ago

> /wp-content/uploads/2021/11/3-perf-report.png

It's perf.

867-53094y ago· 2 in thread

>for our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem

I don't understand this logic. are they basically saying that their servers are on average closer to the user than mainstream cloud infra? are they e.g. choosing to have N satellite servers around a city instead of N instances at one cloud provider location in the centre of the city? is it the sparseness of the servers that decreases the latency?

or is it more to do with avoiding the herd, i.e. less trafficky routes / beating the queues?

it's also unclear whether they use their own hardware on rented rackspace as that could potentially lower costs too

mike_d4y ago

Cloud providers are rarely in cities. Google's biggest region is in the middle of Iowa, Amazon's is in Virginia.

If you have a latency sensitive application (like multiplayer games) it makes sense to put a few servers in each of 100 locations rather than concentrate them in a half dozen cloud regions.

As they point out elsewhere, the cost of infrastructure directly impacts their ability to pay creators on the platform. Doing it yourself will always be cheaper, and they hired the smart people to make it happen.

InsomniacL4y ago

> it makes sense to put a few servers in each of 100 locations rather concentrate them in a half dozen cloud regions.

Large cloud providers have a backbone network with interconnects to many ISPs reducing the amount of Hops a client has to take across the internet.

> Doing it yourself will always be cheaper

Treating the Cloud as a traditional IaaS datacenter extension will be more expensive. Utilising PaaS, and only using resources that are needed and when they're needed, etc., is much cheaper.

statguy4y ago· 2 in thread

So the outage lasted 3 days and the postmortem took 3 months!

koshergweilo4y ago

Read the article: "It has been 2.5 months since the outage. What have we been up to? We used this time to learn as much as we could from the outage, to adjust engineering priorities based on what we learned, and to aggressively harden our systems. One of our Roblox values is Respect The Community, and while we could have issued a post sooner to explain what happened, we felt we owed it to you, our community, to make significant progress on improving the reliability of our systems before publishing."

They wanted to make sure everything was fixed before publishing

Operyl4y ago

They just got out of their busiest time of year, and taking the time to write an accurate post mortem with data gleaned afterwards seems sensible to me.

chainwax4y ago· 1 in thread

Love the "Note on Public Cloud", and their stance on owning and operating their own hardware in general. I know there has to be people thinking this could all be avoided/the blame could be passed if they used a public cloud solution. Directly addressing that and doubling down on your philosophies is a badass move, especially after a situation like this.

Neil444y ago

It's interesting, I don't see that being on cloud would have avoided or helped this situation much. They were able to ramp up their hardware very quickly - who knows where they got it that fast - and it actually made the problem worse, so being on cloud and having the ability to do that with keystrokes would not have helped. You could say they might be using a different set of components if they were on cloud which may not have suffered the same issues, but you can play the what-if game all day; it's not related to the pros/cons of public cloud.

ineedasername4y ago· 1 in thread

">circular dependencies in our observability stack"

This appears to be why the outage was extended, and was referenced elsewhere too. It's hard to diagnose something when part of the diagnostic tool kit is also malfunctioning.
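
One way to catch this kind of loop ahead of time is to write the dependencies down and search the graph for cycles. A minimal sketch, assuming a hand-maintained map in which an edge means "needs this to run or to be debugged"; the example graph is invented, not Roblox's actual topology.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {node: WHITE for node in deps}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for dep in deps.get(node, []):
            if color.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have closed a loop.
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Hypothetical graph: telemetry finds its targets through consul,
    # and debugging consul needs telemetry.
    graph = {
        "alerting": ["telemetry"],
        "telemetry": ["consul"],
        "consul": ["telemetry"],
    }
    print(find_cycle(graph))
```

Running a check like this against a dependency inventory would flag the monitoring-needs-the-thing-it-monitors loop before an outage does.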

phgn4y ago

Like the Facebook outage a few months ago, when their DNS being down prevented them from communicating internally.

fifticon4y ago· 1 in thread

For me as a roblox user/programmer, the most annoying part of this was, that their desktop development tools refused to run during this outage, because they insist on "phoning home" when you launch them.

It is annoying, because the tools actually run perfectly fine on a local desktop, once you are past the "mothership handshake". I spent that week reading roblox dev documentation instead.

southerntofu4y ago

Wow, that's a very bad trend we see emerging in those past years. Did you have any chance to investigate the request at play and whether you could impersonate it via DNS on your local network (or if on the opposite, TLS certificates were stapled into the app)?

Also, i'm curious about your experiences with Roblox. I've only heard about it from these HN threads (no i don't know a single person using it) so if you have feedback to share regarding how to program it and how it compares to a modern game engine/editor like Godot, i'm all ears. Also, if you know of a free-software "alternative" to Roblox ; i'm amazed we run proprietary software in the first place, worried when it doesn't run because it can't phone home, but i'm actually ashamed we end up *producing* (eg. developing) content with proprietary tools that these companies can take away from us any minute.

snwfog4y ago· 1 in thread

Is there any tutorial on how to get a perf report like the one shown in this screenshot? https://blog.roblox.com/wp-content/uploads/2021/11/3-perf-re...

TheDong4y ago

Yes. It's the default output for "perf report". I recommend reading this: https://www.brendangregg.com/perf.html

However, the short 2-line way to get that output is the following:

    perf record -F99 -g --pid $(pidof consul)
    # Wait a few seconds and hit ctrl-c
    perf report

You'll get similar output to what they show if you have consul running with a similar load :)

captaincaveman4y ago· 1 in thread

Sounds like they didn't check what had changed first, before starting to fix things with best guesses ... not saying I wouldn't do the same, but arguably lost them a lot of time.

k8sToGo4y ago

They were aware of the changes, but as they stated: it seemed to be working fine and, therefore, was ruled out early on as a potential problem.

sjtindell4y ago

Super interesting. A place where ipvs or ebpf rules per-host for the discovery of services seems much more resilient than this heavy reliance on a functional consul service. The team shared a great postmortem here. I know the feeling well of testing something like a full redeploy and seeing no improvement…easy to lose hope at that point. 70+ hours of a full outage, multiple failed attempts to restore, has got to result in a few grey hairs worth of stress. Well done to all the sre, frontline, support engineers, devs, and whoever else rolled up their sleeves and got after it. The lessons learned here could only have been learned in an infra this big.

zomglings4y ago

This is a great post-mortem - thank you to the Roblox engineering team for being this transparent about the issue and the process you took to fix it. It couldn't have been easy and it sounds like it was a beast to track down (under pressure no less). gg

alpb4y ago

I still don't understand how the elevated Consul latency ended up bringing the whole fleet to a halt, failing health checks and dropping user traffic. I guess use cases calling consul directly (e.g. service discovery) or indirectly (e.g. vault) could not tolerate having stale reads or sticking with what they've read? If anyone can shed light on this, I'd appreciate it.

elfchief4y ago

One thing I don't see mentioned -- why is the write load so high? Can anyone from Roblox say? (I have a specific reason for asking.)

fasteo4y ago

>>> The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.

That's impressive.

jeffrallen4y ago

Tldr: We made a single point of failure, then we made it super reliable, then the stuff it was doing to maintain itself made it slow itself down, then our single point of failure took down our service.

Would be interesting to compare this result to the classic paper on Tandem failures:

A. Thakur, R. K. Iyer, L. Young and I. Lee, "Analysis of failures in the Tandem NonStop-UX Operating System," Proceedings of Sixth International Symposium on Software Reliability Engineering. ISSRE'95, 1995, pp. 40-50, doi: 10.1109/ISSRE.1995.497642.

LaserToy4y ago

Just curious, does Roblox push engineers to learn the internals of the critical software they operate, or do they lean on vendors?

If vendors, it is reckless.

nanis4y ago

> 50th percentile

I would normally not call this out, but it is repeated so often in the text that it is jarring. Just call it "median" as it is everywhere else, please.

On the other hand, I must commend the author(s) for not using "based off of" :-)

Great write-up, otherwise.

qaq4y ago

Love NATS for not having to deal with service discovery at all.

tlynchpin4y ago

warning, completely pedantic pet peeve.

> Note all dates and time in this blog post are in Pacific Standard Time (PST).

But the incident was during PDT. Just use UTC or colloquial "Pacific time" or equiv and never be wrong!

My heart goes out to these people. I can imagine how much sustained terror they were feeling, stare hard and harder at your terminals and still nothing makes sense.
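
To make the one-hour difference concrete, a small sketch using fixed UTC offsets. The wall-clock time below is only an example value from late October 2021, when Pacific Daylight Time was in effect.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # Pacific Standard Time
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time

# Example wall-clock reading with no zone attached yet.
wall_clock = datetime(2021, 10, 28, 13, 37)

# The same reading interpreted under each label, converted to UTC.
as_pdt = wall_clock.replace(tzinfo=PDT).astimezone(timezone.utc)
as_pst = wall_clock.replace(tzinfo=PST).astimezone(timezone.utc)

print("read as PDT:", as_pdt.isoformat())
print("read as PST:", as_pst.isoformat())
print("difference: ", as_pst - as_pdt)
```

Anyone lining up the post's timestamps against logs kept in UTC would be off by that hour if they took the "PST" label literally.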

bradly4y ago

Hamuko4y ago

How expensive? Remember that the Roblox Corporation does about a billion dollars in revenue per year and takes about 50% of all revenue developers generate on their platform.

1 more reply

johnmarcus4y ago

Multi-AZ is free at Amazon. Having things split amongst 3 AZ's cost no more than having in a single AZ.

Multi-Region is a different story.

2 more replies

kreeben4y ago

>> Having multiple fully independent zones seems more reliable

I don't think these independent zones exist. See AWS's recent outages, where east cripples west and vice versa.

Karrot_Kream4y ago

(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)

[1]: https://aws.amazon.com/compute/sla/

3 more replies

count4y ago

That's not how they work. They exist, and work extremely well within their defined engineering / design goals. It's much more nuanced than 'everything works independently'.

1 more reply

Bluecobra4y ago

> I don't think these independent zones exist.

1 more reply

mbesto4y ago

There have been multiple discussions on HN about cloud vs not cloud and there are endless amount of opinions of "cloud is a waste blah blah".

This is exactly one of the reasons people go cloud. Introducing an additional AZ is a click of a button and some relatively trivial infrastructure as code scripting, even at this scale.

Running your own data center and AZ on the other hand requires a very tight relationship with your data center provider at global scale.

treis4y ago

But it wasn't a hardware issue. It was a software one and that would have crossed AZ boundaries.

1 more reply

vorpalhex4y ago

I'm more impressed that it hasn't been an issue until now.

bob10294y ago

> Having multiple fully independent zones seems more reliable and failsafe.

This also introduces new modes of failure which did not exist before. There are no silver bullets for this problem.

rhizome4y ago

There are no silver bullets to any problem, but there are other ways of implementing services and architecture that can sidestep these things.

maxclark4y ago

No surprised at all. Multi AZ is a PITA. You'd be surprised how many 7fig+/month infra is single region/az

mhitza4y ago

For example parts of AWS itself. us-east-1 having issues? Looks like aws console all over the world have issues.

You constantly hear about multi zone, region, cloud. But in practice when things break you hear all these stories of them running in a single region+zone

hedwall4y ago

A guess would be that game servers are distributed across the globe but backend services l are in one place. A common pattern in game companies.

regnull4y ago· 15 in thread

It's weird it took them so long to disable streaming. One of the first things you do in this case is roll back the last software and config updates, even innocent looking ones.

yashap4y ago

That’s what stood out to me too. Although they’d been slowly rolling it out for awhile, their last major rollout was quite close to the outage start:

I wonder if this actually hindered their response:

i.e. earlier on, were there HashiCorp peeps saying “naw, we tested streaming very thoroughly, can’t be that”?

otterley4y ago

3 more replies

notacoward4y ago

2 more replies

hughrr4y ago

Quite frankly relying on consul scares the shit out of me. There are so few guarantees and so many pitfalls and traps that I don’t sleep well. At this point I consider it a mortal risk.

That applies to vault as well.

2 more replies

throwdbaaway4y ago

And there is still no satisfying explanation for this:

> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.

londons_explore4y ago

"just roll back" gets risky when you roll back more than a few hours in many cases.

Twirrim4y ago

The post indicates they'd been rolling it out for months, and indicate the feature went live "several months ago".

nightpool4y ago

Right, but it only went live on the DB that failed the day before. Obviously, hindsight is 20/20, but it's strange that the oversight didn't rate a mention in the postmortem.

1 more reply

atmosx4y ago

Some comments:

Would be a nice touch to check if bbolt has the bug and possibly push a fix. That said, the post-mortem is state-of-art. Way better than anything we've seen from much much bigger companies.

fullsend4y ago

sidlls4y ago

captaincaveman4y ago

Well its a feature that affects the specific area where your having issues (performance), so yes would be the right thing to start with.

brobinson4y ago

The htop screenshot was an immediate, appropriately-colored red flag for me: that much red (kernel time) on the CPU utilization bars for a system running etcd/consul is not right in my experience.

geoelkh4y ago

The post mortem is really well written, but I had the same thoughts. They upgraded the machines' hardware before rolling back the latest config updates.

rkuykendall-com4y ago

Hindsight is 20-20

I shouldn't have drank that many

Hindsight is 20-20

Stop.

–

Little elevators are far too small for me

So I ride the big ones

It's not so fun unless you're OCD

And you like buttons

conorh4y ago· 14 in thread

ehsankia4y ago

> the type of company I'd want to work for!

I recommend watching the following:

https://www.youtube.com/watch?v=_gXlauRB1EQ

https://www.youtube.com/watch?v=vTMF6xEiAaY

ineedasername4y ago

chrislusf4y ago

This youtuber needs to make a living.

Roblox is great that it built up an ecosystem where people can contribute and get rewarded. It is a positive feedback loop.

Not like open source software, where the financial loop is broken. I am pretty sure the Bolt creator did not get anything from HashiCorp for his work.

sam0x174y ago

Too bad they exploit young game developers by taking a 75.5% cut of their earnings. Big yikes of a red flag for me. https://www.nme.com/news/gaming-news/roblox-is-allegedly-exp...

DerArzt4y ago

[1] https://youtu.be/_gXlauRB1EQ

[2] https://youtu.be/vTMF6xEiAaY

badcc4y ago

tptacek4y ago

nostrebored4y ago

The idea that these children would otherwise be making their own games is knowingly, generally wrong.

munk-a4y ago

perihelions4y ago

More egregiously, they're (per your article) manipulating kids into buying real ads for their creations, with the false promise that "you could get rich -- if you pay us".

(Note that this "virtual" currency is real money, bidirectionally exchangeable with USD).

The sales pitch is "get rich fast":

flippinburgers4y ago

Aunche4y ago

By that logic, Dreams is "exploiting" developers by taking a 100% cut of their earnings. Making money isn't the point of either of these platforms.

breakfastduck4y ago

Or how about giving a free platform to get into games development for young people that otherwise wouldn't have become interested.

loceng4y ago

The solution is creating a competing platform and offering a better cut. You up for the task?

Edit to add: lazy people downvote.

AaronFriel4y ago· 9 in thread

Big props to the on-calls during this.

tacLog4y ago

> Big props to the on-calls during this.

Is there a leadership structure that takes command of the incident to make big coordinated decisions and manage the risk of different approaches?

Would this have represented crunch time to all the relevant people or would this be a core team with other people helping as needed?

yazaddaruvala4y ago

Typically:

quirino4y ago

Google has its Site Reliability Engineering book, which might answer some of your questions:

https://sre.google/sre-book/table-of-contents/

WaxProlix4y ago

The largest of these calls I've seen was well into the hundreds of sw engineers, managers, network engineers, etc.

NordSteve4y ago

For big incidents, an exec incident manager is involved. They primarily help with getting resources from other teams.

hamburglar4y ago

sciurus4y ago

Approaches vary company-to-company, but https://response.pagerduty.com/ is a good resource for understanding how it often looks.

thethethethe4y ago

During the worst outage I was involved in, basically the entire org, including all of the most senior engineers, worked around the clock for two weeks to fix everything.

bezospen154y ago

The on calls ARE the relevant team lol. You're doing it wrong otherwise

jandrese4y ago· 9 in thread

The BoltDB issue seems like straight up bad design. Needing a freelist is fine, needing to sync the entire freelist to disk after every append is pants on head.
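
To put rough numbers on that complaint, here is a toy cost model. It is not BoltDB's actual code. It assumes each free page ID takes 8 bytes and that the whole freelist is rewritten on every commit; the function names and sizes are hypothetical.

```python
PAGE_ID_BYTES = 8  # assumption: each free page ID is stored as a 64-bit integer


def freelist_bytes_per_commit(free_pages: int) -> int:
    """Bytes written if the entire freelist is persisted on one commit."""
    return free_pages * PAGE_ID_BYTES


def total_freelist_bytes(free_pages: int, commits: int) -> int:
    """Total freelist bytes written over many commits, assuming the
    freelist stays roughly the same size between commits."""
    return freelist_bytes_per_commit(free_pages) * commits


if __name__ == "__main__":
    # The per-commit cost grows linearly with the number of free pages,
    # no matter how small the actual append was.
    for free_pages in (1_000, 1_000_000):
        per_commit = freelist_bytes_per_commit(free_pages)
        print(f"{free_pages:>9} free pages -> {per_commit:>9} bytes per commit")
```

Under these assumptions the write cost tracks the size of the freelist rather than the size of the change, which is the part that looks like a design problem.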

benbjohnson4y ago

otterley4y ago

I, for one, appreciate you owning this. It takes humility and strength of character to admit one's errors. And Heaven knows we all make them, large and small.

tacLog4y ago

> BoltDB author here.

How does this happen so often? It's awesome to get the author's take on things. Also thank you for explaining and owning it. Were you part of this incident response?

erthink4y ago

These issues are partially solved in libmdbx (a deeply revised and extended descendant of LMDB).

So affected BoltDB and LMDB users may switch to libmdbx, as Erigon (an Ethereum implementation) did a year ago: https://github.com/ledgerwatch/erigon/wiki/Criteria-for-tran...

For now this is (relatively) easy, since bindings for Go, Rust, NodeJS/Deno, etc. are available and the API is mostly the same in general.

---

Many of the ideas mentioned seem simple to implement in BoltDB. However, the complete solution is not documented and is too complicated (in accordance with the traditions inherited from LMDB ;)

sjg0074y ago

> The project was never intended to go to production

coldcode4y ago

It's also interesting how much a tiny detail can derail a huge organization. My former employer lost all services worldwide because of a single incorrect routing in a DNS server.

buchanmilne4y ago

> At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months

Is there an issue/bug for this somewhere?

dottedmag4y ago

By any chance could you (or bbolt folks?) update README to include this information?

chrislusf4y ago

Your answer should be voted to the top! :)

OSS contributors are rarely noticed or appreciated. Did HashiCorp ever sponsor you or share any revenue with you? The OSS ecosystem is broken.

Twirrim4y ago· 7 in thread

Why do I have a feeling "enjoyed" wasn't really enjoyed so much as "WTF", followed by "oh shit..." at the thought that their main way to balance load may have gone out the window.

deathanatos4y ago

DaiPlusPlus4y ago

> Some companies would have a fit and go nuclear on people for that, for no particular reason

Sometimes it's even the Missouri state governor doing that too.

Symbiote4y ago

It's difficult to know how quickly word could have spread, but I enjoy knowing a few 11 year olds learned something about the Internet in order to play a game an hour early.

Twirrim4y ago

With social media etc, I can see it spreading really fast. That would be my bigger fear trying to get a service back up from a very long outage like that.

fragmede4y ago

selectodude4y ago

I think it mostly consisted of "KEEP PRESSING REFRESH AND YOU'LL GET LET IN AT SOME POINT" so there wasn't any additional unplanned load for Roblox.

buryat4y ago

enjoyed as in having dedicated fans that would jump through hoops to have access

kjw4y ago· 5 in thread

otterley4y ago

baskethead4y ago

This is not true, if they handled the rollout properly. Companies like Uber have two entirely different data centers, and during outages they fail over to either datacenter.

noahtallen4y ago

nomel4y ago

> But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up

You're assuming the average profits lost are more than the average cost of doing things differently, which, according to their statement, is not the case.

dylan6044y ago

>I wonder about their ability to recruit the level of talent required to run a service at this scale.

According to this user's comments, it doesn't look like it'll be that tough for them:

https://news.ycombinator.com/item?id=30014748

NightMKoder4y ago· 4 in thread

throwdbaaway4y ago

Dissecting this paragraph from the post-mortem...

> When a Roblox service wants to talk to another service, it relies on Consul to have up-to-date knowledge of the location of the service it wants to talk to.

OK.

> However, if Consul is unhealthy, servers struggle to connect.

Why? The local "client-side" consul agents running on each hosts should be the authoritative source for service discovery, not the "server-side" consul agents running on the 5 voter nodes.

> Furthermore, Nomad and Vault rely on Consul, so when Consul is unhealthy, the system cannot schedule new containers or retrieve production secrets used for authentication.

Now that's one very bad setup, similar to deploying all services in a single k8s cluster.

NightMKoder4y ago

Didn’t realize consul had that. Seems like the right approach - though I wonder why Roblox wasn’t using it.

And don’t run dev stuff in your prod kube cluster. Just…don’t.

tptacek4y ago

Can you say more about service discovery "loving stale data"? Loves in the sense of "generates a lot of it; is constantly plagued by it"?

boulos4y ago

Their comment implies "are totally fine with stale data".
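
A minimal sketch of what "totally fine with stale data" can look like on the client side. This is not Consul's API: the class name, the `lookup` callable, and the TTL are all hypothetical, and it assumes that an old address list is better than no answer while the discovery backend is unhealthy.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class StaleTolerantCache:
    """Caches service addresses and keeps serving the last known answer
    when the upstream lookup fails."""

    def __init__(self, lookup: Callable[[str], List[str]], ttl_seconds: float = 5.0):
        self._lookup = lookup  # hypothetical upstream call, e.g. a catalog query
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, service: str, now: Optional[float] = None) -> List[str]:
        now = time.monotonic() if now is None else now
        cached = self._entries.get(service)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough, skip the upstream call
        try:
            addresses = self._lookup(service)
        except Exception:
            if cached is not None:
                return cached[1]  # upstream unhealthy: serve stale data
            raise  # nothing cached, so the failure has to surface
        self._entries[service] = (now, addresses)
        return addresses
```

The trade-off is that callers may briefly talk to an instance that has moved, so this only makes sense where a failed connection is cheaper than a failed lookup.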

stuff4ben4y ago· 4 in thread

Sounds like they need to switch to Kubernetes?

I kid of course. One of the best post-mortems I've seen. I'm sure there are K8s horror stories out there of etcd giving up the ghost in a similar fashion.

schoolornot4y ago

The one thing you can say about Nomad is that it's generally incredibly scalable compared to Kubernetes. At 1000+ nodes over multiple datacenters, things in Kube seem to break down.

tapoxi4y ago

Do they still? GKE supports 15,000 nodes per cluster.

spydum4y ago

you joke, but it's precisely this:

>Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.

YATA04y ago

>Sounds like they need to switch to Kubernetes?

Hah! Good one!

ctvo4y ago· 3 in thread

(Ignoring the points around observability dependencies on the system that went down causing the failure to be extended)

yashap4y ago

I went into more detail here: https://news.ycombinator.com/item?id=30015826

faitswulff4y ago

It's possible they deal with so much load that they considered a day's worth of traffic to be sufficient load testing:

> The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why it’s performance had changed.

And a short note later on how much load their caching system sees:

> These databases were unaffected by the outage, but the caching system, which regularly handles 1B requests-per-second across its multiple layers during regular system operation, was unhealthy.

tptacek4y ago

That doesn't sound accurate. Wasn't the major change they ended up rolling back Consul streaming, which they'd enabled months before, and had been slowly rolling out?

wizwit9994y ago· 3 in thread

Seems like the smoking gun, this should have been identified and rolled back much earlier.

Karrot_Kream4y ago

wizwit9994y ago

yuliyp4y ago

willcipriano4y ago· 3 in thread

Rygian4y ago

If the situation is serious enough, you'll have several layers sitting together at the status update meetings to hear it straight from the dog's mouth.

tacLog4y ago

I am sorry, I didn't have enough context to understand what you're saying.

When you say: status update chain: ceo --> me. What information is flowing from the CEO to you? or is it the other way around?

willcipriano4y ago

Both directions, he is asking "What is going on" and I am telling him. As the org gets larger the request to know what is going on passes down the chain and the reply passes back up.

johnmarcus4y ago· 2 in thread

Yeah, don't use consul as redis, they are not the same.

stuff4ben4y ago

But you can... which is what some engineers were thinking. In my experience they do this because:

A) they're afraid to ask for permission and would rather ask for forgiveness

B) management refused to provision extra infra to support the engineers' need, but they needed to do this "one thing" anyways

aprdm4y ago

Yes, this and having such a big consul cluster where the recommendation is to have more smaller clusters.

That said, could've happened to anyone and it was a great write up.

ryanworl4y ago· 2 in thread

Unfortunate, given it has been around for a while.

https://www.alibabacloud.com/blog/594750

throwdbaaway4y ago

I think they just made the switch to the fork that does contain the freelist improvement in https://github.com/hashicorp/consul/pull/11720

Took a major incident to swallow your pride? (consul, powered by go.etcd.io/bbolt)

ryanworl4y ago

Is this option enabled by default? I don't think it is, and I don't think they actually set it manually anywhere.

EDIT: I think we're talking about two different options. I meant the ability to leave sync turned on but change the data structure.

kalev4y ago· 2 in thread

stingraycharles4y ago

oars4y ago

If I wasn't using AWS I would have no idea how to do this.

Quantumhunk4y ago· 2 in thread

What I learned from this is that the issue was partly caused by improper use of Go channels and by the open-source product BoltDB.

ekimekim4y ago

* How their system architecture made them particularly vulnerable to this kind of issue

* Their actions to diagnose and attempt to mitigate the issue

* The whole later part about effectively cold-starting their entire infrastructure, all while millions of users were banging on their metaphorical door to start using the service again.

CyanLite24y ago

That and going all-in on Hashicorp.

londons_explore4y ago· 2 in thread

I think this outage was made worse by them not being properly in a big cloud provider.

When you have your own hardware, you can really only try one thing at a time.

KronisLV4y ago

> When you have your own hardware, you can really only try one thing at a time.

londons_explore4y ago

k8sToGo4y ago· 2 in thread

Does anyone know what tool this one is?

https://blog.roblox.com/wp-content/uploads/2021/11/3-perf-re...

Is it really perf?

ketanhwr4y ago

It's perf-report[0] which reads the output of a perf data file and displays the profile.

[0]: https://man7.org/linux/man-pages/man1/perf-report.1.html

doublerabbit4y ago

> /wp-content/uploads/2021/11/3-perf-report.png

It's perf.

867-53094y ago· 2 in thread

>for our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem

or is it more to do with avoiding the herd, i.e. less trafficky routes / beating the queues?

it's also unclear whether they use their own hardware on rented rackspace as that could potentially lower costs too

mike_d4y ago

Cloud providers are rarely in cities. Google's biggest region is in the middle of Iowa, Amazon's is in Virginia.

If you have a latency-sensitive application (like multiplayer games), it makes sense to put a few servers in each of 100 locations rather than concentrate them in a half dozen cloud regions.

InsomniacL4y ago

> it makes sense to put a few servers in each of 100 locations rather concentrate them in a half dozen cloud regions.

Large cloud providers have a backbone network with interconnects to many ISPs reducing the amount of Hops a client has to take across the internet.

> Doing it yourself will always be cheaper

Treating the cloud as a traditional IaaS datacenter extension will be more expensive. Utilising PaaS, and only using the resources that are needed when they're needed, is much cheaper.

statguy4y ago· 2 in thread

So the outage lasted 3 days and the postmortem took 3 months!

koshergweilo4y ago

They wanted to make sure everything was fixed before publishing

Operyl4y ago

They just got out of their busiest time of year, and taking the time to write an accurate post mortem with data gleaned afterwards seems sensible to me.

chainwax4y ago· 1 in thread

Neil444y ago

ineedasername4y ago· 1 in thread

">circular dependencies in our observability stack"

This appears to be why the outage was extended, and was referenced elsewhere too. It's hard to diagnose something when part of the diagnostic tool kit is also malfunctioning.
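
One hedged way to catch this kind of thing ahead of time is to write the dependencies down as a graph and check it for cycles. The sketch below is a plain depth-first search. The graph contents are made up for illustration and are not Roblox's real topology.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in deps}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        stack.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Hypothetical graph: the tools used to debug service discovery
    # themselves depend on service discovery.
    graph = {
        "dashboards": ["metrics-pipeline"],
        "metrics-pipeline": ["service-discovery"],
        "service-discovery": ["dashboards"],
    }
    print(find_cycle(graph))
```

Running a check like this in CI against a declared dependency map would not prevent an outage, but it could flag when the diagnostic tooling sits downstream of the thing it is meant to diagnose.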

phgn4y ago

Like the Facebook outage a few months ago, when their DNS being down prevented them from communicating internally.

fifticon4y ago· 1 in thread

It is annoying, because the tools actually run perfectly fine on a local desktop, once you are past the "mothership handshake". I spent that week reading roblox dev documentation instead.

southerntofu4y ago

snwfog4y ago· 1 in thread

Is there any tutorial on how to get a perf report like the one shown in this screenshot? https://blog.roblox.com/wp-content/uploads/2021/11/3-perf-re...

TheDong4y ago

Yes. It's the default output for "perf report". I recommend reading this: https://www.brendangregg.com/perf.html

However, the short 2-line way to get that output is the following:

    perf record -F99 -g --pid $(pidof consul)
    # Wait a few seconds and hit ctrl-c
    perf report

You'll get similar output to what they show if you have consul running with a similar load :)

captaincaveman4y ago· 1 in thread

Sounds like they didn't check what had changed first, before starting to fix things with best guesses ... not saying I wouldn't do the same, but arguably lost them a lot of time.

k8sToGo4y ago

They were aware of the changes, but as they stated: it seemed to be working fine and, therefore, was ruled out early on as a potential problem.

sjtindell4y ago

zomglings4y ago

alpb4y ago

elfchief4y ago

One thing I don't see mentioned -- why is the write load so high? Can anyone from Roblox say? (I have a specific reason for asking.)

fasteo4y ago

>>> The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.

That's impressive.

jeffrallen4y ago

Would be interesting to compare this result to the classic paper on Tandem failures:

LaserToy4y ago

Just curious, does Roblox push engineers to learn the internals of the critical software they operate, or do they lean on vendors?

If vendors, it is reckless.

nanis4y ago

> 50th percentile

I would normally not call this out, but it is repeated so often in the text that it is jarring. Just call it "median" as it is everywhere else, please.

On the other hand, I must commend the author(s) for not using "based off of" :-)

Great write-up, otherwise.

qaq4y ago

Love NATS for not having to deal with service discovery at all.

tlynchpin4y ago

warning, completely pedantic pet peeve.

> Note all dates and time in this blog post are in Pacific Standard Time (PST).

But the incident was during PDT. Just use UTC or colloquial "Pacific time" or equiv and never be wrong!

My heart goes out to these people. I can imagine how much sustained terror they were feeling, stare hard and harder at your terminals and still nothing makes sense.
