Discord Postmortem from Friday

(status.discordapp.com)

96 points · b1naryth1ef · 8y ago · 42 comments

atomical · 8y ago

Does anyone use discord for work?

katastic · 8y ago

I have informally with co-workers. But not in any official capacity.

It's like 20x better than every other product out there though. And their new video chat + screen sharing is pretty great. The bandwidth is far higher than any competitor I've used.

My brother and I were playing 1080p videos on each of our screens and watching the other's, just to test it out. Obviously it wasn't full quality, but it kept the frame rate up and looked presentable at least to 720p.

avree · 8y ago

Weird, I find Discord's audio quality especially to be terrible.

And their lack of scalable monetization leaves me worried about its long-term success as a platform - they are adding more cost-intensive features and continuing to try to support it with what is essentially a $5 monthly donation model.

andyfleming · 8y ago

I briefly thought about whether it would be a better option than Slack. However, Slack seems a little more polished in the areas that we actually use. The subtle, better design in places like the core chat experience isn't worth giving up.

I have used Discord quite a bit for gaming, but it hasn't proved a better option than Slack to me (at least in a work context).

sbarre · 8y ago

We've debated it internally, but have decided to stick with Slack (free version) for now.

eslachance · 8y ago

I'll need to echo some of the other comments here - at work we're pretty much stuck with Skype for Business (aka "Lync") because the voice system ties into it. When I suggested Discord, it was shrugged off because it's "Chat for Gamers", and from all the recent features I've seen, they're clearly not going to budge from that niche anytime soon.

synicalx · 8y ago

Unofficially, for "out of band" discussions and also for on-call events where we need to collab on stuff.

We're stuck with Skype for official stuff and telephony though.

jakebasile · 8y ago

We did, but due to some people having image issues about using a “gamer tool”, we were forced to switch back to the inferior Slack.

ZeroCool2u · 8y ago

I'm really impressed. I was using Discord for most of this weekend, specifically Friday and Saturday. Never noticed any issues.

aefx · 8y ago

I was using it on Friday. My friend couldn't connect and I had trouble jumping into voice chat. I closed the client and was able to log in 10 minutes later. Overall I only noticed an issue for about 15-20 minutes. Having just read the post mortem I'm pretty impressed with their service and operations.

gizmo385 · 8y ago

They don't have a posted outage for Saturday, but I noticed issues with it on Saturday evening/night I believe. I'm wondering if it was related to the issues that they included in the post-mortem.

b1naryth1ef (OP) · 8y ago

Very possible you saw a slight interruption around 11:30 PST, lasting about 10 minutes, until we found and decommissioned the host that experienced this problem. We generally don't update status until we can verify impact/source, since we see tons of limited outages from ISPs misbehaving.

s_kilk · 8y ago

I was recording a podcast through Discord, and got hit by this particular outage. To be fair, it’s the first I’ve seen first hand so

jhgg · 8y ago

It's worth noting that the instance migration basically null-routed the redis VM for a good 30 minutes, until we manually intervened and restarted it. The instance was completely disconnected from the internal network immediately following the migration. From what we could gather from instance logs, the routing table on the VM was completely dropped and it could not even connect to the magic metadata service (metadata.internal - we saw "no route to host" errors for that). This is a pretty serious bug within GCP and we've already opened a case with them hoping they can get a fix. I think this is the 4th or 5th major bug we've encountered with their live migration system that could have led, or has led, to an outage or internal service degradation. GCP team has seriously investigated and fixed every bug we've reported to them so far, so props to them for that! Live migration is incredibly difficult to get right.

We believe this triggered a bug in the redis-py Python driver we use (specifically this one: https://github.com/andymccurdy/redis-py/pull/886), which is why we had to do a rolling restart of our API cluster in the first place, to get the connection pools back into a working state. redis-sentinel had appropriately detected the instance going away and initiated a fail-over almost immediately after the instance went offline, but due to the odd network situation caused by the migration (absolute packet loss instead of connections being reset), the client driver was unable to properly fail over to the new master. We already have work planned for our own connection pooling logic for redis-py, as right now the state of the driver in HA redis is actually pretty awful, and the maintainer doesn't appear to have the time to close or look at PRs that address these issues (we opened one in March that fixes a pretty serious bug during fail-over, https://github.com/andymccurdy/redis-py/pull/847, and it has yet to be addressed).
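
Editor's sketch: the failure mode described above (packets silently dropped, with no connection reset) can be illustrated with plain sockets. This is not redis-py's API and not Discord's planned pooling logic. It is a minimal sketch under the assumption that every socket carries a read timeout and that a hypothetical `resolve_master` callable (standing in for "ask redis-sentinel who the master is") is consulted again after any failure. The class name `FailoverConnection` is likewise made up for illustration.

```python
import socket
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]


class FailoverConnection:
    """Holds one TCP connection to whatever `resolve_master` currently returns.

    `resolve_master` is a hypothetical stand-in for a sentinel lookup.
    """

    def __init__(self, resolve_master: Callable[[], Address], timeout: float = 0.5):
        self._resolve_master = resolve_master
        self._timeout = timeout
        self._sock: Optional[socket.socket] = None
        self.address: Optional[Address] = None

    def _connect(self) -> socket.socket:
        # Re-resolve on every (re)connect so a fail-over is picked up.
        self.address = self._resolve_master()
        # The timeout applies to the connect and to later send/recv calls.
        return socket.create_connection(self.address, timeout=self._timeout)

    def _discard(self) -> None:
        if self._sock is not None:
            try:
                self._sock.close()
            except OSError:
                pass
            self._sock = None

    def request(self, payload: bytes, retries: int = 1) -> bytes:
        for attempt in range(retries + 1):
            try:
                if self._sock is None:
                    self._sock = self._connect()
                self._sock.sendall(payload)
                reply = self._sock.recv(4096)
                if not reply:
                    raise ConnectionError("peer closed the connection")
                return reply
            except OSError:
                # Timeouts and resets both land here. Without a timeout, a
                # black-holed peer would block recv() indefinitely instead.
                self._discard()
                if attempt == retries:
                    raise
        raise RuntimeError("retries must be >= 0")
```

The design choice being illustrated is that a timeout is treated exactly like a reset: the socket is thrown away and the master is looked up again, rather than the dead connection being returned to a pool.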

fulafel · 8y ago

For those of us unfamiliar with GCP, do you mean that the default-route of your VM was unable to route its traffic? Or is there a routing config running on customer VMs that GCP live-manages?

b1naryth1ef (OP) · 8y ago

GCP has a virtual networking stack to support a bunch of crazy (and awesome) features Google has built. Unfortunately the complexity here seems to hurt power-users like us. In this case it appears that for some unknown reason the node failed to program its network stack when coming up, meaning it was completely unavailable (even the metadata service used internally by google failed).

cordite · 8y ago

The level of detail and linearity is impressive.

At this scale, it seems like it may be warranted to start using reliability testing in production, like Netflix does.

At the end I see mention of a library with flaws. I am curious as to which library that is, given I develop some projects in Elixir.

b1naryth1ef (OP) · 8y ago

Thanks, we try our best with these. Past experience has shown they can be very valuable, and help everyone at the company get context on the system and how we handle failures.

Reliability testing is definitely something we're interested in as we spin up more SRE/reliability focused individuals, but it probably also has the worst cost-benefit ratio for us (compared to engineering effort on improving the things we know need work). Some of the failures in the system we experienced are related to issues we know about, but haven't prioritized (read: had time for) yet.

For the library, we believe the bug is related to hackney and the fact it uses the high priority setting for its pool process. For some reason (this is the part we're not entirely sure on, and still spending some time investigating) this high priority process got stuck and consumed all of the scheduler time (presumably related to the earlier API degradation), breaking the distribution port and the application in a weird way. Oddly enough the systems we run on are SMP, so in theory one rogue process should not be able to have this effect.
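
Editor's sketch: why a stuck high-priority process is dangerous can be shown with a toy model. Erlang's documentation for `process_flag(priority, Level)` states that while processes at priority high are runnable, no normal or low priority processes are selected for execution. The code below is a deliberately simplified Python model of a single run queue under that rule. It is not hackney's code or the BEAM's real scheduler, the names `Proc` and `schedule` are made up, and it does not explain the SMP behaviour that the comment above says is still under investigation.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class Proc:
    name: str
    priority: str           # "high" or "normal"
    slices: Optional[int]   # time slices of work left; None = never finishes


def schedule(procs: List[Proc], ticks: int) -> List[str]:
    """Return which process ran on each tick of one strict-priority run queue."""
    queues: Dict[str, Deque[Proc]] = {"high": deque(), "normal": deque()}
    for proc in procs:
        queues[proc.priority].append(proc)

    trace: List[str] = []
    for _ in range(ticks):
        # Strict priority: normal work runs only when no high work is runnable.
        queue = queues["high"] or queues["normal"]
        if not queue:
            break
        proc = queue.popleft()
        trace.append(proc.name)
        if proc.slices is None:
            queue.append(proc)          # stuck process: always runnable again
        else:
            proc.slices -= 1
            if proc.slices > 0:
                queue.append(proc)      # still has work left
    return trace


# A high-priority pool process that never finishes starves the worker.
starved = schedule([Proc("pool", "high", None), Proc("worker", "normal", 3)], ticks=6)

# At equal priority the same two processes simply take turns.
fair = schedule([Proc("pool", "normal", None), Proc("worker", "normal", 2)], ticks=5)
```

In this model `starved` contains only `"pool"`, while `fair` alternates between the two names until the worker finishes.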

cordite · 8y ago

That is indeed very odd! Thank you for sharing. Hackney, through another library, is used in a telegram api wrapper that I wrote up. Though my stuff usually runs on a $5 vps, nothing with multiple cores.

humanfromearth · 8y ago

We had the exact same issue with RMQ (HA setup) on GCP (running on GKE) a few weeks ago. Tried contacting support about this, it's paid - no customer support for their own bugs.

The solution we came up with so far is to disable automatic migrations. Not sure if that option actually does anything.

sleepydog · 8y ago

You can't disable automatic migrations in GCP. You can choose between allowing live migrations (move the instance while it's still running) or (hard) instance reboots.

humanfromearth · 8y ago

You're right. I meant hard reboots.

phreack · 8y ago

Ever since they launched screen sharing, I've uninstalled both Skype and Hangouts and relied entirely on it for pair programming sessions. The smoothness of the reproduction is just incredible, and I don't see myself going back soon.

lwansbrough · 8y ago

Funny definition of HA. :)
