It's like 20x better than every other product out there though. And their new video chat + screen sharing is pretty great. The bandwidth is far higher than any competitor I've used.
My brother and I were playing 1080p videos on each of our screens and watching the other's, just to test it out. Obviously it wasn't full quality, but it kept the frame rate up and looked presentable at least to 720p.
And their lack of scalable monetization leaves me worried about its long-term success as a platform - they are adding more cost-intensive features and continuing to try to support it with what is essentially a $5 monthly donation model.
I have used Discord quite a bit for gaming, but it hasn't proved a better option than Slack to me (at least in a work context).
We're stuck with Skype for official stuff and telephony though.
We believe this triggered a bug in the redis-py Python driver we use (specifically this one: https://github.com/andymccurdy/redis-py/pull/886), which is why we had to do a rolling restart of our API cluster in the first place, to get the connection pools back into a working state. redis-sentinel had appropriately detected the instance going away, and initiated a fail-over almost immediately after the instance went offline, but due to the odd network situation caused by the migration (absolute packet loss instead of connections being reset), the client driver was unable to properly fail over to the new master. We already have work planned for our own connection pooling logic for redis-py, as right now the state of the driver in HA redis is actually pretty awful, and the maintainer doesn't appear to have the time to close or look at PRs that address these issues (we opened one in March that fixes a pretty serious bug during fail-over, https://github.com/andymccurdy/redis-py/pull/847, and it has yet to be addressed).
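To illustrate the failure mode: when packets are silently dropped, a pooled connection never sees a reset, so only a timeout reveals that the old master is gone. Below is a minimal sketch of a pool that treats a timeout like a reset, throws away its pooled connections, and asks for the master address again. It is not redis-py's API and not the pooling logic mentioned above; `FailoverPool`, `resolve_master`, `connect`, and the connection's `send`/`close` methods are all hypothetical names made up for the example.

```python
import socket


class FailoverPool:
    """Toy pool that re-resolves the master after a timeout or reset.

    `resolve_master` and `connect` are hypothetical callables supplied by
    the caller; they stand in for "ask Sentinel" and "open a socket".
    """

    def __init__(self, resolve_master, connect, timeout=0.5):
        self._resolve_master = resolve_master  # returns (host, port)
        self._connect = connect                # connect(addr, timeout) -> conn
        self._timeout = timeout
        self._idle = []                        # idle connections to current master
        self._master = None                    # cached master address

    def _get(self):
        # Look up the master lazily, then reuse an idle connection if any.
        if self._master is None:
            self._master = self._resolve_master()
        if self._idle:
            return self._idle.pop()
        return self._connect(self._master, self._timeout)

    def reset(self):
        # Drop every pooled connection and forget the cached master.
        for conn in self._idle:
            conn.close()
        self._idle.clear()
        self._master = None

    def execute(self, command, retries=1):
        for attempt in range(retries + 1):
            conn = self._get()
            try:
                reply = conn.send(command)
            except (socket.timeout, ConnectionError):
                # A timeout (black-holed packets) is handled the same way
                # as a reset: assume the master moved and start over.
                conn.close()
                self.reset()
                if attempt == retries:
                    raise
                continue
            self._idle.append(conn)
            return reply
```

The key design choice is that a timeout invalidates the whole pool, not just the one connection, since every other idle connection points at the same dead address. This only helps if the timeout is short enough to matter; with no timeout set, a black-holed connection can block indefinitely.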
At this scale, it seems like it may be warranted to start doing reliability testing in production, like Netflix does.
At the end I see mention of a library with flaws. I am curious as to which library that is, given I develop some projects in Elixir.
Reliability testing is definitely something we're interested in as we spin up more SRE/reliability-focused individuals, but it probably also has the lowest benefit for the cost for us (compared to engineering effort on improving the things we know need work). Some of the failures we experienced in the system are related to issues we know about, but haven't prioritized (read: had time for) yet.
For the library, we believe the bug is related to hackney and the fact that it uses the high priority setting for its pool process. For some reason (this is the part we're not entirely sure about, and are still spending some time investigating) this high priority process got stuck and consumed all of the scheduler time (presumably related to the earlier API degradation), breaking the distribution port and the application in a weird way. Oddly enough the systems we run on are SMP, so in theory one rogue process should not be able to have this effect.
The solution we've come up with so far is to disable automatic migrations. Not sure if that option actually does anything.