[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."
I would happily still use Discord if they provided the exact same thing with a monthly fee. Hopefully at some point they throw in some extra features for a pro version and start charging.
I just use Discord for gaming and haven't used Slack a lot, but I think Discord will be great for work as soon as they release search.
[0]: www.discordapp.com
> mine that data [...] more addictive [...] games
Oh, fuck no. No, no, no. I can not think of a more abusive thing for a company to do to its customers than that suggestion right there. How little respect would a company have for their fellow humans that anyone could even consider such a move?
Maybe that's just a failure of imagination on my part, but I'm ok with that.
As soon as screensharing lands it'll be an across-the-board upgrade to HipChat at my work (the only missing feature I can think of is video calls, but quality over HipChat has always been a bit sketchy so we usually fall back to Hangouts).
Ignoring Elixir and Erlang: when you discover you have a backpressure problem, that is, any kind of throttling (connections or req/sec), you need to immediately tell yourself "I need a queue", and more importantly "I need a queue that has prefetch capabilities". Don't try to build this. Use something that's already solid.
I solved this problem 3 years ago, pushing 5M msg/minute _reliably_ without loss of messages. Each message was checked against a couple of rules per user (to not bombard users with messages, to pick the best time to push to a user, etc.), so this adds complexity. Approved messages were then bundled into groups of 1000 and passed on to GCM HTTP (today, Firebase/FCM).
I used Java, Storm, and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.
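To make the "check rules, then bundle into groups of 1000" step concrete, here is a minimal Python sketch. It is not the original Java/Storm code; the function names, the message shape, and the `not_muted` rule are all made up for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def approved(messages: Iterable[dict], rules: List[Callable[[dict], bool]]) -> Iterator[dict]:
    """Yield only the messages that pass every per-user rule (rules are hypothetical)."""
    for msg in messages:
        if all(rule(msg) for rule in rules):
            yield msg


def batches(items: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Group an iterable into lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def not_muted(msg: dict) -> bool:
    """Example rule: skip users who muted notifications (assumed field name)."""
    return not msg.get("muted", False)
```

Each list produced by `batches(approved(msgs, [not_muted]))` would then be handed to whatever sends the bulk request upstream.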
You can also do this with Kafka but it'll be less transactional.
After tackling this problem a couple times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.
The second time I solved this, I used XMPP. I knew there were risks, because essentially I was moving from a stateless protocol to a stateful protocol. In the end, it wasn't worth the effort and I kept using the old system.
This service buffers potential pushes for all users being messaged, then watches the presence system to determine if they are on their desktop or mobile (this is millions of presence watchers and 10s of millions of buffered messages). Users are constantly clearing these buffers by reading on the clients, and finally, when a user is offline or goes offline, we emit their pushes to them (which is what this article talks about). This service evolved from the push system of the game we worked on. When it did pushes only and no other logic, it could push at 1m/sec in batches, but its responsibility has changed.
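For readers trying to picture the buffer-then-emit logic described here, a rough single-process Python sketch follows. It is only an illustration: the `PushBuffer` class and its methods are hypothetical, it collapses desktop/mobile presence into a single online flag, and it says nothing about how the real service is built.

```python
from collections import defaultdict
from typing import Dict, List, Set


class PushBuffer:
    """Hold potential pushes per user until they are read or the user goes offline."""

    def __init__(self) -> None:
        self._pending: Dict[str, List[str]] = defaultdict(list)
        self._online: Set[str] = set()

    def message(self, user: str, text: str) -> List[str]:
        """Buffer a message; return pushes to emit now (only if the user is offline)."""
        self._pending[user].append(text)
        return [] if user in self._online else self._drain(user)

    def read(self, user: str) -> None:
        """The user read their messages on a client, so nothing needs pushing."""
        self._pending.pop(user, None)

    def presence(self, user: str, online: bool) -> List[str]:
        """Record a presence change; going offline emits whatever is still buffered."""
        if online:
            self._online.add(user)
            return []
        self._online.discard(user)
        return self._drain(user)

    def _drain(self, user: str) -> List[str]:
        """Remove and return everything buffered for the user."""
        return self._pending.pop(user, [])
```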
Context matters :)
Then I would just say, XMPP is new and fancy, but consider the old-fashioned stateless HTTP interface. When I was implementing my own service, I was worried Google was not going to handle the load. Since we were partners with Google for a good while, I was able to climb the ladder of people to get an answer, and plow through their closed-door policy for questions such as "Will you guys handle this load? (5M msg/min)". I wrote a huge email explaining every edge case and what I was doing. The answer was "We will handle it.". No detail, no context, no buts. I wasn't confident at all. But in the end, they did handle it :)
For example, workers could discard messages older than some threshold, quickly emptying the queue if there are expired messages. Clients might not even queue messages if the queue is currently too long, perhaps even providing a convenient signal for them to back off from their most chatty behaviour.
Some messages will not be delivered on time if there is significant backpressure. There is not much you can do about it, apart from avoiding choking yourself.
Perhaps the queue could work with a LIFO policy, to help at least some messages go through in time instead of having most messages delayed until close to the expiration threshold.
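The three ideas above (expire old messages, refuse new ones when the queue is too long, optionally serve newest-first) fit in a small Python sketch. This is an in-memory toy under my own assumptions, not a replacement for a real broker; the `ExpiringQueue` name and its parameters are invented for the example.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class ExpiringQueue:
    """Bounded queue whose consumers skip messages older than `ttl` seconds."""

    def __init__(self, max_len: int, ttl: float, lifo: bool = False) -> None:
        self.max_len = max_len
        self.ttl = ttl
        self.lifo = lifo
        self._items: Deque[Tuple[float, str]] = deque()

    def offer(self, msg: str, now: Optional[float] = None) -> bool:
        """Enqueue unless full; False is the signal for the client to back off."""
        if len(self._items) >= self.max_len:
            return False
        stamp = time.monotonic() if now is None else now
        self._items.append((stamp, msg))
        return True

    def take(self, now: Optional[float] = None) -> Optional[str]:
        """Return the next unexpired message, discarding expired ones on the way."""
        now = time.monotonic() if now is None else now
        while self._items:
            # LIFO serves the newest message first, FIFO the oldest.
            stamp, msg = self._items.pop() if self.lifo else self._items.popleft()
            if now - stamp <= self.ttl:
                return msg
        return None
```

With `lifo=True`, once the newest message has expired everything behind it has too, so a single `take` empties the queue.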
Could you explain why using RabbitMQ is more transactional?
It seems to have totally taken over a space that wasn't even clearly defined before they got there.
But at some point we heard of Discord, which posed itself as a chat/vent replacement, started using it, and it just works. Which is huge, since the other stuff generally didn't (Axon was actually good).
TS supports plugins. No lag issues, can run on your own server. Closed source.
Mumble is open source, no lag issues. Interface is slightly less good than TS. Supports SSL.
Discord, like you say, Just Works (tm). It is very easy to use, the interface is amazing, it's in active development, and setting up a server is free. It also works in the web browser.
If you're into Blizzard games, Battle.net recently added native VoIP in their client. The advantage that has as a Blizzard gamer is you don't have to install any 3rd party software.
This is what makes Discord so good. Before Discord, when I wanted to play online with a friend for the first time, I'd have to convince them to download Mumble or Ventrilo, teach them how to connect to a server, and help them set up their mic. With Discord I just send them a link and we're talking. They can get the client later, and having a persistent chat area is a fun way to build up a sense of community.
I feel like everything done with Discord could be done with IRC in an open source way. IRC for the 21st Century?
It's just too bad there are a dozen IMs/voice/video and a dozen slacky/feed companies.
Basically it is just some type of feedback so that you don't overload subsystems. One of the most common failure modes I see in load balanced systems is when one box goes down the others try to compensate for the additional load. But there is nothing that tells the system overall "hey there is less capacity now because we lost a box". So you overwhelm all the other boxes and then you get this crazy cascade of failures.
This is a good article about overload and back pressure. It also lists some tools in Erlang to solve these sorts of issues, and mentions GenStage (very) briefly.
Corollary: If you have 2 boxes, each of them has to be able to handle all the traffic, so you can't save money by using smaller boxes :D
Corollary #2: If you have 2 datacenters, each of them has to be able to handle all the traffic, so you burn a lot of money :D
Hopefully your clients retry server errors with exponential backoff, if you lose a datacenter you can send half of the requests 503 until you're back at a manageable load.
Hopefully detecting the load/generating the 503 is really cheap.
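A rough Python sketch of both halves, assuming the server can cheaply estimate its current load and its capacity (how you measure those is the hard part and is not shown). All function names and defaults here are illustrative.

```python
import random


def shed_fraction(current_load: float, capacity: float) -> float:
    """Fraction of requests to answer with 503 so the accepted load fits capacity."""
    if current_load <= capacity:
        return 0.0
    return 1.0 - capacity / current_load


def should_reject(fraction: float, rng: random.Random) -> bool:
    """Cheap per-request decision: one random draw, no I/O."""
    return rng.random() < fraction


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Client-side exponential backoff in seconds: base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

Losing one of two equal datacenters doubles the load on the survivor, so `shed_fraction(200, 100)` gives 0.5, which matches the "send half of the requests 503" figure. Real clients usually add random jitter to the delay so retries don't arrive in synchronized waves.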
It's also part of http in a lot of ways. Browsers can make N simultaneous connections to a server. If those requests get queued up on the server side, then the client won't proceed with a new request until one finishes.
The pattern I see all too frequently is asynchronous worker queues. Those can very easily undermine backpressure. The rise of Ruby/Python and their limited concurrency models has placed a taboo on synchronous operations. However, synchronicity naturally lends itself to backpressure.
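A small Python illustration of the point, using only the standard library: with a bounded queue and a blocking `put`, the producer is forced to slow down to the consumer's pace, so the backlog can never grow past the bound. The `run` function and its numbers are arbitrary choices for the demo.

```python
import queue
import threading
import time


def run(total: int = 20, depth: int = 2) -> int:
    """Push `total` jobs through a bounded queue; return the largest backlog seen."""
    jobs = queue.Queue(maxsize=depth)
    max_backlog = 0

    def worker() -> None:
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work
                return
            time.sleep(0.001)  # simulate slow downstream work

    t = threading.Thread(target=worker)
    t.start()
    for i in range(total):
        jobs.put(i)  # blocks while the queue is full: this is the backpressure
        max_backlog = max(max_backlog, jobs.qsize())
    jobs.put(None)
    t.join()
    return max_backlog
```

Swap `queue.Queue(maxsize=depth)` for an unbounded queue and the producer finishes instantly while the backlog grows to nearly `total`, which is exactly how async worker queues hide overload.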
Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"
QPM is a useless metric. When talking about distributed systems from an engineering point of view, you always want to use QPS. QPM is simply not fine-grained enough to show whether the traffic is bursty or not. For example, in this particular case, when you say 1M QPM that can mean anything: they might be idle for 50s and then get 100k QPS for the next ten seconds, or they might be getting 15k QPS all the time (as is visible on the graph). Distributed systems are designed for the peak workload, not for the average one. Using misleading numbers like QPM leads to bad design and sizing decisions.
The only case where you would use QPM, QPD and similar metrics is when you want to artificially show your numbers bigger than they are (10M transactions a day sounds better than 115 transactions a second). But those should be used by sales, not by engineers.
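The arithmetic behind both paragraphs, as a quick Python sketch. The two traffic shapes are the ones from the example above (50s idle then 10s at 100k QPS, versus a flat 15k QPS); the helper names are made up.

```python
from typing import List


def average_qps(per_second: List[int]) -> float:
    """Mean requests per second over the window."""
    return sum(per_second) / len(per_second)


def peak_qps(per_second: List[int]) -> int:
    """Highest one-second count, which is what you have to size for."""
    return max(per_second)


def per_day_to_qps(per_day: int) -> float:
    """Convert a per-day total to an average per-second rate."""
    return per_day / 86_400  # seconds in a day


bursty = [0] * 50 + [100_000] * 10  # 1,000,000 requests in the minute
steady = [15_000] * 60              # 900,000 requests in the minute
```

Both minutes look like "about 1M QPM", but `peak_qps(bursty)` is 100,000 while `peak_qps(steady)` is 15,000, so the first system needs several times the capacity. `per_day_to_qps(10_000_000)` comes out just under 116.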
The requests per minute number is an average.
The requests per second number should be given for peak load. That is a very important metric, a system has to be scaled to sustain the peaks, not the average.
We'd need to know the traffic pattern to know the multiplier, that is certainly not 60 :p
How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?
This system seems great for at most once delivery. I wish I had more problems to solve with that constraint.
What if the Push Collector is down or has a random bug where it throws away XX% of requests for no good reason? How would you know if you don't instrument the other end? Something like StatsD works fantastic for this, but also just logging those failures and using a log search/aggregation tool like Kibana or Splunk would be a step in the right direction.
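As a sketch of how little code the StatsD route needs, here is a counter sent over UDP with only the Python standard library. The `name:value|c` line is the StatsD plain-text counter format and 8125 is its conventional port; the metric names and the localhost address are assumptions for the example.

```python
import socket


def counter_line(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD plain-text format."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send, so a missing collector never blocks the caller."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_line(name, value), (host, port))


def metric_for(ok: bool) -> str:
    """Pick the counter for a push outcome (hypothetical metric names)."""
    return "push.sent" if ok else "push.failed"
```

Calling `send_counter(metric_for(ok))` after every push attempt, on the sending side and again on the receiving side, lets you compare the two counts and spot silently dropped requests.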
We currently have 3 machines doing this for millions of concurrent users. At the writing of this article it was 2 machines.
Edit: Also worth mentioning, the 50k buffer is for a single server, we run multiple push servers in the cluster.
I'd definitely be able to put to use things like flow limiters and queuing and such, but none of my company's projects use Elixir :(
Highly recommend the Reactive series of libs. They're typically very well done.
The guy below is right that Akka is perfectly suited.
Some implementations are listed here: http://www.reactive-streams.org/announce-1.0.0
- 100 /s = typical limit of a standard web application (python/ruby), per core
- 1.000 /s = typical limit of an application running on a full system
- 10.000 /s = typical limit for fast systems (load balancers, DB, haproxy, redis, tomcat...).
- Over 10.000 /s = you gotta scale horizontally because a single box [shouldn't] can't take it.
The difficulty depends on the architecture and what the application has to do (dunno, didn't go through the article). If you make something that can scale by just adding more boxes, then it's trivial: just add more boxes. Well, it's gonna cost money and that's about it.
So no. Not a big deal at all... if you've done that before and you've got the experience :D
Unfortunate that the final bottleneck was an upstream provider, though it's good that they documented rate limits. I feel like my last attempt to find documented rate limits for GCM/APNS was fruitless, perhaps Firebase messaging has improved that?
That makes it possible for one processor-heavy operation to take over and slow down everything else. BEAM ensures that if you have millions of requests coming through and suddenly 1000 4-day-long operations kick off on the machine, the millions of normal, smaller operations continue responding and performing as expected.
Fairly critical for the stability of real time systems.
The other piece here is that these processes are cheaper on the BEAM than any other platform in terms of RAM cost.
0.5 KB per process on the Erlang VM. A goroutine in Golang is the next closest at 2 KB.
The two combined are one of the big reasons why benchmarks don't tell the whole story with Erlang/Elixir. It's harder to measure consistency in the face of bad actors.
GenStage has a lot of uses at scale. Even more so is going to be GenStage Flow (https://hexdocs.pm/gen_stage/Experimental.Flow.html). It will be a game changer for a lot of developers.
The average per minute only gets to be used because many systems have so little load that the number per second is negligible.
So... get 100 firebase accounts and blast them in parallel.