How Discord handles over a million requests per minute with Elixir’s GenStage

(discord.engineering)

382 points · Sikul · 9y ago · 161 comments

coverband9y ago· 21 in thread

Quick serious question: How does this company plan to make money? They're surely well funded[1], but what's their end game?

[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."

robryan9y ago

This worries me a bit. At the moment they are providing the software and hosting it all absolutely free.

I would happily still use Discord if they provided the exact same thing with a monthly fee. Hopefully at some point they throw in some extra features for a pro version and start charging.

I just use Discord for gaming and haven't used Slack a lot, but I think Discord will be great for work as soon as they release search.

logicfiction9y ago

I don't know. I use Slack and Discord a lot. The only advantage Discord has going for it is the voice chat channels, which have much more utility for gaming than business in my opinion. Otherwise Slack is way better at all the other features and I usually feel neutered when I'm using Discord after using Slack all day.

chinhodado9y ago

Realistically, the plan is probably getting acquired by something like Youtube Gaming or Twitch to be part of the platform.

HCIdivision179y ago

They're targeting the Twitch model a bit with customizable bits for a charge. Basically allowing some chat branding for guilds and the like.

corobo9y ago

If they need ideas for a pro version I'd probably pay far too much to be able to record each user into individual audio files (recorded locally to each user and combined on my system) for podcasts, letsplays (YouTube videos), remote meetings, etc

SpaceManiac9y ago

This is doable today. Streams are mixed client-side. The downsides are compression (Opus at 96kbps is pretty good, though) and loss of audio mixing (echo cancellation, noise suppression) secret sauce, though this could definitely be reapplied. If this is worth money to you even with these downsides, hit me up.

nstj9y ago

> Discord is always completely free to use with no gotchas. This means you can make as many servers as you want with no slot limitations. Wondering how we’ll make money? In the future there will be optional cosmetics like themes, sticker packs, and sound packs available for purchase. We’ll never charge for Discord’s core functionality. [0]

[0]: www.discordapp.com

jlarocco9y ago

Something smells very fishy there. I have a hard time believing they got a $30 million investment by pitching stickers, add-on themes, and sound effects...

meddlepal9y ago

They have an awful lot of information about video gamers in conversation history. They could mine that data for game companies and sell it as a way to help companies build better, more addictive and mechanically pleasing games.

chinhodado9y ago

I'm not sure how useful that is. It's not like gamers' opinions about games are hard to come by. Gamers are very vocal about their opinions, so any game developer looking for feedback can just go to Steam/Reddit/NeoGAF/whatever and read to their heart's content, or maybe even communicate directly with their player base.

b1naryth1ef9y ago

We've adamantly stated many a time that we will never sell users' data, or put ads in the app.

falcolas9y ago

Pure opinion:

> mine that data [...] more addictive [...] games

Oh, fuck no. No, no, no. I can not think of a more abusive thing for a company to do to its customers than that suggestion right there. How little respect would a company have for their fellow humans that anyone could even consider such a move?

Maybe that's just a failure of imagination on my part, but I'm ok with that.

ajamesm9y ago

if that kind of information had any value whatsoever, Reddit would be worth billions

0942v86539y ago

Aggregated or Non-identifiable Data: We may also share aggregated or non-personally identifiable information with our partners or others for business purposes.

imglorp9y ago

They're also free international. I bet plenty of governments would pay for a steady feed.

baldfat9y ago

Serious Answer: To Kill IRC for a whole generation

koolba9y ago

I thought that was Slack's plan.

lmm9y ago

You didn't answer the question. How does that make money? If it's about killing IRC, who's paying for that?

jamie_ca9y ago

I sure hope one possible end-game is to re-skin it without the gaming focus and price it competitively to Hipchat/Slack.

As soon as screensharing lands it'll be an across the board upgrade to Hipchat at my work (only missing feature I can think of is video calls, but quality over hipchat has always been a bit sketchy so we usually fall back to Hangouts).

lsmarigo9y ago

probably also working on a premium service with a monthly charge to go along with the free tier

mevile9y ago

They're in the gaming market, there's plenty of cash. Even if they just went the advertisement route, which I doubt they will, they should end up doing well.

jondot9y ago· 10 in thread

Hate to be a party pooper, but I'd like to give people here a more generic mental tool to solve this problem.

Ignoring Elixir and Erlang - when you discover you have a backpressure problem, that is, any kind of throttling (connections or req/sec), you need to immediately tell yourself "I need a queue", and more importantly "I need a queue that has prefetch capabilities". Don't try to build this. Use something that's already solid.

I solved this problem 3 years ago, having 5M msg/minute pushed _reliably_ without loss of messages, and each of these messages was checked against a couple of rules per user (to not bombard users with messages, when is the best time to push to a user, etc.), so this adds complexity. Later, approved messages were bundled into groups of 1000 and passed on to GCM HTTP (today, Firebase/FCM).

I've used Java and Storm and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.

You can also do this with Kafka but it'll be less transactional.

After tackling this problem a couple times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.

On the second time I've solved this, I've used XMPP. I knew there were risks, because essentially I'm moving from a stateless protocol to a stateful protocol. Eventually, it wasn't worth the effort and I kept using the old system.

Vishnevskiy9y ago

I think you misunderstand the problem we are solving here. We are not trying to solve this because our system can't handle it. We are protecting it from when Firebase decides to slow down in a way that causes data to back up and OOM the system. Since these are push notifications that have a time bound on usefulness, we don't care about dumping to an external persisted queue like RabbitMQ or Kafka (we'd rather deliver newer notifications faster than wait for the backed-up buffer to flush). Firebase also only allows 1000 concurrent connections per senderId with 100 inflight pushes each (that have not received an ack), which means that only 100,000 can be inflight. Ultimately, if a remote service is providing backpressure because it is struggling, no amount of auto scaling on your end is going to help you.

This service buffers potential pushes for all users being messaged, then watches the presence system to determine if they are on their desktop or mobile (this is millions of presence watchers and 10s of millions of buffered messages). Users are constantly clearing these buffers by reading on the clients, and finally, when a user is offline or goes offline, we emit their pushes to them (which is what this article talks about). This service evolved from the push system of the game we worked on; when it did pushes only and no other logic it could push at 1m/sec in batches, but its responsibility has changed.

Context matters :)
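For readers who want the idea in code: a minimal Python sketch of a bounded buffer that sheds the oldest entries when full, plus the in-flight ceiling implied by the limits quoted above. This is not Discord's implementation and says nothing about GenStage's actual buffer semantics; `DropOldestBuffer` and its methods are hypothetical names.

```python
from collections import deque

# Limits as quoted in the comment above (per senderId).
MAX_CONNECTIONS = 1000
MAX_INFLIGHT_PER_CONNECTION = 100
MAX_INFLIGHT_TOTAL = MAX_CONNECTIONS * MAX_INFLIGHT_PER_CONNECTION  # 100,000


class DropOldestBuffer:
    """Bounded in-memory buffer (hypothetical): when full, the oldest item is discarded."""

    def __init__(self, capacity):
        # A deque with maxlen evicts from the left when a new item is appended.
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # count of shed items, so drops are at least visible

    def put(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(item)

    def take(self, n):
        """Remove and return up to n of the oldest remaining items."""
        batch = []
        while self._items and len(batch) < n:
            batch.append(self._items.popleft())
        return batch
```

Memory stays bounded no matter how slow the downstream service gets, at the cost of losing the oldest notifications first.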

jondot9y ago

Context definitely matters - thanks for the background info. I understood what you're trying to do, incidentally I did the same (all these details sound familiar to me). At the time, Storm helped me batch, break batches and validate discrete units, re-batch, aggregate, repeat that process how many times I wanted, and finally, batch the stream with a strategy I wanted (number of users, messages, or balance number of connections) and deliver to Google.

Then I would just say, XMPP is new and fancy, but consider the old fashioned stateless HTTP interface. When I was implementing my own service, I was worried Google is not going to handle the load. Since we were partners with Google for a good while I was able to climb the ladder of people to get an answer, and plow through their closed-door policy for questions such as "Will you guys handle this load? (5M msg/min)". I wrote a huge email explaining every edge case and what I'm doing. The answer was "We will handle it.". No detail, no context, no buts. I wasn't confident at all. But in the end, they did handle it :)

metafunctor9y ago

Could you not reach pretty much the same result with a queue, though?

For example, workers could discard messages older than some threshold, quickly emptying the queue if there are expired messages. Clients might not even queue messages if the queue is currently too long, perhaps even providing a convenient signal for them to back off from their most chatty behaviour.

Some messages will not be delivered on time if there is significant backpressure. There is not much you can do about it, apart from avoiding choking yourself.

Perhaps the queue could work with a LIFO policy, to help at least some messages go through in time instead of having most messages delayed near to the expiration threshold.
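The three ideas in this comment (expire old messages, refuse new ones when the queue is long, serve newest first) can be sketched in a few lines of Python. This is only an illustration under assumed names (`ExpiringLifoQueue`, `offer`, `poll`), not anything from the article.

```python
import time
from collections import deque


class ExpiringLifoQueue:
    """Hypothetical queue: bounded length, per-message TTL, newest-first delivery."""

    def __init__(self, max_len, ttl_seconds, clock=time.monotonic):
        self._items = deque()  # (enqueue_time, message), oldest on the left
        self._max_len = max_len
        self._ttl = ttl_seconds
        self._clock = clock

    def offer(self, message):
        """Return False (a back-off signal for the producer) instead of queueing when full."""
        if len(self._items) >= self._max_len:
            return False
        self._items.append((self._clock(), message))
        return True

    def poll(self):
        """Return the newest unexpired message, or None if nothing is left."""
        now = self._clock()
        # Expired messages collect at the old (left) end; discard them first.
        while self._items and now - self._items[0][0] > self._ttl:
            self._items.popleft()
        if not self._items:
            return None
        return self._items.pop()[1]  # LIFO: take from the new (right) end
```

Under sustained backpressure this delivers fresh messages promptly and lets stale ones expire, rather than delivering everything late.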

di4na9y ago

Knowing that RabbitMQ is full Erlang, why bring in a really big dependency if you have all the things in your everyday language anyway?

jondot9y ago

An all encompassing answer would be - the abstractions. Why would you use an operating system if really you have your CPU documented and know all of the instruction sets?

gregpardo9y ago

Yeah this... Hey guys you don't need erlang to solve this problem... just use this tool built in erlang.

teacpde9y ago

> You can also do this with Kafka but it'll be less transactional.

Could you explain why using RabbitMQ is more transactional?

chillydawg9y ago

I've no idea about kafka but rabbit offers you message ack/nack and publisher confirms. Generally, you can build very solid things on top of it, depending on whether you need to distribute rabbit or not.

neiled9y ago

Anyone have any nice links to describe more about queue prefetching as described in this case? My google skills are failing because of all the CPU related articles.

jondot9y ago

http://www.rabbitmq.com/consumer-prefetch.html

dimino9y ago· 8 in thread

What is up with Discord? I feel like it's quietly (maybe not so quietly) one of the bigger startups to come out in the last two years.

It seems to have totally taken over a space that wasn't even clearly defined before they got there.

HCIdivision179y ago

It does, doesn't it? I used to use Ventrilo, but then they screwed our small group out of our server connection. And we happily used Dolby Axon for a while. We tried Google Hangouts... for a while; until it just really didn't work well (it just disconnected and crapped out a lot). We tried using the Steam client's chat, but while ok for screensharing, it wasn't so great for chat.

But at some point we heard of Discord, which posed itself as a chat/vent replacement, started using it, and it just works. Which is huge, since the other stuff generally didn't (Axon was actually good).

Fnoord9y ago

Ventrilo is laggy compared to TS and Mumble. It won't show the lag as ms, but it is there, and it is real. It's due to the way the protocol works, or it is the server. It is no longer in development, and you can't even run your own server on your own hardware. The interface is from the 90s. You don't want to use Ventrilo for gaming in 2016.

TS supports plugins. No lag issues, can run on your own server. Closed source.

Mumble is open source, no lag issues. Interface is slightly less good than TS. Supports SSL.

Discord, like you say, Just Works (tm). It is very easy to use, the interface is amazing, it's in active development, and setting up a server is free. It also works in the web browser.

If you're into Blizzard games, Battle.net recently added native VoIP in their client. The advantage that has as Blizzard gamer is you don't have to install any 3rd party software.

lifeformed9y ago

> it just works.

This is what makes Discord so good. Before Discord, when I wanted to play online with a friend for the first time, I'd have to convince them to download Mumble or Ventrilo, teach them how to connect to a server, and help them set up their mic. With Discord I just send them a link and we're talking. They can get the client later, and having a persistent chat area is a fun way to build up a sense of community.

j_s9y ago

And the voice chat web interface just works!

bpicolo9y ago

They had a really well defined user-space, marketed at it well, and really nailed the user experience, while still being free for the typical user. There is a lot to love about Discord.

baldfat9y ago

I use Discord everyday BUT I seriously prefer IRC with weechat and glowing-bear.org.

I feel like everything done with Discord could be done with IRC in an open source way. IRC for the 21st Century?

Apocryphon9y ago

Discord's design feels a lot like Slack. I wonder where the overlap is between the two products.

Numberwang9y ago

I ended up there for the first time last night and must say that there is a lot to like about it. I found some good communities and integrating media and so on all felt quite streamlined, and the system was snappy.

It's just too bad there are a dozen IMs/voice/video and a dozen slacky/feed companies.

jtchang9y ago· 7 in thread

The most important part of this article is the concept of back pressure and being able to detect it. It's common in a ton of other engineering disciplines but especially important when designing fault tolerant or load balancing systems at scale.

Basically it is just some type of feedback so that you don't overload subsystems. One of the most common failure modes I see in load balanced systems is when one box goes down the others try to compensate for the additional load. But there is nothing that tells the system overall "hey there is less capacity now because we lost a box". So you overwhelm all the other boxes and then you get this crazy cascade of failures.

pdexter9y ago

http://ferd.ca/handling-overload.html

This is a good article about overload and back pressure. It also lists some tools in Erlang to solve these sorts of issues. It also mentions genstage (very) briefly.

SikulOP9y ago

This short book by Fred is also a great read about Erlang production systems in general https://www.erlang-in-anger.com/

user59944619y ago

Yes. You need to adapt the capacity of the system to handle the full load with -N- boxes dead.

Corollary: If you have 2 boxes, each of them has to be able to handle all the traffic, so you can't save money by using smaller boxes :D

Corollary #2: If you have 2 datacenters, each of them has to be able to handle all the traffic, so you burn a lot of money :D

xxpor9y ago

Or just the ability to tell your clients to back off.

Hopefully your clients retry server errors with exponential backoff. If you lose a datacenter, you can answer half of the requests with a 503 until you're back at a manageable load.

Hopefully detecting the load/generating the 503 is really cheap.
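A minimal sketch of the client side of this, in Python: retry on 503 with capped exponential backoff and jitter. The function names, defaults, and the injected `send`/`sleep` callables are assumptions for illustration, not any particular HTTP library's API.

```python
import random


def backoff_delays(base=0.5, factor=2.0, cap=60.0, attempts=6, rng=random.random):
    """Yield jittered sleep times, each uniform in [0, min(cap, base * factor**n)]."""
    for n in range(attempts):
        ceiling = min(cap, base * factor ** n)
        yield rng() * ceiling


def call_with_retry(send, sleep, **backoff):
    """Call send() until it returns something other than 503, or retries run out.

    send:  hypothetical callable returning an HTTP status code
    sleep: callable taking seconds (time.sleep in real use)
    """
    status = send()
    for delay in backoff_delays(**backoff):
        if status != 503:
            break
        sleep(delay)
        status = send()
    return status
```

The jitter keeps clients that were rejected together from all retrying at the same instant.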

rahimnathwani9y ago

Instead of 2 boxes that can handle all your load, why not 4 that can each handle 1/3rd of the load?
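The trade-off in this sub-thread is simple arithmetic, sketched here in Python under the assumption that load spreads evenly over the surviving boxes (function names are made up for the example).

```python
from fractions import Fraction


def required_capacity_per_box(boxes, tolerated_failures):
    """Fraction of total load each box must handle so survivors can carry everything."""
    survivors = boxes - tolerated_failures
    if survivors <= 0:
        raise ValueError("cannot tolerate losing every box")
    return Fraction(1, survivors)


def total_provisioned(boxes, tolerated_failures):
    """Total capacity bought, as a multiple of the actual load."""
    return boxes * required_capacity_per_box(boxes, tolerated_failures)


two_boxes = total_provisioned(2, 1)   # 2: each box sized for 100% of the load
four_boxes = total_provisioned(4, 1)  # 4/3: each box sized for 1/3 of the load
```

With one tolerated failure, two boxes mean paying for 200% of the load while four boxes mean paying for about 133%.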

nurettin9y ago

Backpressure is more common than water across engineering disciplines and it isn't an integrated part of every distributed system out there? Isn't that a bit of an oversight?

phamilton9y ago

Well, it is a core part of TCP.

It's also part of http in a lot of ways. Browsers can make N simultaneous connections to a server. If those requests get queued up on the server side, then the client won't proceed with a new request until one finishes.

The pattern I see all too frequently is asynchronous worker queues. Those can very easily undermine backpressure. The rise of Ruby/Python and their limited concurrency models has placed a taboo on synchronous operations. However, synchronicity naturally lends itself to backpressure.

erikbern9y ago· 7 in thread

"requests per minute" is such a useless unit of measurement. Please always quote request rates per second (i.e. Hz).

Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"

ipozgaj9y ago

Not sure why you are getting downvoted, I came here to make the same comment.

QPM is a useless metric. When talking about distributed systems from an engineering point of view, you always want to use QPS. QPM is simply not fine-grained enough to show whether the traffic is bursty or not. For example, in this particular case, when you say 1M QPM that can mean anything - they might be idle for 50s and then get 100k QPS for the next ten seconds, or they might be getting 15k QPS all the time (as is visible on the graph). Distributed systems are designed for the peak workload, not for the average one. Using misleading numbers like QPM leads to bad design and sizing decisions.

The only case where you would use QPM, QPD and similar metrics is when you want to artificially show your numbers bigger than they are (10M transactions a day sounds better than 115 transactions a second). But those should be used by sales, not by engineers.
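To make the numbers in this comment concrete, here is a small Python sketch comparing the average and peak per-second rate of the bursty pattern described (50 idle seconds, then 100k requests/s for 10 seconds). The traffic shape is the commenter's hypothetical, not measured data.

```python
def per_second_rates(counts_per_second):
    """Return (average, peak) requests per second over the sampled window."""
    total = sum(counts_per_second)
    return total / len(counts_per_second), max(counts_per_second)


# One minute of traffic: idle for 50 s, then 100k requests in each of 10 s.
bursty_minute = [0] * 50 + [100_000] * 10

average_qps, peak_qps = per_second_rates(bursty_minute)
# Both this pattern and a perfectly steady one total 1M requests per minute,
# but the bursty one peaks at 100,000/s against an average of about 16,667/s.

# The "sales" conversion from the last paragraph: 10M per day as a per-second rate.
per_day_as_qps = 10_000_000 / 86_400  # about 115.7
```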

sethammons9y ago

I read it originally as 1M QPS, and thought that was a nice number. It was upon further inspection that I saw it was 1M QPM, and I was no longer intrigued.

hueving9y ago

Here's a cool trick I figured out. If you have something measured in units per minute, you can divide it by 60 to get units per second. I won't even charge you to use the method even though I'm in the process of patenting it.

user59944619y ago

Actually. The conversion doesn't work.

The requests per minute number is an average.

The requests per second number should be given for peak load. That is a very important metric, a system has to be scaled to sustain the peaks, not the average.

We'd need to know the traffic pattern to know the multiplier, that is certainly not 60 :p

corobo9y ago

You can also do it the opposite way if you want less specific numbers. Multiply by 60 and round off for the units per hour!

ceejayoz9y ago

I, like most people, have no idea what a rod or a hogshead is.

The same is hardly true for the conversion of minutes to seconds.

StavrosK9y ago

Such arcane units as "seconds" are only used by three countries in the world, though.

hotdogs9y ago· 6 in thread

"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."

How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?

SikulOP9y ago

Good question. We don't have metrics on the exact number dropped. We're using an earlier version of GenStage that doesn't give any information about dropped events. Once we upgrade we'll have a better idea.

teej9y ago

That seems too important to have zero visibility on to me. Just eyeing the graphs, your queue size grew at 750 msg/s from 17:49 to 17:50. You then started shedding at 17:50 for 40s. Assuming the ingress rate was roughly linear (which it looks like it was) you shed ~30,000 requests out of 3-4M. Does that not seem high to you?

This system seems great for at most once delivery. I wish I had more problems to solve with that constraint.
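The estimate above is a back-of-the-envelope calculation; a Python sketch of it follows. It assumes the growth figure means roughly 750 messages per second and that the excess of ingress over egress stayed constant while shedding, both of which are readings of the comment rather than measured facts.

```python
# Figures read off the comment above (assumptions, not metrics from Discord).
QUEUE_GROWTH_PER_SECOND = 750
SHEDDING_SECONDS = 40


def estimate_shed(growth_per_second, seconds):
    """If ingress exceeds egress by a constant rate, that excess is what gets dropped."""
    return growth_per_second * seconds


def shed_fraction(shed, total_requests):
    """Share of all requests in the window that were dropped."""
    return shed / total_requests


shed = estimate_shed(QUEUE_GROWTH_PER_SECOND, SHEDDING_SECONDS)  # 30,000
low = shed_fraction(shed, 4_000_000)   # 0.75% if 4M requests arrived
high = shed_fraction(shed, 3_000_000)  # 1% if 3M requests arrived
```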

jsjohnst9y ago

Curious why you don't have metrics on the other end (whatever is sending to the "Push Collector")?

What if the Push Collector is down or has a random bug where it throws away XX% of requests for no good reason? How would you know if you don't instrument the other end? Something like StatsD works fantastic for this, but also just logging those failures and using a log search/aggregation tool like Kibana or Splunk would be a step in the right direction.

Matthias2479y ago

There's another important question: How will the clients deal with the fact that they did not get a notification delivered? Will that mean they probably never receive a chat message? That could in some cases be catastrophic for the user. Or would it only mean that they may not get something instantly, which would not be too bad if the client would also poll the server or also try to catch up on notifications on reconnects.

bcherny9y ago

Can you explain why it's necessary that some notifications were dropped?

DougN79y ago

I was wondering the same thing. Dropping an unknown number of requests isn't all that impressive. It seems like a simpler approach would have been to use a Message Queue of some sort with pushers pulling items from the queue.

pwf9y ago· 5 in thread

50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.

Vishnevskiy9y ago

These machines do more than just push. They also buffer messages for each individual user to "potentially" push if they don't read them on the desktop client. This happens before the flow this article talks about.

We currently have 3 machines doing this for millions of concurrent users. At the time of writing this article it was 2 machines.

jsjohnst9y ago

What size machines are these? I'm shocked that this volume is your max handling with Erlang unless you're using a smaller T series AWS instance for this.

jhgg9y ago

At some point, when a system has entered a failure mode for a while, it makes sense to start shedding load, rather than attempting to deliver every single push notification. Also worth mentioning, a minute of downtime is already a million backed up pushes. Beyond that, it becomes infeasible to attempt to deliver them.

Edit: Also worth mentioning, the 50k buffer is for a single server, we run multiple push servers in the cluster.

ramchip9y ago

At 15k notifications per minute, a million notifications would take 1hr to clear before the queue returns to normal. I would imagine they prefer to shed load early so notifications don't get delayed, hence the small buffer.
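The drain-time reasoning here is one division; a Python sketch using the comment's own figures follows. The 50k value is the buffer size discussed elsewhere in this thread, and the drain rate is taken as stated in the comment, not verified against the article.

```python
def drain_minutes(backlog, drain_rate_per_minute):
    """Minutes needed to empty a backlog at a constant drain rate, with no new arrivals."""
    return backlog / drain_rate_per_minute


# A million backed-up notifications at the quoted 15k per minute: about 66.7 minutes.
full_backlog = drain_minutes(1_000_000, 15_000)

# A 50k buffer at the same rate: about 3.3 minutes.
small_buffer = drain_minutes(50_000, 15_000)
```

A smaller buffer caps how stale the oldest queued notification can get, which is the argument for shedding early.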

abrookewood9y ago

The issue was not the ability of their servers to handle the load, but the ability of Firebase to ingest the notifications - at least, that's how I read it.

AgentK209y ago· 4 in thread

Anyone know of any equivalent libraries like GenStage for other languages? (Java, NodeJS, etc)

I'd definitely be able to put to use things like flow limiters and queuing and such, but none of my company's projects use Elixir :(

bpicolo9y ago

ReactiveX seems to have documented notions for it: https://github.com/ReactiveX/RxJava/wiki/Backpressure

Highly recommend the Reactive series of libs. They're typically very well done.

The guy below is right that Akka is perfectly suited.

bhelx9y ago

Akka streams?

wtf_is_up9y ago

There was an initiative not long ago called Reactive Streams which established some common interfaces to build things like this. Back pressure was one of the main concerns.

Some implementations are listed here: http://www.reactive-streams.org/announce-1.0.0

gazarullz9y ago

For Java there's also:

- Project Reactor from Spring

- Reactive Spring (following up with Spring 5.0)

snambi9y ago· 3 in thread

million requests per minute, is this a big deal?

user59944619y ago

16k per second. 83k per second during peak (assuming 80/20 default traffic rule).

- 100 /s = typical limit of a standard web application (python/ruby), per core

- 1.000 /s = typical limit of an application running on a full system

- 10.000 /s = typical limit for fast systems (load balancers, DB, haproxy, redis, tomcat...).

- Over 10.000/s You gotta scale horizontally because a single box [shouldn't] can't take it.

The difficulty depends on the architecture and what the application has to do (dunno, didn't go through the article). If you make something that can scale by just adding more boxes, then it's trivial, just add more boxes. Well, it's gonna cost money and that's about it.

So no. Not a big deal at all... if you've done that before and you've got the experience :D

manigandham9y ago

While 1k/sec seems to be an average throughput for most web apps due to all the logic, 10k/sec is nowhere near the limit for fast systems, many can do well into 6 figures per second with some now doing millions/sec.

manigandham9y ago

Everything is relative. In this case, it's not so much the actual load itself but rather the throttling ability to match the upstream provider's throughput and limitations.

bpicolo9y ago· 2 in thread

I love Discord, and love Elixir too, so this is a pretty great post.

Unfortunate that the final bottleneck was an upstream provider, though it's good that they documented rate limits. I feel like my last attempt to find documented rate limits for GCM/APNS was fruitless, perhaps Firebase messaging has improved that?

chatmasta9y ago

It's not the final bottleneck, it's the first constraint. ;)

bpicolo9y ago

Hah, fair. It's always unfortunate when it's hard to address the real limitation though :)

mevile9y ago· 2 in thread

I spend a lot of time in the PCMR Discord, which is pretty lively. The technology seems to be solid, while the UI has issues (on mobile devices, for example, notifications from half a day ago are really hard to find). Otherwise I'm on Discord every day and love using the service. I miss some Slack features, but the VOIP is very good.

b1naryth1ef9y ago

What features in particular? The most common one we hear is search, which is actually implemented and undergoing internal testing before a public preview soon.

mevile9y ago

It's just what I mentioned. I'll get a notification, and I just can't find where I was notified from. Like on Android, if I click on the notification I would expect it to take me to where the conversation happened where I was notified. It would take a really long time of scrolling to try and find the notification given the volume of discussion that happens. Can I just like click on something to see all my notifications from android, click on them and go to the conversation?

manigandham9y ago· 1 in thread

Akka(.NET) or any actor system is a perfect fit for this and brings the same functionality to other languages and frameworks.

brightball9y ago

Not exactly. Without running on the BEAM you're left with cooperative scheduling (handing back control to the scheduler) of processes instead of preemptive scheduling (the scheduler can stop you).

That makes it possible for one processor-heavy operation to take over and slow down everything else. BEAM ensures that if you have millions of requests coming through and suddenly 1000 4-day-long operations kick off on the machine, the millions of normal, smaller operations continue responding and performing as expected.

Fairly critical for the stability of real time systems.

The other piece here is that these processes are cheaper on the BEAM than any other platform in terms of RAM cost.

0.5 KB / process on the Erlang VM. A goroutine in Golang is the next closest at 2 KB.

The two combined are one of the big reasons why benchmarks don't tell the whole story with Erlang/Elixir. It's harder to measure consistency in the face of bad actors.

sbov9y ago· 1 in thread

Is the number of Push Collectors to Pushers constant or can it vary based upon notification load?

jhgg9y ago

It is constant - but iirc, it'd be trivial to make a dynamically scaling pool. At the end of the day, a pusher is just a TCP connection. Keeping a pool of fixed size and planning capacity around scaling horizontally is a perfectly acceptable approach - given you know the potential throughput for each pusher.

rv119y ago· 1 in thread

Just wondering, what is the difference if I use two kinds of [producer, consumer] message queues (say RabbitMQ) instead of this? Does GenStage being an Erlang system make a difference?

di4na9y ago

RabbitMQ is written in Erlang, so with GenStage you basically get the same thing natively instead of bringing in and configuring a big dependency. It just comes with your language for free, without needing another process, etc.

poorman9y ago

That's awesome and it just goes to show how simple something can be that would otherwise involve a certain degree of concurrent (and distributed) programming.

GenStage has a lot of uses at scale. Even more so is going to be GenStage Flow (https://hexdocs.pm/gen_stage/Experimental.Flow.html). It will be a game changer for a lot of developers.

user59944619y ago

I'd like to say that the official performance unit is the "request per second". And its cousin, the requests per second in peak.

The average per minute only gets to be used because many systems have so little load that the number per second is negligible.

sandGorgon9y ago

How does one achieve this in Celery 4? I remember there was a Celery "batch" contrib module that allowed this kind of batching behavior, but I don't see it in 4.

IOT_Apprentice9y ago

Why not use Kafka for back pressure?

imaginenore9y ago

> "Firebase requires that each XMPP connection has no more than 100 pending requests at a time. If you have 100 requests in flight, you must wait for Firebase to acknowledge a request before sending another."

So... get 100 firebase accounts and blast them in parallel.
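The constraint quoted from the article (at most 100 unacknowledged requests per connection) is a windowing rule. A minimal Python sketch of such a window using `asyncio.Semaphore` follows; `send_all` and the `send` coroutine are hypothetical stand-ins, and this is not how the article's GenStage pipeline implements it.

```python
import asyncio

MAX_IN_FLIGHT = 100  # per-connection limit as quoted from the article


async def send_all(pushes, send, max_in_flight=MAX_IN_FLIGHT):
    """Send every push while keeping at most max_in_flight awaiting an ack.

    send: hypothetical coroutine that resolves once the remote side acks the push.
    """
    window = asyncio.Semaphore(max_in_flight)

    async def send_one(push):
        async with window:            # waits here while the window is full
            return await send(push)   # the slot is released once the ack arrives

    # gather preserves input order in its result list.
    return await asyncio.gather(*(send_one(p) for p in pushes))
```

If acks slow down, callers simply wait for a free slot, which is the backpressure the thread is discussing.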

j / k navigate · click thread line to collapse

161 comments

97 comments · 19 top-level

coverband9y ago· 21 in thread

Quick serious question: How does this company plan to make money? They're surely well funded[1], but what's their end game?

[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."

robryan9y ago

This worries me a bit. At the moment they are providing the software and hosting it all absolutely free.

I would happily still use Discord if they provided the exact same thing with a monthly fee. Hopefully at some point they throw in some extra features for a pro version and start charging.

I just use Discord for gaming and haven't used Slack a lot, but I think Discord will be great for work as soon as they release search.

logicfiction9y ago

chinhodado9y ago

Realistically, the plan is probably getting acquired by something like Youtube Gaming or Twitch to be part of the platform.

HCIdivision179y ago

They're targetting the Twitch model a bit with customizable bits for a charge. Basically allowing some chat branding for guilds and the like.

corobo9y ago

SpaceManiac9y ago

nstj9y ago

[0]: www.discordapp.com

jlarocco9y ago

Something smells very fishy there. I have a hard time believing they got a $30 million investment by pitching stickers, add-on themes, and sound effects...

2 more replies

meddlepal9y ago

chinhodado9y ago

1 more reply

b1naryth1ef9y ago

We've adamantly stated many a time that we will never sell users data, or put ads in the app.

4 more replies

falcolas9y ago

Pure opinion:

> mine that data [...] more addictive [...] games

Maybe that's just a failure of imagination on my part, but I'm ok with that.

ajamesm9y ago

if that kind of information had any value whatsoever, Reddit would be worth billions

0942v86539y ago

Aggregated or Non-identifiable Data: We may also share aggregated or non-personally identifiable information with our partners or others for business purposes.

imglorp9y ago

They're also free international. I bet plenty of governments would pay for a steady feed.

baldfat9y ago

Serious Answer: To Kill IRC for a whole generation

koolba9y ago

I thought that was Slack's plan.

lmm9y ago

You didn't answer the question. How does that make money? If it's about killing IRC, who's paying for that?

jamie_ca9y ago

I sure hope one possible end-game is to re-skin it without the gaming focus and price it competitively to Hipchat/Slack.

lsmarigo9y ago

probably also working on a premium service with a monthly charge to go along with the free tier

mevile9y ago

They're in the gaming market, there's plenty of cash. Even if they just went the advertisement route, which I doubt they will, they should end up doing well.

jondot9y ago· 10 in thread

Hate to be a party pooper, but I'd like to give people here a more generic mental tool to solve this problem.

I've used Java and Storm and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.

You can also do this with Kafka but it'll be less transactional.

After tackling this problem a couple times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.

Vishnevskiy9y ago

Context matters :)

jondot9y ago

metafunctor9y ago

Could you not reach pretty much the same result with a queue, though?

Some messages will not be delivered on time if there is significant backpressure. There is not much you can do about it, apart from avoiding choking yourself.

Perhaps the queue could work with a LIFO policy, to help at least some messages go through in time instead of having most messages delayed to near the expiration threshold.
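
The LIFO-with-expiry idea above can be sketched as a small bounded buffer. This is a Python illustration under stated assumptions: `LifoBuffer`, `ttl` and `max_size` are made-up names, and timestamps passed to `push` are assumed to be non-decreasing.

```python
from collections import deque


class LifoBuffer:
    """Bounded buffer that serves the newest message first and skips expired ones."""

    def __init__(self, ttl, max_size):
        self.ttl = ttl
        # When the deque is full, appending silently evicts the oldest entry.
        self._items = deque(maxlen=max_size)

    def push(self, message, now):
        self._items.append((now, message))

    def pop(self, now):
        """Return the newest unexpired message, or None if nothing usable is left."""
        while self._items:
            created, message = self._items.pop()  # newest first
            if now - created <= self.ttl:
                return message
            # Timestamps are non-decreasing, so everything still buffered
            # is at least as old as this one and has expired as well.
            self._items.clear()
        return None
```

Under sustained backpressure the freshest messages still go out on time, and the ones that would have missed their deadline anyway are the ones discarded.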

di4na9y ago

Knowing that RabbitMQ is written entirely in Erlang, why bring in a really big dependency if you have all the things in your everyday language anyway?

jondot9y ago

An all-encompassing answer would be: the abstractions. Why would you use an operating system if you have your CPU documented and know all of the instruction sets?

gregpardo9y ago

Yeah this... Hey guys you don't need erlang to solve this problem... just use this tool built in erlang.

teacpde9y ago

> You can also do this with Kafka but it'll be less transactional.

Could you explain why using RabbitMQ is more transactional?

chillydawg9y ago

neiled9y ago

Anyone have any nice links to describe more about queue prefetching as described in this case? My google skills are failing because of all the CPU related articles.

jondot9y ago

http://www.rabbitmq.com/consumer-prefetch.html

dimino9y ago· 8 in thread

What is up with Discord? I feel like it's quietly (maybe not so quietly) one of the bigger startups to come out in the last two years.

It seems to have totally taken over a space that wasn't even clearly defined before they got there.

HCIdivision179y ago

Fnoord9y ago

TS supports plugins. No lag issues, can run on your own server. Closed source.

Mumble is open source, no lag issues. Interface is slightly less good than TS. Supports SSL.

Discord, like you say, Just Works (tm). It is very easy to use, the interface is amazing, it's in active development, and setting up a server is free. It also works in the web browser.

If you're into Blizzard games, Battle.net recently added native VoIP in their client. The advantage for a Blizzard gamer is that you don't have to install any 3rd party software.

lifeformed9y ago

> it just works.

j_s9y ago

And the voice chat web interface just works!

bpicolo9y ago

They had a really well defined user-space, marketed at it well, and really nailed the user experience, while still being free for the typical user. There is a lot to love about Discord.

baldfat9y ago

I use Discord every day BUT I seriously prefer IRC with weechat and glowing-bear.org.

I feel like everything done with Discord could be done with IRC in an open source way. IRC for the 21st Century?

Apocryphon9y ago

Discord's design feels a lot like Slack. I wonder where the overlap is between the two products.

Numberwang9y ago

It's just too bad there are a dozen IMs/voice/video and a dozen slacky/feed companies.

jtchang9y ago· 7 in thread

pdexter9y ago

http://ferd.ca/handling-overload.html

This is a good article about overload and back pressure. It also lists some tools in Erlang to solve these sorts of issues. It also mentions genstage (very) briefly.

SikulOP9y ago

This short book by Fred is also a great read about Erlang production systems in general https://www.erlang-in-anger.com/

user59944619y ago

Yes. You need to adapt the capacity of the system to handle the full load with -N- boxes dead.

Corollary: If you have 2 boxes, each of them has to be able to handle all the traffic, so you can't save money by using smaller boxes :D

Corollary #2: If you have 2 datacenters, each of them has to be able to handle all the traffic, so you burn a lot of money :D

xxpor9y ago

Or just the ability to tell your clients to back off.

Hopefully your clients retry server errors with exponential backoff, if you lose a datacenter you can send half of the requests 503 until you're back at a manageable load.

Hopefully detecting the load/generating the 503 is really cheap.
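
The client side of this, retrying 503s with exponential backoff, can be sketched as follows. This is a minimal Python illustration: `call_with_backoff` and its parameters are hypothetical, `request` is any callable returning an HTTP status code, and the random jitter is an added assumption that the comment does not mention.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() until it stops returning 503 or attempts run out."""
    for attempt in range(max_attempts):
        status = request()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # The ceiling doubles on every failure: base, 2*base, 4*base, ...
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Wait a random fraction of the ceiling so clients do not retry in lockstep.
            sleep(rng() * ceiling)
    return 503
```

The `sleep` and `rng` parameters exist only so the schedule can be inspected without actually waiting.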

rahimnathwani9y ago

Instead of 2 boxes that can handle all your load, why not 4 that can each handle 1/3rd of the load?
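
The arithmetic behind the two-box and four-box cases can be written out. This is a small Python sketch; the function names are made up for illustration, and it assumes load is spread evenly across the surviving boxes.

```python
from fractions import Fraction


def required_share_per_box(boxes, tolerated_failures):
    """Fraction of total peak load each box must be able to handle alone."""
    survivors = boxes - tolerated_failures
    if survivors <= 0:
        raise ValueError("need at least one surviving box")
    return Fraction(1, survivors)


def total_provisioned(boxes, tolerated_failures):
    """Total capacity bought, as a multiple of peak load."""
    return boxes * required_share_per_box(boxes, tolerated_failures)
```

With one tolerated failure, 2 boxes each need the full load (2x provisioned in total), while 4 boxes each need 1/3 of it (4/3x in total).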

nurettin9y ago

Backpressure is more common than water across engineering disciplines and it isn't an integrated part of every distributed system out there? Isn't that a bit of an oversight?

phamilton9y ago

Well, it is a core part of TCP.

erikbern9y ago· 7 in thread

"requests per minute" is such a useless unit of measurement. Please always quote request rates per second (i.e. Hz).

Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"

ipozgaj9y ago

Not sure why you are getting downvoted, I came here to make the same comment.

sethammons9y ago

I read it originally as 1M QPS, and thought that was a nice number. It was upon further inspection that I saw it was 1M QPM, and I was no longer intrigued.

hueving9y ago

user59944619y ago

Actually. The conversion doesn't work.

The requests per minute number is an average.

The requests per second number should be given for peak load. That is a very important metric, a system has to be scaled to sustain the peaks, not the average.

We'd need to know the traffic pattern to know the multiplier, that is certainly not 60 :p

corobo9y ago

You can also do it the opposite way if you want less specific numbers. Multiply by 60 and round off for the units per hour!

ceejayoz9y ago

I, like most people, have no idea what a rod or a hogshead is.

The same is hardly true for the conversion of minutes to seconds.

StavrosK9y ago

Such arcane units as "seconds" are only used by three countries in the world, though.

hotdogs9y ago· 6 in thread

"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."

How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?
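
A bounded buffer that counts what it discards would answer this kind of question directly. The Python sketch below is only an illustration of the idea: `DroppingBuffer` is a hypothetical class, and dropping the oldest event (rather than the newest) is an assumed policy, not something the article is quoted as saying here.

```python
from collections import deque


class DroppingBuffer:
    """Fixed-size event buffer that discards the oldest event when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._events = deque()
        self.dropped = 0              # exported as a metric in a real system

    def push(self, event):
        if len(self._events) >= self.max_size:
            self._events.popleft()    # make room by discarding the oldest event
            self.dropped += 1
        self._events.append(event)

    def take(self, demand):
        """Hand out at most `demand` events, oldest first."""
        count = min(demand, len(self._events))
        return [self._events.popleft() for _ in range(count)]
```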

SikulOP9y ago

teej9y ago

This system seems great for at most once delivery. I wish I had more problems to solve with that constraint.

jsjohnst9y ago

Curious why you don't have metrics on the other end (whatever is sending to the "Push Collector")?

Matthias2479y ago

bcherny9y ago

Can you explain why it's necessary that some notifications were dropped?

DougN79y ago

pwf9y ago· 5 in thread

50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.

Vishnevskiy9y ago

We currently have 3 machines doing this for millions of concurrent users. At the time this article was written it was 2 machines.

jsjohnst9y ago

What size machines are these? I'm shocked that this volume is your max handling with Erlang unless you're using a smaller T series AWS instance for this.

jhgg9y ago

Edit: Also worth mentioning, the 50k buffer is for a single server, we run multiple push servers in the cluster.

ramchip9y ago

abrookewood9y ago

The issue was not the ability of their servers to handle the load, but the ability of Firebase to ingest the notifications - at least, that's how I read it.

AgentK209y ago· 4 in thread

Anyone know of any equivalent libraries like GenStage for other languages? (Java, NodeJS, etc)

I'd definitely be able to put to use things like flow limiters and queuing and such, but none of my company's projects use Elixir :(

bpicolo9y ago

ReactiveX seems to have documented notions for it: https://github.com/ReactiveX/RxJava/wiki/Backpressure

Highly recommend the Reactive series of libs. They're typically very well done.

The guy below is right that Akka is perfectly suited.

bhelx9y ago

Akka streams?

wtf_is_up9y ago

There was an initiative not long ago called Reactive Streams which established some common interfaces to build things like this. Back pressure was one of the main concerns.

Some implementations are listed here: http://www.reactive-streams.org/announce-1.0.0

gazarullz9y ago

For Java there's also:

- Project Reactor from Spring

- Reactive Spring (following up with Spring 5.0)

snambi9y ago· 3 in thread

million requests per minute, is this a big deal?

user59944619y ago

16k per second. 83k per second during peak (assuming 80/20 default traffic rule).

- 100 /s = typical limit of a standard web application (python/ruby), per core

- 1.000 /s = typical limit of an application running on a full system

- 10.000 /s = typical limit for fast systems (load balancers, DB, haproxy, redis, tomcat...).

- Over 10.000/s You gotta scale horizontally because a single box [shouldn't] can't take it.

So no. Not a big deal at all... if you've done that before and you've got the experience :D
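
The conversion above is easy to check. In this Python sketch the function names are illustrative, and the peak multiplier uses one common reading of the 80/20 rule (80% of traffic arriving in 20% of the time), which is an assumption about what the comment meant.

```python
def per_second(per_minute):
    """Convert an average per-minute rate to an average per-second rate."""
    return per_minute / 60


def peak_multiplier(traffic_share, time_share):
    """Ratio of the busy-period rate to the overall average rate."""
    return traffic_share / time_share


average = per_second(1_000_000)              # about 16,667 requests per second
peak = average * peak_multiplier(0.8, 0.2)   # 4x the average under this reading
```

Under that reading the peak works out to roughly 67k per second; the 83k figure in the comment corresponds to a 5x multiplier instead.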

manigandham9y ago

manigandham9y ago

Everything is relative. In this case, it's not so much the actual load itself but rather the throttling ability to match the upstream provider's throughput and limitations.

bpicolo9y ago· 2 in thread

I love Discord, and love Elixir too, so this is a pretty great post.

chatmasta9y ago

It's not the final bottleneck, it's the first constraint. ;)

bpicolo9y ago

Hah, fair. It's always unfortunate when it's hard to address the real limitation though :)

mevile9y ago· 2 in thread

b1naryth1ef9y ago

What features in particular? The most common one we hear is search, which is actually implemented and undergoing internal testing before a public preview soon.

mevile9y ago

manigandham9y ago· 1 in thread

Akka(.NET) or any actor system is a perfect fit for this and brings the same functionality to other languages and frameworks.

brightball9y ago

Not exactly. Without running on the BEAM you're left with cooperative scheduling (processes hand control back to the scheduler) instead of preemptive scheduling (the scheduler can stop you).

Fairly critical for the stability of real time systems.

The other piece here is that these processes are cheaper on the BEAM than any other platform in terms of RAM cost.

0.5 KB per process on the Erlang VM. A goroutine in Go is the next closest at 2 KB.

The two combined are one of the big reasons why benchmarks don't tell the whole story with Erlang/Elixir. It's harder to measure consistency in the face of bad actors.
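
The cooperative-scheduling concern above can be seen in miniature with Python's asyncio, which is itself a cooperative scheduler. This is an analogy only, not a model of Akka or the BEAM; `worker` and `run` are made-up names.

```python
import asyncio


async def worker(name, log, steps, yields):
    """Append `steps` entries to log, optionally yielding after each one."""
    for i in range(steps):
        log.append(f"{name}{i}")
        if yields:
            await asyncio.sleep(0)   # hand control back to the event loop


async def run(first_yields):
    """Run workers "a" and "b" together; only "a" may refuse to yield."""
    log = []
    await asyncio.gather(
        worker("a", log, 3, first_yields),
        worker("b", log, 3, True),
    )
    return log
```

`asyncio.run(run(True))` interleaves the two workers, while `asyncio.run(run(False))` lets "a" finish all of its steps before "b" gets to run at all, because nothing can interrupt a task that never yields. A preemptive scheduler removes that dependency on every task behaving well.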

sbov9y ago· 1 in thread

Is the number of Push Collectors to Pushers constant or can it vary based upon notification load?

jhgg9y ago

rv119y ago· 1 in thread

just wondering, what is the difference if I use two kinds of [producer, consumer] message queues (say RabbitMQ) instead of this? Does GenStage being an Erlang system make a difference?
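
For context on the question, the core of the demand-driven model is that the consumer asks for a number of events and the producer never sends more than that. The Python sketch below is a loose analogy: `handle_demand` borrows the name of the GenStage callback, but the classes and the synchronous loop are illustrative assumptions, not GenStage's API.

```python
from itertools import islice


class Producer:
    """Holds a stream of events and releases them only on request."""

    def __init__(self, events):
        self._events = iter(events)

    def handle_demand(self, demand):
        """Return at most `demand` events; never more than was asked for."""
        return list(islice(self._events, demand))


def run_consumer(producer, handle, max_demand):
    """Pull batches of up to max_demand events until the producer runs dry."""
    processed = 0
    while True:
        batch = producer.handle_demand(max_demand)  # the consumer sets the pace
        if not batch:
            return processed
        handle(batch)
        processed += len(batch)
```

Because the consumer only asks again after `handle` returns, a slow downstream naturally slows the whole pipeline instead of letting work pile up in front of it.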

di4na9y ago

poorman9y ago

That's awesome and it just goes to show how simple something can be that would otherwise involve a certain degree of concurrent (and distributed) programming.

GenStage has a lot of uses at scale. Even more so is going to be GenStage Flow (https://hexdocs.pm/gen_stage/Experimental.Flow.html). It will be a game changer for a lot of developers.

user59944619y ago

I'd like to say that the official performance unit is the "request per second". And its cousin, the requests per second in peak.

The average per minute only gets to be used because many systems have so little load that the number per second is negligible.

sandGorgon9y ago

How does one achieve this in Celery 4? I remember there was a Celery "batch" contrib module that allowed this kind of batching behavior, but I don't see that in 4.

IOT_Apprentice9y ago

Why not use Kafka for back pressure?

imaginenore9y ago

So... get 100 firebase accounts and blast them in parallel.
