Ask HN: How do you handle the WebSocket reconnect problem?

9 points · millon · 3y ago · 8 comments

There are a few examples on HN of how to handle 1M to 10M WebSocket connections, so I think that part is a solved problem. Most of my connections will be idle most of the time.

The real problem now is making this production-ready. Once we add TLS, establishing new connections becomes very slow. I think each core can handle only a few hundred new TLS connections per second. Reconnects can be faster.

How did you solve the TLS with websocket problem? What happens when 1 million connections get disconnected and try to reconnect at the same time? What is your reconnection rate per core?
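To put a rough number on the reconnect question: the sketch below is only back-of-envelope arithmetic. The 300 handshakes per second per core is one reading of "a few hundred", and the 16-core server is a made-up figure, so treat both as assumptions to replace with your own benchmarks.

```typescript
// Back-of-envelope estimate of how long a reconnect storm takes to drain.
// Every input here is an assumption, not a measurement.
function drainSeconds(
    clients: number,                 // clients trying to reconnect
    handshakesPerCorePerSec: number, // new TLS connections one core can accept per second
    cores: number,                   // cores available for handshakes
): number {
    return clients / (handshakesPerCorePerSec * cores);
}

// 1,000,000 clients / (300 per core per second * 16 cores) is roughly 208 seconds.
const seconds = drainSeconds(1_000_000, 300, 16);
console.log(`${seconds.toFixed(1)} s to re-admit everyone at full speed`);
```

That figure assumes the server runs at its handshake limit the whole time and that clients do not retry early, so it is a lower bound rather than a prediction.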

8 comments · 5 top-level

austin-cheney · 3y ago · 3 in thread

I wrote my own WebSocket library so that I could integrate authentication into the connection handshake. Since I have my own library, I also wrote conventions to automatically attempt reconnects at 15-second intervals when the connection drops.

https://github.com/prettydiff/share-file-systems/blob/master...

jtokoph · 3y ago

You may want to consider adding a random jitter and backoff to the reconnect logic. Otherwise, if the server goes down and comes back up, all clients will reconnect at the same time and overload the server.

dangitnotagain · 3y ago

This is great! The whole project is useful. Thanks!

hayst4ck · 3y ago

Look into exponential backoff with jitter. 15 second retry logic sounds like bad news.

mindcrash · 3y ago

Instead of letting clients directly interface with your services over websockets, consider using Pushpin [1], which allows you to completely isolate realtime communication from your services.

As a bonus, it also lets you cycle (redeploy/restart) your services without your clients having to reconnect (that's where the name comes from). And as you can imagine, because communication with your services is entirely stateless, it scales like crazy.

[1] https://pushpin.org/
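For a feel of what the stateless backend side can look like, here is a minimal sketch based on my reading of Pushpin's publish format: the backend sends an HTTP POST with a JSON list of items, and the proxy relays each item to the clients subscribed to that channel. The publish URL (the default local port as I recall it), the channel name, and the injected `post` helper are assumptions for illustration, so check the Pushpin docs before relying on any of them.

```typescript
// Sketch of a backend publishing a WebSocket message through Pushpin.
// URL, channel name, and the `post` helper are illustrative assumptions.

interface PublishBody {
    items: Array<{
        channel: string;
        formats: { "ws-message": { content: string } };
    }>;
}

// Builds the JSON body for one text message on one channel.
function buildPublishBody(channel: string, text: string): PublishBody {
    return {
        items: [{ channel, formats: { "ws-message": { content: text } } }],
    };
}

// Hypothetical HTTP helper, injected so the sketch has no dependencies.
type HttpPost = (url: string, jsonBody: string) => Promise<void>;

// The backend holds no client connections: it only POSTs to the proxy.
async function publishToChannel(
    post: HttpPost,
    channel: string,
    text: string,
    publishUrl = "http://localhost:5561/publish/", // assumed default
): Promise<void> {
    await post(publishUrl, JSON.stringify(buildPublishBody(channel, text)));
}
```

Because the backend never owns a socket, restarting it does not disconnect anyone, which is the property described above.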

hayst4ck · 3y ago

> What happens when 1 million connections get disconnected and try to reconnect at the same time?

I think the distributed-systems term for your problem is the 'thundering herd problem', so a search such as "thundering herd websockets" would likely be fruitful.

From a reliability perspective, implement exponential back-off on the client that includes jitter. This is a core necessity in all clients. I only skimmed this article, but it looked right: https://aws.amazon.com/blogs/architecture/exponential-backof...
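A minimal sketch of what that can look like on the client, using the "full jitter" variant (a random delay between zero and an exponentially growing ceiling). The base delay, cap, attempt limit, and the `connect` callback are all assumptions for illustration, not values taken from any particular library.

```typescript
// Sketch of exponential backoff with full jitter for a reconnecting client.
// All names and default values here are illustrative assumptions.

interface BackoffOptions {
    baseMs: number; // delay ceiling for the first retry
    capMs: number;  // upper bound on any single delay
}

// Delay before retry number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
    attempt: number,
    opts: BackoffOptions = { baseMs: 1000, capMs: 60000 },
    random: () => number = Math.random,
): number {
    // Exponential ceiling: base * 2^attempt, clamped to the cap.
    const ceiling = Math.min(opts.capMs, opts.baseMs * 2 ** attempt);
    // Full jitter: pick uniformly in [0, ceiling) so clients spread out.
    return Math.floor(random() * ceiling);
}

// `connect` is a hypothetical function that resolves once the socket is
// open and rejects on failure. Returns how many attempts failed first.
async function connectWithBackoff(
    connect: () => Promise<void>,
    maxAttempts = 10,
    sleep: (ms: number) => Promise<void> = (ms) =>
        new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            await connect();
            return attempt;
        } catch {
            await sleep(backoffDelayMs(attempt));
        }
    }
    throw new Error(`gave up after ${maxAttempts} attempts`);
}
```

The randomness matters as much as the exponent: without it, every client that dropped at the same moment retries at the same moment too, just less often.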

When Signal had outages from the increased load during the WhatsApp exodus, it was due to this not being implemented in their clients.

Additionally, consider your load balancing architecture. If one machine goes down, do all reconnects go to that machine, or do the reconnects get distributed to all the machines? Can you administratively drain a machine? Can you quickly allocate some spare capacity?

Lastly, you can get into situations where your entire infrastructure is overloaded. You will need a throttling mechanism. That throttling mechanism can work together with your load balancer or client. If you benchmark your server and it can only handle 500 concurrent re-connections, then that is a hard limit you can enforce with fail-fast behavior.
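One way to sketch that fail-fast limit on the server: a counter of in-flight handshakes that refuses new ones immediately once the benchmarked limit is reached, instead of queueing them. The class and function names, the retry hint, and the idea of returning it to the client are assumptions for illustration.

```typescript
// Sketch of fail-fast admission control for expensive (re)connect handshakes.
// Names and defaults are illustrative; measure your own limit.

class HandshakeLimiter {
    private inFlight = 0;
    private readonly maxConcurrent: number;

    constructor(maxConcurrent: number) {
        this.maxConcurrent = maxConcurrent;
    }

    // Reserve a slot, or return false immediately when full so the
    // caller can reject the client instead of queueing it.
    tryAcquire(): boolean {
        if (this.inFlight >= this.maxConcurrent) return false;
        this.inFlight++;
        return true;
    }

    // Free a slot once the handshake has completed or failed.
    release(): void {
        if (this.inFlight > 0) this.inFlight--;
    }

    get active(): number {
        return this.inFlight;
    }
}

type Admission =
    | { accepted: true }
    | { accepted: false; retryAfterSeconds: number };

// `handshake` is a hypothetical function that performs the costly work.
async function admit(
    limiter: HandshakeLimiter,
    handshake: () => Promise<void>,
    retryAfterSeconds = 5,
): Promise<Admission> {
    if (!limiter.tryAcquire()) {
        return { accepted: false, retryAfterSeconds };
    }
    try {
        await handshake();
        return { accepted: true };
    } finally {
        limiter.release();
    }
}
```

With `new HandshakeLimiter(500)`, the 501st simultaneous handshake is turned away at once, and a client that backs off with jitter will simply try again later.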

Summary:

  Clients implemented with exponential backoff and jitter
  Load balancer architecture
  Defensive "fail fast" throttling or ability to administratively throttle.

gizmo · 3y ago

You can have a separate load check endpoint (doesn’t even need TLS) that clients query to see whether they can go ahead with a WebSocket (re-)connect. The load status check can be served directly by the web server from memory, and the connection can be closed immediately after responding, so it’s super fast.

And if the servers are so overloaded that the load level endpoint fails to respond? That’s fine, because that’s an answer too.
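A rough client-side sketch of that check, where a timeout or error is treated exactly like an "overloaded" reply. The outcome shape, the 0 to 1 load figure, the 0.8 threshold, the 2-second deadline, and the `fetchLoad` callback are all assumptions for illustration.

```typescript
// Sketch of a client-side gate in front of the WebSocket (re)connect.
// Outcome shape, load scale, and thresholds are illustrative assumptions.

type ProbeOutcome =
    | { kind: "answered"; load: number } // load in [0, 1], as reported by the server
    | { kind: "no-response" };           // timeout or network error

// No response counts as "overloaded": silence is an answer too.
function shouldReconnect(outcome: ProbeOutcome, maxLoad = 0.8): boolean {
    if (outcome.kind === "no-response") return false;
    return outcome.load <= maxLoad;
}

// Runs a hypothetical `fetchLoad` (e.g. a plain HTTP GET) against a
// deadline and folds errors and timeouts into "no-response".
async function probeLoad(
    fetchLoad: () => Promise<number>,
    timeoutMs = 2000,
): Promise<ProbeOutcome> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<ProbeOutcome>((resolve) => {
        timer = setTimeout(() => resolve({ kind: "no-response" }), timeoutMs);
    });
    const answer = fetchLoad().then(
        (load): ProbeOutcome => ({ kind: "answered", load }),
        (): ProbeOutcome => ({ kind: "no-response" }),
    );
    try {
        return await Promise.race([answer, deadline]);
    } finally {
        clearTimeout(timer); // safe even if the timer already fired
    }
}
```

A client would call `probeLoad`, pass the result to `shouldReconnect`, and on `false` wait out another backoff interval before probing again.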

joshxyz · 3y ago

have you tried https://github.com/uNetworking/uWebSockets.js/
