Graceful server restart with Go (opens in new tab)

(blog.appsdeck.eu)

119 points_Soulou11y ago49 comments

49 comments

36 comments · 16 top-level

jpollock11y ago· 3 in thread

Process upgrades are a variant of fail-over on either hardware death or bug. I recommend treating upgrades as a chance to test your failure recovery processes.

If you really can't afford someone getting a "connection refused" what happens when the machine's network connection dies?

_SoulouOP11y ago

You're right, but in some cases, for example a 'git push' to deploy an application, we don't want the SSH connection to be cut when upgrading our SSH server, this is not an option. If the server crashes the current connections will be lost, but it should be the only scenario when this occures.

laumars11y ago

You can cut an SSH connection while SSHed in providing the server comes back up in a timely manner as SSH does some clever keep alive tricks to resume broken connections. I often restarted sshd or even the network interfaces themselves while SSHed into some Linux boxes.

1 more reply

alexk11y ago

For us it's about not dropping your customers requests when we can, obviously that happens sometimes, but there's no need to do that if you can avoid it during normal deployment cycles.

finnh11y ago· 3 in thread

Questions from someone who doesn't use Go:

1. Won't this leave the parent process running until the child completes? And, if you do this again & again, won't that stack up a bunch of basically dead parent processes? Maybe I'm misunderstanding how parent/child relatioships work with ForkExec

2. What if you want the command-line arguments to change for the new process?

3. In addressing (2), in general would it be simpler to omit the parent-child relationship with a wrapper program? The running (old) process can write its listener file descriptor to a file, similar to how it is done here, and the wrapper reads that file & sets an environment variable (or cmd-line argument) telling the new process?

The wrapper could be used for any server process which adheres to a simple convention:

on startup, re-use a listener FD if provided (via env or cmd line ... or ./.listener)

once listening, write your listener FD to well-known file (./.listener)

on SIGTERM, stop processing new connections but don't close the listener (& exit after waiting for current connections to close, obvi)

4. Am i the only one who finds "Add(1)/Done()" to be an odd naming convention? I might go with "Add(1)/Add(-1)" instead just for readability

_SoulouOP11y ago

Hi finnh, post author here,

1. When the parent process has finished handling its connections, it just exits. The children are then considered as 'orphans' and are automatically attached to the init process. When you run your service as a daemon, that's exactly what you want, so you don't have a huge stack of processes. 2. I used syscall.ForkExec(os.Args[0], os.Args […]), but I could changed the string array os.Args by anything I want to change the arguments. 3. It could be a way to do it, it would also work, but it is not the choice we have done. 4. It may look a bit weird, but it's part of the language, you get used to it really quickly ;-)

alexk11y ago

1. No if you don't call wait on child. Make sure to collect the status though in case if child has crashed for some reason otherwise you will end up with zombie:

  syscall.Wait4(-1, &wait, syscall.WNOHANG, nil)

I would also recommend to use a higer level StartProcess instead

2. You can pass any arguments when starting a child, or even execute completely different binary:

  p, err := os.StartProcess(path, os.Args, &os.ProcAttr{
    Dir: wd,
    Env: os.Environ(),
    Files: files,
    Sys: &syscall.SysProcAttr{},
  })

rcconf11y ago

4. I find Add(-1) to be less readable. .Subtract(1) vs .Add(-1)? How do you find Add(-1) to be more readable?

I've also written an implementation of a very similar pattern in Node (wait for a set of asynchronous things to complete) and I've used Add() and Signal(), never Add(-1)

steakejjs11y ago· 3 in thread

So I wrote a golang application and it runs behind nginx. My server "restart" when I want to push new code is,

Re run my program on a different port, point nginx at the new port, reload nginx, kill the old.

Curious what is so bad about this approach? I admit it's hacky, but it works. Is there just too many things to do?

blakecaldwell11y ago

Similar to what we do. We have two instances of our services behind HAProxy. We drain and kill one instance, then restart or upgrade it, add it back to the pool, then do the same for the other. Nothing hacky about it, and it's language/framework independent.

alexk11y ago

There's nothing wrong with this approach, in either way you have to ensure that there are no active requests in flight when restarting - you can do this either from nginx side, server side or both.

_SoulouOP11y ago

Actually, we are not only using this for HTTP services, but also for our SSH server (the one I've introduced last week). This method was mandatory for us!

Edit: typo

zimbatm11y ago· 2 in thread

This is something the process manager should handle. With this approach each language and program has to implement the fd-passing and restart coordination. It also doesn't integrate really well with systemd/upstart because they want a stable pid.

That's why I wrote socketmaster[1], it's simple enough that it doesn't need to change and it's the one handling the socket and passing it to your program. I haven't had to touch it for years now.

For my current work I wrote crank[2], a refinement of that idea. It's a bit more complex but allows to coordinate restarts. It implements a subset of the systemd socket activation protocol conventions. All your program has to do is look for a LISTEN_FDS environment variable to find the bound file descriptor and send a "READY" message on the NOTIFY_FD file descriptor when it's ready to accept new connection. Only then will crank shutdown the old process.

* [1]: https://github.com/zimbatm/socketmaster * [2]: https://github.com/pusher/crank

mnutt11y ago

For many workloads that's fine, but it sounds like some people have found that using the OS socket handling doesn't distribute very evenly among processes in some cases: http://strongloop.com/strongblog/whats-new-in-node-js-v0-12-...

Edit: more concise explanation, in the context of SO_REUSEPORT: http://lwn.net/Articles/542718/

zimbatm11y ago

Yes it's true that socket sharing is not optimal for load-balancing scenarios. Here you would use 1 crank/socketmaster process per process that you want to run and a haproxy in front. Their goal is not to do load balancing but to not disrupt existing (web)socket connections when a new process is started.

alexk11y ago· 2 in thread

Here's the library that implements this pattern:

https://github.com/gwatts/manners

And Mailgun's fork that supports passing file descriptors between processes:

https://github.com/mailgun/manners

_SoulouOP11y ago

Hi alexk, post author here,

The 'manners' package only enables graceful shutdown in a HTTP server, there is still work to be done to restart it gracefully, that's what I'm trying to show in the article.

alexk11y ago

Great article btw!

That's why I've added missing methods here:

https://github.com/mailgun/manners

Getting files from listener:

https://github.com/mailgun/manners/blob/master/listener.go#L...

Starting server with external listener:

https://github.com/mailgun/manners/blob/master/server.go#L87

It's used to restart Vulcand without downtime:

https://github.com/mailgun/vulcand/blob/master/service/servi...

Let's collaborate on this as a library if you are interested

fjordan11y ago· 2 in thread

Goagain by Richard Crowley is a great package that we are using for graceful restarts: https://github.com/rcrowley/goagain

EDIT: added author

alexk11y ago

I found that it corrupts the FD by using this call:

https://github.com/rcrowley/goagain/blob/master/goagain.go#L...

Not sure why it happens though, but it led to all sorts of strange intermittent issues with broken connections.

Once I replaced this logic with passing files using GetFile().Fd() instead it started working fine, so beware of this. I still wonder why it happens though.

fjordan11y ago

Thanks for this.

Were you able to publish your changes either on a fork or in a PR?

1 more reply

agwa11y ago· 2 in thread

Does anyone know what's going on with this line:

> file := os.NewFile(3, "/tmp/sock-go-graceful-restart")

What's with that filesystem path, which isn't referenced anywhere else, and which should be unnecessary because the file descriptor 3 is inherited when the process starts?

peterwaller11y ago

Given a file descriptor, this returns an os.File [1] (it's analogous to a `FILE` from C-land). Which among other things, knows its name [2]. I guess it's hard to in a portable way go from a file descriptor to filename, so it's a parameter to `NewFile`.

[1] http://golang.org/pkg/os/#File [2] http://golang.org/pkg/os/#File.Name

agwa11y ago

Thanks. Indeed, it's not generally possible to go from a file descriptor to filename, because not all file descriptors refer to files with names (e.g. sockets). So I guess the second argument in this case is just a dummy value that is necessary because the Go library doesn't have a way to create a nameless "file" from a file descriptor.

the_mitsuhiko11y ago· 2 in thread

You can just use systemd for this. Open the FD in systemd and then pass it into the process.

_SoulouOP11y ago

The less I rely on the operating system environment, the best it is. Go provides static binaries, easy to deploy. I don't want to get hooked to systemd. (even more if my server can handle the use case with a few lines of code.)

acdha11y ago

Notice how the discussion is full of edge-cases which this code doesn't handle? That's the argument for letting the operating system handle it unless this is such a key part of what you do that it's worth taking on that expense personally.

1 more reply

dividuum11y ago· 1 in thread

I wonder if there's any library out there that uses the SO_REUSEPORT option (see http://lwn.net/Articles/542629/). It allows multiple programs to bind and accept on the same port. So I guess it should be possible to just start a second new process and then gracefully terminate the old one. Any thoughts?

zimbatm11y ago

You still want some form of coordination to only stop the old process when the new one is live. The best place to be to know about the life of a process is being it's parent process (because you can call wait(pid)) and in that case you might as well open the socket and pass it's fd during fork/exec. That way you keep the cross-POSIX compatibility.

gtrubetskoy11y ago

exec.Command() is a more elegant approach, I've written about this back in June, my write-up was specific to an HTTP server: http://grisha.org/blog/2014/06/03/graceful-restart-in-golang...

I think the article also misses an important step - you need to let the new process to initialize itself (e.g. read its config files, connect to db, etc), and then signal the parent that it is ready to accept connections, only at which point the parent stops accepting. The important point here is that the child may fail to init, in which case the parent should carry on as if nothing happened.

jcrites11y ago

Interesting technique! I can see this being useful in applications that are single points of failure. In redundant systems, however, I have found it quite effective and generally prefer to solve this problem upstream of the application, in the load balancer, by routing traffic around machines during each machine's deployment.

First step of a deployment: shift traffic away from the machine, while allowing outstanding requests to complete gracefully. Next you can install new software or undertake any upgrade actions in isolation. This way any costs involved in the deployment don't impair the performance of real traffic. Bring the new version up (and prewarm if necessary). Finally, direct the load balancer to resume traffic. We call the general idea "bounce deployments", as a feature of the deployment engine.

Two advantages of having a general-purpose LB solution:

(1) You can apply it to any application or protocol, regardless of whether the server supports this type of socket handoff. Though to be fair, some protocols are more difficult to load balance than others - but most can be done, with some elbow grease (even SSH).

(2) It's possible to run smoke tests and sanity tests against the new app instance, such that you can abort bad deployments with no impact. Our deployment system has a hook for sanity tests to be run against a service after it comes up. These can verify its function before the instance is put back into the LB, and are sometimes used to warm up caches. If you view defects and bad deployments as inevitable, then the ability to "reject" a new app version in production with no outage impact is a great safety net. With the socket handover, your new server must function perfectly, immediately, or else the service is impaired. (Unless you keep the old version running and can hand the socket back?)

(By LB I don't necessarily mean a hardware LB. A software load balancer suffices as well - or any layer acting as a reverse proxy with the ability to route traffic away from a server automatically.)

A technique like this would also be useful for implementing single-points like load balancers or databases, so that they can upgrade without outage. Though failover or DNS flip is usually also an option.

christop11y ago

I've played with Einhorn from Stripe, which works pretty nicely for graceful restarts too:

https://stripe.com/blog/meet-einhorn

https://github.com/stripe/go-einhorn

thinxer11y ago

Here's the `grace` package from Facebook: https://github.com/facebookgo/grace

iffycan11y ago

Here's a way to do it for any language: https://github.com/iffy/grace

wbeckler11y ago

Does anyone know if there is a way of doing the same thing in node.js?

teabee8911y ago

I really enjoyed reading this article. Thanks!

j / k navigate · click thread line to collapse

49 comments

36 comments · 16 top-level

jpollock11y ago· 3 in thread

Process upgrades are a variant of fail-over on either hardware death or bug. I recommend treating upgrades as a chance to test your failure recovery processes.

If you really can't afford someone getting a "connection refused" what happens when the machine's network connection dies?

_SoulouOP11y ago

laumars11y ago

1 more reply

alexk11y ago

For us it's about not dropping your customers requests when we can, obviously that happens sometimes, but there's no need to do that if you can avoid it during normal deployment cycles.

finnh11y ago· 3 in thread

Questions from someone who doesn't use Go:

2. What if you want the command-line arguments to change for the new process?

The wrapper could be used for any server process which adheres to a simple convention:

on startup, re-use a listener FD if provided (via env or cmd line ... or ./.listener)

once listening, write your listener FD to well-known file (./.listener)

on SIGTERM, stop processing new connections but don't close the listener (& exit after waiting for current connections to close, obvi)

4. Am i the only one who finds "Add(1)/Done()" to be an odd naming convention? I might go with "Add(1)/Add(-1)" instead just for readability

_SoulouOP11y ago

Hi finnh, post author here,

alexk11y ago

1. No if you don't call wait on child. Make sure to collect the status though in case if child has crashed for some reason otherwise you will end up with zombie:

  syscall.Wait4(-1, &wait, syscall.WNOHANG, nil)

I would also recommend to use a higer level StartProcess instead

2. You can pass any arguments when starting a child, or even execute completely different binary:

  p, err := os.StartProcess(path, os.Args, &os.ProcAttr{
    Dir: wd,
    Env: os.Environ(),
    Files: files,
    Sys: &syscall.SysProcAttr{},
  })

rcconf11y ago

4. I find Add(-1) to be less readable. .Subtract(1) vs .Add(-1)? How do you find Add(-1) to be more readable?

I've also written an implementation of a very similar pattern in Node (wait for a set of asynchronous things to complete) and I've used Add() and Signal(), never Add(-1)

steakejjs11y ago· 3 in thread

So I wrote a golang application and it runs behind nginx. My server "restart" when I want to push new code is,

Re run my program on a different port, point nginx at the new port, reload nginx, kill the old.

Curious what is so bad about this approach? I admit it's hacky, but it works. Is there just too many things to do?

blakecaldwell11y ago

alexk11y ago

There's nothing wrong with this approach, in either way you have to ensure that there are no active requests in flight when restarting - you can do this either from nginx side, server side or both.

_SoulouOP11y ago

Actually, we are not only using this for HTTP services, but also for our SSH server (the one I've introduced last week). This method was mandatory for us!

Edit: typo

zimbatm11y ago· 2 in thread

That's why I wrote socketmaster[1], it's simple enough that it doesn't need to change and it's the one handling the socket and passing it to your program. I haven't had to touch it for years now.

* [1]: https://github.com/zimbatm/socketmaster * [2]: https://github.com/pusher/crank

mnutt11y ago

Edit: more concise explanation, in the context of SO_REUSEPORT: http://lwn.net/Articles/542718/

zimbatm11y ago

alexk11y ago· 2 in thread

Here's the library that implements this pattern:

https://github.com/gwatts/manners

And Mailgun's fork that supports passing file descriptors between processes:

https://github.com/mailgun/manners

_SoulouOP11y ago

Hi alexk, post author here,

The 'manners' package only enables graceful shutdown in a HTTP server, there is still work to be done to restart it gracefully, that's what I'm trying to show in the article.

alexk11y ago

Great article btw!

That's why I've added missing methods here:

https://github.com/mailgun/manners

Getting files from listener:

https://github.com/mailgun/manners/blob/master/listener.go#L...

Starting server with external listener:

https://github.com/mailgun/manners/blob/master/server.go#L87

It's used to restart Vulcand without downtime:

https://github.com/mailgun/vulcand/blob/master/service/servi...

Let's collaborate on this as a library if you are interested

fjordan11y ago· 2 in thread

Goagain by Richard Crowley is a great package that we are using for graceful restarts: https://github.com/rcrowley/goagain

EDIT: added author

alexk11y ago

I found that it corrupts the FD by using this call:

https://github.com/rcrowley/goagain/blob/master/goagain.go#L...

Not sure why it happens though, but it led to all sorts of strange intermittent issues with broken connections.

Once I replaced this logic with passing files using GetFile().Fd() instead it started working fine, so beware of this. I still wonder why it happens though.

fjordan11y ago

Thanks for this.

Were you able to publish your changes either on a fork or in a PR?

1 more reply

agwa11y ago· 2 in thread

Does anyone know what's going on with this line:

> file := os.NewFile(3, "/tmp/sock-go-graceful-restart")

What's with that filesystem path, which isn't referenced anywhere else, and which should be unnecessary because the file descriptor 3 is inherited when the process starts?

peterwaller11y ago

[1] http://golang.org/pkg/os/#File [2] http://golang.org/pkg/os/#File.Name

agwa11y ago

the_mitsuhiko11y ago· 2 in thread

You can just use systemd for this. Open the FD in systemd and then pass it into the process.

_SoulouOP11y ago

acdha11y ago

1 more reply

dividuum11y ago· 1 in thread

zimbatm11y ago

gtrubetskoy11y ago

exec.Command() is a more elegant approach, I've written about this back in June, my write-up was specific to an HTTP server: http://grisha.org/blog/2014/06/03/graceful-restart-in-golang...

jcrites11y ago

Two advantages of having a general-purpose LB solution:

(By LB I don't necessarily mean a hardware LB. A software load balancer suffices as well - or any layer acting as a reverse proxy with the ability to route traffic away from a server automatically.)

christop11y ago

I've played with Einhorn from Stripe, which works pretty nicely for graceful restarts too:

https://stripe.com/blog/meet-einhorn

https://github.com/stripe/go-einhorn

thinxer11y ago

Here's the `grace` package from Facebook: https://github.com/facebookgo/grace

iffycan11y ago

Here's a way to do it for any language: https://github.com/iffy/grace

wbeckler11y ago

Does anyone know if there is a way of doing the same thing in node.js?

teabee8911y ago

I really enjoyed reading this article. Thanks!

j / k navigate · click thread line to collapse