You CAN do hot upgrades, but many times the complexity of doing so far outweighs the benefits. For any non trivial app it makes updates/deploys to the app non trivial as well.
If you're using distribution, there's a good chance you need to deal with messages from both updated and not updated nodes anyway, so handling different shapes from different versions is a skill you already need to have; it's just hot load exposes it at more layers than previously.
Since working in Erlang for several years, when I have to work in languages where hot loading is not easy or common, it's always frustrating. Throwing away connection attached state to update code is a hard choice that doesn't need to be.
I have no experience with Erlang/Elixir myself, so please excuse me if this is a silly question – but, why can't the language/platform actually detect the difference in struct shape, and refuse to deploy the new version in that case? Or, demand that you provide some sort of conversion function that maps the old struct to the new one? Is this a consequence of lack of static typing?
The language standard library gives you hooks to relup the internal state of all of your processes: https://hexdocs.pm/elixir/GenServer.html#c:code_change/3
What it doesn't do is give you a hook to relup the messages that get passed between processes. If you do it right, you can pattern match against different versions of messages, and do the correct conversion function, but there are liable to be many, many, shapes of messages (including ones that you might not write yourself, e.g. coming from a library) and it might be difficult to catch them all.
To do it right, I would want to set up a lot of testing to make sure you can hot code reload safely. There is no specific guideline, and it probably hurts forward progress in developing guidelines, that generally, it's okay to have some downtime in any individual server node as erlang/elixir encourages you to think about failover anyways, and most elixir apps are relatively stateless webservers, so you've probably got robust load balancing and migration scheme in place in your cluster to begin with, making blue/green or rolling updates a "pretty much good enough" thing.
> Is this a consequence of lack of static typing
No.
Digging deeper the line from folks using erlang/elixir "in anger" was always that it was a supported feature but the reality is that most people shouldn't do it and wouldn't need to for the hassle it has.
That "separate mechanism" could be as simple as a plug that tells Mix to recompile and reload. Example (designed for - and part of - the Sugar framework, but should theoretically work for any Plug-based app, including Phoenix-based ones): https://github.com/sugar-framework/plugs/blob/master/lib/sug...
Hot upgrades like this are not foolproof (which is likely the reason why the above example is gated to :dev environments); there are other concerns like database migrations and other internal and external variances that make this inappropriate for most production situations. That said, these same concerns often exist for other high-availability situations as well, so if you know that you want zero-downtime code upgrades, figuring out a way to do it cleanly within the application is likely valuable as a way to avoid the hell on Earth that is trying to do this with, say, a bunch of load-balanced Docker comtainers.
So the better option would be to figure out a way to compile a new release, get the modules in place where the currently-running release expects them, then write up a plug similar to the above to check for new module versions and load them (or just do it as part of whatever mechanism you'd use to get the updated modules into the running release in the first place).
Just the one being newly packaged into Elixir 1.9+ by default does not yet support hot-loading...
With a single code base capable of having millions of processes running as the norm, some handling direct client requests and others handling in-progress work, data storage, holding open connections for transfers, etc...you get the capability to deploy without disrupting ANY of that.
Most run times can't do anything close to that. Think about all of those X million websocket benchmarks...now think about being able to deploy without forcing all X million to try to reconnect at the same time.
And it can do this while all of the nodes are connected and communicating with each other as well as the outside world.
For standard issue client server, it's not that big of a deal. You just separate the web parts behind a load balancer.
For background workers, long lived connections, web sockets, video/audio streams, file transfers (CDN)...it's huge.
- to understand exactly what your app is doing
- to understand exactly what Erlang/Elixir releases are doin
- to understand exactly how cod upgrade works
- to understand exactly or very damn well how to make the system handle those 1 million connections
- to understand exactly how to handle all the things you wrote about
And then, and only then will you be able to "think about being able to deploy without forcing all X million to try to reconnect at the same time."
There are no magic bullets.