I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.
Any suggestions?
ok = do_something_that_might_fail()
If it returns ok: great, it worked and you move on. If it doesn't return ok, the process crashes, you get a crash report, and the supervisor restarts it, if that's how the supervisor is configured. Presumably it starts properly and deals with future requests.
There are two issues you might rapidly encounter.
1) if a supervised process restarts too many times in an interval, the supervisor will stop (and presumably restart), and that cascades up to potentially your node stopping. This is by design, and has good reasons, but might not be expected and might not be a good fit for larger nodes running many things.
2) if your process crashes, its message queue (mailbox) is discarded, and if you were sending to a process registered by name or process group (pg), the name is now unregistered. This means a service process crashing will discard several requests: the one in progress, which is probably fine (it crashed, after all), but also others that could have been serviced. In my experience, you end up wanting to catch errors in service processes, log them, and move on to the next request, so you don't lose unrelated requests. Depending on your application, a restart might be better, or you might run each request in a fresh process for isolation... Lots of ways to manage this.
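A minimal sketch of the "catch, log, move on" approach from point 2, assuming a plain receive loop rather than a gen_server. The module name, the message shapes, and the `Handler` fun are all made up for illustration; a real service would more likely be a gen_server under a supervisor.

```erlang
-module(service_demo).
-export([start/1, call/2]).

%% Handler is a hypothetical fun(Request) -> Reply supplied by the caller.
start(Handler) ->
    spawn(fun() -> loop(Handler) end).

%% Send one request and wait (up to 1 second) for the tagged reply.
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Request},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        {error, timeout}
    end.

loop(Handler) ->
    receive
        {request, From, Ref, Request} ->
            Reply =
                try
                    {ok, Handler(Request)}
                catch
                    Class:Reason:Stack ->
                        %% Log and keep going, so queued requests survive.
                        logger:error("request ~p failed: ~p:~p~n~p",
                                     [Request, Class, Reason, Stack]),
                        {error, Reason}
                end,
            From ! {Ref, Reply},
            loop(Handler)
    end.
```

With `Pid = service_demo:start(fun(N) -> 10 div N end)`, a call with `0` gets an error reply, and a later call with `2` is still served by the same process, mailbox intact. The trade-off is that any state corrupted by the failed request stays around, which is exactly what a restart would have cleaned up.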
If so, for what kind of apps?
If not anymore, what would you use instead?
In my mind, there are two really good fits (and some crossover between them).
a) binary matching syntax is really nice for dealing with bit-packed things; although it's not pretty if dealing with little endian values where there's a couple bits in one byte and a couple more in the neighboring byte --- you've got to get the pieces and put them together if it's not just whole bytes. Big endian bit packed structures are easy, but little endian is dominant these days. I don't know if performance is good, but it's easy to read and write for developers.
b) anything with a large number of connections and significant state per connection. A chat server, video conference, etc.
This is why you see 'everyone' build chat from ejabberd. Erlang makes code for this kind of service fairly simple, and hotloading means you can fix bugs without kicking everyone off to restart. Observability features make it reasonable to see what's going on in your system and where bottlenecks are.
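Going back to point (a), here is a small sketch of what the binary syntax looks like for both cases. Both layouts are invented for illustration: a big-endian header with a 4-bit version, 12-bit length and 16-bit id, and a little-endian 16-bit word with flags in the top 4 bits and a value in the low 12 bits.

```erlang
-module(bits_demo).
-export([parse_be/1, parse_le/1]).

%% Big-endian packed header (made-up layout): fields are matched
%% directly, even though Version and Length share a byte.
parse_be(<<Version:4, Length:12, Id:16, Payload/binary>>) ->
    {Version, Length, Id, Payload}.

%% Little-endian word (made-up layout): the 12-bit value straddles
%% both bytes, so read the whole word as little-endian first and
%% then split the resulting integer into its bit fields.
parse_le(<<Word:16/little, Rest/binary>>) ->
    <<Flags:4, Value:12>> = <<Word:16>>,
    {Flags, Value, Rest}.
```

`bits_demo:parse_be(<<1:4, 171:12, 4660:16, "x">>)` returns `{1, 171, 4660, <<"x">>}`, and `bits_demo:parse_le(<<16#23, 16#A1>>)` returns `{10, 291, <<>>}`. The extra re-match in `parse_le/1` is the "get the pieces and put them together" step.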
Basically, we have hundreds of thousands of sensors connected through gateways that keep open TCP/IP connections to the Erlang/OTP distributed backend. We do bidirectional communication as we have many control functions, OTA firmware updates, etc.
There are frequent failures which we handle with supervision trees and “let it fail” design principles. Failures are due to:
* Gateways which are connected through cellular networks with varying signal strength conditions.
* Sensors failing and providing incorrect data (e.g. invalid float binaries).
* Buggy firmware in some 3rd party gateways / sensors.
* Buggy firmware in our own gateways and sensors ;-)
I highly recommend Erlang/OTP due to:
* Fault tolerance - processes fail, nodes carry on.
* Concurrency model (mailboxes, linking processes via supervision trees, monitors, trapping, etc)
* Built-in distribution and related modules in standard library
* Pattern matching which makes processing binary data super convenient
* Mnesia database (if used for right things)
Seems like typical exception handling. Erlang isn't even type checked
2. Erlang has a variety of type checks at a number of conceptual levels, so I strongly recommend you go plow through something like https://learnyousomeerlang.com/ in order to improve your understanding here.
Oh. And it's on the other side of a cluster.
Sure, you could do it in Python or rust. It won't be zero lines of code. You're probably gonna get it wrong.
Rust and Python don't come with supervision trees, so you'd have to build or find that. They also don't come with async messaging, especially not cross node async messaging, so you'd have to build or find that.
Building process linking and monitoring, where the death of one process notifies or kills other processes, is tricky, and I suspect you won't find that off the shelf; so you'll have to build it yourself.
But even if you have all of that, now you're writing Erlang style in another language, and it's not idiomatic and people won't understand or like your code.
You can even build in hotloading. I did it (poorly) in Perl in the early 2000s without knowing it was available elsewhere, and I did it more recently in C with dlopen and friends. It makes your code look real funny though.
I also think you missed the ease of raising exceptions... ok = ... is very powerful and concise. You don't write a throw, you don't worry about what failure looks like, you just pattern match success. If/when failure happens, you usually have what you need to figure out what went wrong, and if it's better to do something else, it's easy to update your code (and you can hotload the update to the running system).
The idea is to implement error handling in the core (VM, supervisors, DB connection pool) while the vast majority of the code can just crash at any time and not worry about closing its files or whatever.
You can handle errors and exceptions if you want; the supervisors etc. are more there for the unexpected failures that happen for all sorts of reasons in any software.
Sometimes it also hides bugs because your solution appears to work correctly;)
Check the code of cowboy, ejabberd, MongooseIM, RabbitMQ for examples. Many factors go into deciding when to make a new process: data locality, the pattern of interaction with other processes, performance considerations. A good idea is to have one process per TCP connection, but not one process per routed message. And be careful with blocking gen_server calls: they stall the caller until the server replies, and by default they time out after 5 seconds and crash the caller.
+ CouchDB?
Riak Core is extremely cool, but Riak is dead by now. It was a child of the times when NoSQL was cool. Still, the Basho code is interesting to read. (https://github.com/basho/riak_core)
Self-ads: we've tried to remove Mnesia from our project, HN post incoming, once the library is prettified and tested hard (https://github.com/esl/cets).
https://www.erlang.org/doc/design_principles/users_guide
There are some good texts that have more examples:
Erlang & OTP in Action - https://www.manning.com/books/erlang-and-otp-in-action
Designing for Scalability with Erlang/OTP - https://www.oreilly.com/library/view/designing-for-scalabili...
One big example of distributed Erlang is Riak:
Armstrong's Programming Erlang is another one to look at.
Instead, CPU or Memory would increase over time, hit the VM limit, kill and restart.
So later when I noticed this, I could debug and fix it without simultaneously fighting a prod incident.
I have thought of writing this! It would be quite useful to a lot of people.
Elixir has a lot of smaller but very high quality libraries to learn from. You may be interested in how Ecto & Postgrex manage DB connections, in particular how connection sockets are “borrowed” so data doesn’t get repeatedly messaged (read: copied) between processes. Bandit / Thousand Island also make interesting decisions for process structure in HTTP1.1 vs HTTP2.
I think a common mistake is to create processes mimicking classic OOP structure, like an OrderProcessor, ShippingManager, etc. Processes in Erlang are a unit of fault tolerance, not code organization. More commonly you’ll have one process per request, potentially calling code from many different modules, since requests are the things you want to fail separately from each other.
In RabbitMQ’s case for instance connections and queues are processes, but exchanges are not. It would feel natural to model the problem as three processes with messages going Connection -> Exchange -> Queue, but in reality an exchange is a set of routing rules that can be applied by a connection directly, which avoids a lot of complexity and overhead.
Last thing I’d note is supervision trees etc. are really about handling _unexpected_ errors (Joe uses the terms faults and errors with different meanings iirc). If you want a web request to be retried a few times with a delay, don’t use a supervisor for that, just loop with a sleep. Same for things like validating inputs from a form, usually you’d want to give the user a hint and not just crash.
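The "just loop with a sleep" advice could look something like this. It's only a sketch: the module name is made up, and it assumes the operation is a fun returning `{ok, Value}` or `{error, Reason}`, with a fixed delay between attempts (no backoff or jitter).

```erlang
-module(retry_demo).
-export([retry/3]).

%% Call Fun up to Attempts times, sleeping DelayMs between failures.
%% Fun is assumed to return {ok, Value} or {error, Reason}.
retry(Fun, Attempts, DelayMs) when Attempts >= 1 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, _} = Err when Attempts =:= 1 ->
            %% Last attempt: hand the expected error back to the caller.
            Err;
        {error, _} ->
            timer:sleep(DelayMs),
            retry(Fun, Attempts - 1, DelayMs)
    end.
```

Anything outside those two return shapes (or an exception inside `Fun`) still crashes the caller, so the unexpected cases keep going to the supervisor while the expected ones are handled inline.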
Some other useful links:
- https://aosabook.org/en/v1/riak.html (bit old, but another large codebase)
[1] https://github.com/cbd/edis [2] https://github.com/elbrujohalcon
Processes are failure and concurrency barriers.
Failure: one process crashing does not crash another process, unless you explicitly want it to (e.g., via Erlang's `link` functionality). So, if you have multiple operations that must not interfere with each other in the case of one of them misbehaving (e.g., your application makes multiple HTTP requests in parallel), you want them in separate processes.
Concurrency: processes are independently and preemptively scheduled by the VM. If you have multiple operations that are not necessarily sequentially ordered, and you want to run them at the same time, you put each of them in a process. One example problem where this applies would be the handling of incoming TCP messages, where each message is not related to the previous or subsequent messages, and you want to be able to process multiple messages at the same time.
If you handle each new message in its own process, the VM will schedule the processing of those messages such that the processing of one message will not interfere with the processing of another. It accomplishes this by tracking a rough proxy of CPU time each process uses (called "reductions" in Erlang) and descheduling processes that consume too many resources and giving other processes a chance to run for a bit. (Note that this is just one example and ignores any performance considerations. There are other approaches but I am omitting them for simplicity's sake)
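To make the failure-barrier part concrete, here is a small sketch that runs each operation in its own monitored process and collects one result per operation. The module and function names are made up, and passing the result back through the exit reason is just a shortcut to keep the example short; sending a message would be more typical.

```erlang
-module(isolate_demo).
-export([run_all/1]).

%% Run each fun in its own process and collect one result per fun,
%% in the same order as the input list.
run_all(Funs) ->
    Refs = [start(F) || F <- Funs],
    [await(Ref) || Ref <- Refs].

start(F) ->
    %% The worker reports its result through its exit reason.
    {_Pid, Ref} = spawn_monitor(fun() -> exit({done, F()}) end),
    Ref.

await(Ref) ->
    receive
        {'DOWN', Ref, process, _Pid, {done, Value}} -> {ok, Value};
        {'DOWN', Ref, process, _Pid, Reason} -> {error, Reason}
    end.
```

If one of the funs raises, only its entry comes back as `{error, Reason}`; the others still return `{ok, Value}`, and the calling process never crashes because it monitors the workers rather than linking to them.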
There are a number of good libraries to look at for these in practice. I'd personally go look at Cowboy and/or Ranch as they deal with lots of IO. Oban is an Elixir job queue library that is fantastic and has very high code quality. Another good one would be Poolboy, which is a worker pool library.
Always?
Times some factor so you have several instances of the same thing in case one fails.
Good luck.
It's a NoSQL DB written in Erlang. I looked at it a few years ago, its master to master replication seemed cool.
I've seen many Erlang systems fail in funny ways, including some of the big examples given here. Supervision trees are cool but it's clearly nonsense to hardcode restart strategy and timing numbers for workers as if all failure modes are the same and deployed in the same network/capacity/resource/conditions with any number of workers. The strategy and schedule for recovering 10 crashed resource workers will clearly be different when you have 1M workers. The strategy will be different if you are timing out on network or if you are getting a resource error and have better things to do than restarting workers.
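For reference, the numbers in question live in the supervisor flags returned from `init/1`. A rough sketch, assuming you at least want them to come from configuration instead of being literals in the code. The application name `my_app` and the keys `max_restarts` and `restart_period_s` are hypothetical, and the defaults are arbitrary.

```erlang
-module(sup_demo).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Hypothetical app name and keys; the defaults are illustrative only.
    Intensity = application:get_env(my_app, max_restarts, 3),
    Period = application:get_env(my_app, restart_period_s, 10),
    %% More than Intensity restarts within Period seconds makes this
    %% supervisor give up and exit, which escalates to its parent.
    SupFlags = #{strategy => one_for_one,
                 intensity => Intensity,
                 period => Period},
    %% No children here; real child specs would go in this list.
    {ok, {SupFlags, []}}.
```

This only moves the numbers into per-deployment config. It does not make the strategy depend on the failure mode or on the number of workers, which is the harder problem described above.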
Focus on fault tolerance outside Erlang: have standby capacity in isolation and load-balance properly, and shard the system into isolated pieces as much as you can.