Ask HN: Good examples of fault-tolerant Erlang code? | Better HN

45 comments


toast0 · 2y ago · 11 in thread

If you follow OTP design principles, you end up with a supervision tree, and a lot of code like...

   ok = do_something_that_might_fail()

If it returns ok: great, it worked and you move on. If it doesn't return ok, the process crashes, you get a crash report, and the supervisor restarts it, if that's how the supervisor is configured. Presumably it starts properly and deals with future requests.
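A minimal sketch of that pattern, for illustration only. The module name `ok_match_demo` and the stand-in `do_something_that_might_fail/1` (given an argument here so it can fail on demand) are invented for this example, and it uses `spawn_monitor` rather than a supervisor so the crash reason can be inspected directly.

```erlang
-module(ok_match_demo).
-export([run/0]).

%% Hypothetical operation: succeeds for even input, fails for odd input.
do_something_that_might_fail(N) when N rem 2 =:= 0 -> ok;
do_something_that_might_fail(_N) -> {error, odd_input}.

run() ->
    %% Run the risky code in its own monitored process, so the crash
    %% is reported to us instead of taking the caller down.
    {Pid, Ref} = spawn_monitor(fun() ->
        ok = do_something_that_might_fail(2),  % matches, carries on
        ok = do_something_that_might_fail(3)   % badmatch, process dies
    end),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling `ok_match_demo:run()` should return a tuple whose first element is `{badmatch, {error, odd_input}}`, followed by the stack trace. Under a supervisor, that same exit reason is what ends up in the crash report.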

There are two issues you might rapidly encounter.

1) if a supervised process restarts too many times in an interval, the supervisor will stop (and presumably restart), and that cascades up to potentially your node stopping. This is by design, and has good reasons, but might not be expected and might not be a good fit for larger nodes running many things.

2) if your process crashes, its message queue (mailbox) is discarded, and if you were sending to a process registered by name or process group (pg), the name is now unregistered. This means a service process crashing will discard several requests; the one in progress which is probably fine (it crashed after all), but also others that could have been serviced. In my experience, you end up wanting to catch errors in service processes, log them, and move on to the next request, so you don't lose unrelated requests. Depending on your application, a restart might be better, or you might run each request in a fresh process for isolation... Lots of ways to manage this.
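One way the "catch, log, and move on" approach from point 2 might look is sketched below. The module `safe_service`, its `{divide, A, B}` request, and the `{error, failed}` reply are all hypothetical, picked only to have something that can fail. This is one option among the several mentioned above, not a recommendation over restarting.

```erlang
-module(safe_service).
-behaviour(gen_server).
-export([start_link/0, request/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

request(Req) ->
    gen_server:call(?MODULE, Req).

init([]) ->
    {ok, #{}}.

%% Wrap each request so one bad request does not discard the mailbox.
handle_call(Req, _From, State) ->
    try
        {reply, {ok, do_work(Req)}, State}
    catch
        Class:Reason:Stack ->
            logger:error("request ~p failed: ~p:~p~n~p",
                         [Req, Class, Reason, Stack]),
            {reply, {error, failed}, State}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Hypothetical work function; crashes with badarith when B is 0.
do_work({divide, A, B}) -> A div B.
```

With this in place, `safe_service:request({divide, 1, 0})` gets an error reply and the next caller is still served by the same process. The trade-off is that the process keeps whatever state it had, so this only makes sense when a failed request cannot corrupt that state.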

What we did in a project was store message data in a JSONB field of a database table, with an id of the recipient process in another field. We used the module name for that. Then we sent a normal message with the id of the database record. If the process succeeded, it marked the record as done. If it crashed, any of those processes would read all of its unprocessed records upon restart and start working on them. We took care never to restart a failed process too fast. Sometimes they kept failing for hours or days until we had a fix, then they processed their backlog. If some jobs were not important anymore or must not be performed because something else took care of them (even people manually), the fix had to skip them. Manual UPDATEs, code, etc.

@toast0, would you still use Erlang/OTP today?

If so, for what kind of apps.

If not anymore, what would you use instead.

toast0 · 2y ago

Yes. Certainly.

In my mind, there's two really good fits (and some cross over between them).

a) binary matching syntax is really nice for dealing with bit-packed things; although it's not pretty if dealing with little endian values where there's a couple bits in one byte and a couple more in the neighboring byte --- you've got to get the pieces and put them together if it's not just whole bytes. Big endian bit packed structures are easy, but little endian is dominant these days. I don't know if performance is good, but it's easy to read and write for developers.
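To make the big endian versus little endian point concrete, here is a small sketch with two invented layouts (neither comes from a real protocol). `parse_be/1` reads a big endian header in one pattern. `parse_le/1` reads a 16-bit little endian word holding a 10-bit value in bits 0..9 and a 6-bit tag in bits 10..15, where the value straddles both bytes and has to be reassembled by hand.

```erlang
-module(bits_demo).
-export([parse_be/1, parse_le/1]).

%% Big endian: 4-bit version, 4-bit flags, 16-bit length, then payload.
%% The fields come out directly from the pattern.
parse_be(<<Version:4, Flags:4, Len:16/big, Payload:Len/binary, _/binary>>) ->
    {Version, Flags, Payload}.

%% Little endian word: the first byte carries the low 8 bits of the
%% value; the second byte carries the tag (top 6 bits) and the high
%% 2 bits of the value. The two value pieces are joined manually.
parse_le(<<Lo:8, Tag:6, Hi:2>>) ->
    {Tag, (Hi bsl 8) bor Lo}.
```

For example, tag 5 with value 679 is the word 16#16A7, stored as the bytes `<<16#A7, 16#16>>`, and `bits_demo:parse_le(<<16#A7, 16#16>>)` gives `{5, 679}`.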

b) anything with a large number of connections and significant state per connection. A chat server, video conference, etc.

This is why you see 'everyone' build chat from ejabberd. Erlang makes code for this kind of service fairly simple, and hotloading means you can fix bugs without kicking everyone off to restart. Observability features make it reasonable to see what's going on in your system and where bottlenecks are.

We are running Erlang/OTP for telemetry backend in the industrial automation field.

Basically, we have hundreds of thousands of sensors connected through gateways that keep open TCP/IP connections to the Erlang/OTP distributed backend. We do bidirectional communication as we have many control functions, OTA firmware updates, etc.

There are frequent failures which we handle with supervision trees and “let it fail” design principles. Failures are due to:

* Gateways which are connected through cellular networks with varying signal strength conditions.

* Sensors failing and providing incorrect data (eg. invalid float binaries).

* Buggy firmware in some 3rd party gateways / sensors.

* Buggy firmware in our own gateways and sensors ;-)

I highly recommend Erlang/OTP due to:

* Fault tolerance - processes fail, nodes carry on.

* Concurrency model (mailboxes, linking processes via supervision trees, monitors, trapping, etc)

* Built-in distribution and related modules in standard library

* Pattern matching which makes processing binary data super convenient

* Mnesia database (if used for right things)

jimsimmons · 2y ago

Why wouldn't you be able to implement this in Rust or Python?

Seems like typical exception handling. Erlang isn't even type checked

felixgallo · 2y ago

1. You can implement supervisor trees in rust or python, but neither of them have a runtime that supports that out of the box and an exception handling system that works harmoniously with that runtime, so you'd have to implement that all yourself, and it turns out to be a non-trivial task.

2. erlang has a variety of type checks at a number of conceptual levels, so strongly recommend you go plow through something like https://learnyousomeerlang.com/ in order to improve your understanding here.

throwawaymaths · 2y ago

That's the basics. The next level is, what if you want one process dying to interrupt and trigger killing another process that is associated with it -- perhaps it's the parent in an rpc -- that's the easiest case (and trigger the correct exception message in the logs of both nodes)

Oh. And it's on the other side of a cluster.

Sure, you could do it in Python or rust. It won't be zero lines of code. You're probably gonna get it wrong.

toast0 · 2y ago

Certainly, you can do most things in most languages.

Rust and Python don't come with supervision trees, so you'd have to build or find that. They also don't come with async messaging, especially not cross node async messaging, so you'd have to build or find that.

Building process linking and monitoring where the death of one process notifies or kills other processes and all of that is tricky, and I suspect you won't find that; so you'll have to build it, which is going to be tricky.

But even if you have all of that, now you're writing Erlang style in another language, and it's not idiomatic and people won't understand or like your code.

You can even build in hotloading. I did it (poorly) in Perl in the early 2000s without knowing it was available elsewhere, and I did it more recently in C with dlopen and friends. It makes your code look real funny though.

I also think you missed the ease of raising exceptions... ok = ... is very powerful and concise. You don't write a throw, you don't worry about what failure looks like, you just pattern match success, and if/when failure happens, you usually have what you need to figure out what went wrong and if it's better to do something else, it's easy to update your code (and you can hotload the update to the running system)

The Erlang way doesn’t require exception handling code because processes are isolated, so when one dies the runtime can clean up after it (reclaiming memory, file handles, etc.) as it knows what process owns what resources. Links and monitors make it possible to expand this to custom types of resources e.g. DB connections, locks, queues…
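As a rough illustration of extending that cleanup to a custom resource, the sketch below is a toy lock server that monitors whoever holds the lock and frees it when the holder dies. Everything here (`lock_demo`, the message shapes, the `granted`/`busy`/`none` replies) is made up for the example; a real lock service would need queuing, timeouts, and an explicit release.

```erlang
-module(lock_demo).
-export([start/0, acquire/1, holder/1]).

start() ->
    spawn(fun() -> loop(free) end).

%% Ask for the lock; returns granted or busy.
acquire(Lock) ->
    Lock ! {acquire, self()},
    receive {Lock, Reply} -> Reply after 1000 -> timeout end.

%% Ask who holds the lock; returns a pid or none.
holder(Lock) ->
    Lock ! {holder, self()},
    receive {Lock, Reply} -> Reply after 1000 -> timeout end.

loop(free) ->
    receive
        {acquire, Pid} ->
            %% Monitor the new owner so its death reaches us as 'DOWN'.
            Ref = erlang:monitor(process, Pid),
            Pid ! {self(), granted},
            loop({held, Pid, Ref});
        {holder, From} ->
            From ! {self(), none},
            loop(free)
    end;
loop({held, Owner, Ref} = State) ->
    receive
        {'DOWN', Ref, process, Owner, _Reason} ->
            %% Owner died without releasing: reclaim the lock.
            loop(free);
        {acquire, Pid} ->
            Pid ! {self(), busy},
            loop(State);
        {holder, From} ->
            From ! {self(), Owner},
            loop(State)
    end.
```

The owning process contains no cleanup code at all. It can crash anywhere and the lock comes back, which is the same division of labor described above for memory and file handles.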

The idea is to implement error handling in the core (VM, supervisors, DB connection pool) while the vast majority of the code can just crash at any time and not worry about closing its files or whatever.

I think the "let it fail" slogan is taken too literally, especially by people who don't use OTP.

You can handle errors and exceptions if you want; the supervisors etc. are more there for the unexpected failures that happen for all sorts of reasons in any software.

Sometimes it also hides bugs because your solution appears to work correctly;)

roeles (OP) · 2y ago

Thanks for your insights!

octacat · 2y ago · 5 in thread

The simple answer: supervision trees. And fault-tolerance usually means that the failing process would be restarted. It would not handle stuff like netsplits or node going down though.

Check the code of cowboy, ejabberd, MongooseIM, RabbitMQ for examples. Many factors go into deciding when to make a new process: data locality, the pattern of interaction with other processes, performance considerations. A good idea is to have one process per TCP connection, but not one process per routed message. And be careful with blocking gen_server calls - these could block or fail.

plugin-baby · 2y ago

> Check code of cowboy, ejabberd, MongooseIM, RabbitMQ for examples.

+ CouchDB?

cmdrk · 2y ago

Just to add to this, there are some implementations of things like consensus algorithms in Erlang such as Ra: https://github.com/rabbitmq/ra

Ra is pretty cool. Making a cluster fault tolerant is an art of its own in Erlang ;)

Riak Core is extremely cool, but Riak is dead by now. It was a child of the times when NoSQL was cool. Still, basho code is interesting to read. (https://github.com/basho/riak_core)

Self-ads: we've tried to remove Mnesia from our project, HN post incoming, once the library is prettified and tested hard (https://github.com/esl/cets).

roeles (OP) · 2y ago

Thanks! Will check those projects out.

mingusrude · 2y ago

It's been a while since I worked actively with Erlang, but I remember that reading the cowboy source code was educational.

al2o3cr · 2y ago · 4 in thread

Step zero is definitely the OTP Design Principles doc (part of the OTP distribution):

https://www.erlang.org/doc/design_principles/users_guide

There are some good texts that have more examples:

Erlang & OTP in Action - https://www.manning.com/books/erlang-and-otp-in-action

Designing for Scalability with Erlang/OTP - https://www.oreilly.com/library/view/designing-for-scalabili...

One big example of distributed Erlang is Riak:

https://github.com/basho/riak

skylabmelody · 2y ago

And the classic "Learn You Some Erlang for Great Good!":

https://learnyousomeerlang.com/introduction

Armstrong's Erlang: Software for a concurrent world is my recommendation.

SoftTalker · 2y ago

Yes, Erlang & OTP in Action is one I would recommend in particular. It's a few years old now and I don't know if they've updated it but the basics of OTP, supervision, etc. have not really changed.

Armstrong's Programming Erlang is another one to look at.

roeles (OP) · 2y ago

Thanks for the book recommendations!

ihuk · 2y ago · 3 in thread

You don't achieve fault tolerance solely by using Erlang. Erlang does not inherently 'achieve fault tolerance.' Instead, you make your system fault-tolerant through deliberate engineering. While Erlang provides tools and design guidelines, the responsibility for achieving fault tolerance ultimately lies with you. Source: I implemented and operated a large Erlang system for approximately 3 years.

LoganDark · 2y ago

This is exactly why they're asking for example projects...

BlueHotDog2 · 2y ago

that's always true. i think the author is interested in code examples of such. and unlike many other frameworks/tools, erlang provides a great pit-of-success for implementing fault tolerance - e.g if you follow common/best practices - you'll achieve a fairly good fault tolerance.

lixy · 2y ago

The big benefit in my experience was that I could have a program with real users that did have errors (from me being new to Elixir and not knowing better) and still not experience downtime.

Instead, CPU or Memory would increase over time, hit the VM limit, kill and restart.

So later when I noticed this, I could debug and fix it without simultaneously fighting a prod incident.

thibaut_barrere · 2y ago · 1 in thread

If you were to consolidate all the info (including links published by Joe) into an informative blog post, it could become the “2024 reference bookmark” for a lot of people.

I have thought of writing this! It would be quite useful to a lot of people.

roeles (OP) · 2y ago

That would be great!

RabbitMQ is IMHO probably the best open source example tackling a large, complicated real world problem with graceful degradation (e.g. if a queue keeps crashing).

Elixir has a lot of smaller but very high quality libraries to learn from. You may be interested in how Ecto & Postgrex manage DB connections, in particular how connection sockets are “borrowed” so data doesn’t get repeatedly messaged (read: copied) between processes. Bandit / Thousand Island also make interesting decisions for process structure in HTTP1.1 vs HTTP2.

I think a common mistake is to create processes mimicking classic OOP structure, like an OrderProcessor, ShippingManager, etc. Processes in Erlang are a unit of fault tolerance, not code organization. This means more usually you’ll have one process per request, potentially calling code from many different modules; since requests are the things you want to fail separately from each other.

In RabbitMQ’s case for instance connections and queues are processes, but exchanges are not. It would feel natural to model the problem as three processes with messages going Connection -> Exchange -> Queue, but in reality an exchange is a set of routing rules that can be applied by a connection directly, which avoids a lot of complexity and overhead.

Last thing I’d note is supervision trees etc. are really about handling _unexpected_ errors (Joe uses the terms faults and errors with different meanings iirc). If you want a web request to be retried a few times with a delay, don’t use a supervisor for that, just loop with a sleep. Same for things like validating inputs from a form, usually you’d want to give the user a hint and not just crash.
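The "just loop with a sleep" suggestion could be as small as the sketch below. The `retry_demo` module and the `{ok, _}` / `{error, _}` return convention are assumptions for illustration; the function being retried is whatever call you expect to fail now and then.

```erlang
-module(retry_demo).
-export([retry/3]).

%% Call Fun up to Attempts times, sleeping DelayMs between failures.
%% Fun is expected to return {ok, Value} or {error, Reason}; anything
%% else is treated as a bug and crashes with case_clause.
retry(Fun, Attempts, DelayMs) when Attempts > 0 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, _} = Err when Attempts =:= 1 ->
            Err;  % out of attempts, hand the last error back
        {error, _} ->
            timer:sleep(DelayMs),
            retry(Fun, Attempts - 1, DelayMs)
    end.
```

Expected failures are handled inline and returned as values, while anything unexpected still crashes the process and is left to the supervisor.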

Some other useful links:

- https://aosabook.org/en/v1/riak.html (bit old, but another large codebase)

- https://ferd.ca/the-zen-of-erlang.html

- https://www.theerlangelist.com/article/spawn_or_not

chadd · 2y ago

I have written a lot of Erlang code over the years, including an Erlang Redis clone which had some interesting performance characteristics[1] ... though not recently, I went too far down the engineering management track at Snap and elsewhere... but I worked closely with Fernando "El Brujo"[2] when he was CTO of my consultancy. If you want to see beautiful, canonical Erlang code, he's still slinging it out. Dig through his repos on Github, or better yet, ask him to provide his suggestions.

[1] https://github.com/cbd/edis [2] https://github.com/elbrujohalcon

asa400 · 2y ago

> I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.

Processes are failure and concurrency barriers.

Failure: one process crashing does not crash another process, unless you explicitly want it to (e.g., via Erlang's `link` functionality). So, if you have multiple operations that must not interfere with each other in the case of one of them misbehaving (e.g., your application makes multiple HTTP requests in parallel), you want them in separate processes.

Concurrency: processes are independently and preemptively scheduled by the VM. If you have multiple operations that are not necessarily sequentially ordered, and you want to run them at the same time, you put each of them in a process. One example problem where this applies would be the handling of incoming TCP messages, where each message is not related to the previous or subsequent messages, and you want to be able to process multiple messages at the same time.

If you handle each new message in its own process, the VM will schedule the processing of those messages such that the processing of one message will not interfere with the processing of another. It accomplishes this by tracking a rough proxy of CPU time each process uses (called "reductions" in Erlang) and descheduling processes that consume too many resources and giving other processes a chance to run for a bit. (Note that this is just one example and ignores any performance considerations. There are other approaches but I am omitting them for simplicity's sake)
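A bare-bones sketch of the process-per-message idea, ignoring performance as noted above. The module `per_message` and the `{done, Result}` exit convention are invented for the example; it spawns one monitored process per message and collects either a result or the crash reason for each.

```erlang
-module(per_message).
-export([handle_all/2]).

%% Run Fun on every message in its own process and collect outcomes
%% in input order. A crash in one worker affects only its own entry.
handle_all(Fun, Messages) ->
    Workers = [spawn_monitor(fun() -> exit({done, Fun(M)}) end)
               || M <- Messages],
    [receive
         {'DOWN', Ref, process, Pid, {done, Result}} -> {ok, Result};
         {'DOWN', Ref, process, Pid, Reason} -> {error, Reason}
     end || {Pid, Ref} <- Workers].
```

For instance, `per_message:handle_all(fun(X) -> 10 div X end, [2, 0, 5])` yields `{ok, 5}` and `{ok, 2}` for the first and last messages, and an `{error, {badarith, Stack}}` entry for the division by zero in between.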

There are a number of good libraries to look at for these in practice. I'd personally go look at Cowboy and/or Ranch as they deal with lots of IO. Oban is an Elixir job queue library that is fantastic and has very high code quality. Another good one would be Poolboy, which is a worker pool library.

> when to split off a new process

Always?

Times some factor so you have several instances of the same thing in case one fails.

Good luck.

jmnicolas · 2y ago

I don't know Erlang so take it with a grain of salt, but maybe take a look at the CouchDB code base?

It's a NoSQL DB written in Erlang. I looked at it a few years ago, its master to master replication seemed cool.

rramadass · 2y ago

Not code but an excellent presentation of design/architecture of a Real-World fault-tolerant and distributed System in Erlang/Elixir - https://www.youtube.com/watch?v=pQ0CvjAJXz4

vladimirralev · 2y ago

My advice is don't take Erlang's fault-tolerance promises too seriously. It's just a little framework that helps in some cases and gets in the way in other cases.

I've seen many Erlang systems fail in funny ways, including some of the big examples given here. Supervision trees are cool but it's clearly nonsense to hardcode restart strategy and timing numbers for workers as if all failure modes are the same and deployed in the same network/capacity/resource/conditions with any number of workers. The strategy and schedule for recovering 10 crashed resource workers will clearly be different when you have 1M workers. The strategy will be different if you are timing out on network or if you are getting a resource error and have better things to do than restarting workers.
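For readers who have not seen them, the "restart strategy and timing numbers" are the supervisor flags returned from `init/1`. The sketch below shows where they live; the module `demo_sup`, the placeholder worker, and the values 5 and 10 are arbitrary choices for illustration, not recommendations.

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% These are the fixed numbers being criticized: more than
    %% `intensity` restarts within `period` seconds and the
    %% supervisor itself gives up and exits.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Child = #{id => demo_worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Placeholder worker: a linked process that waits for a stop message.
start_worker() ->
    {ok, proc_lib:spawn_link(fun worker_loop/0)}.

worker_loop() ->
    receive stop -> ok end.
```

Since `init/1` is ordinary code, the numbers could be computed from configuration or worker count instead of being written as literals, though the supervisor still applies one policy to every failure it sees.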

Focus on fault-tolerance outside erlang - have standby capacity in isolation and load-balance properly, shard the system in isolated pieces as much as you can.
