IIRC, if you read the original thesis, the reason for clusters is simply that there's always a chance an entire machine will go down, so if you want high reliability, you have no choice but to run a second one.
The OP is correct that the key to understanding every design decision in Erlang is to look at it through the lens of reliability. It also helps to think in terms of phone switches, where the time horizon for reliability is measured in milliseconds. I am responsible for many "reliable" systems, but their reliability needs are not quite at that granularity. For them, a pause of a few seconds, or a client potentially having to re-issue a request, is not as critical as dropping milliseconds from a phone call.