The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes open against spring pressure. Should integrity of the brake line be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).
https://en.m.wikipedia.org/wiki/Railway_air_brake
Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, in which they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.
https://en.m.wikipedia.org/wiki/Fail-safe
Upshot: "fail safe" doesn't mean "test all your failure conditions exhaustively". It may well mean aborting on any failure mode (see djb's software for examples). The most important criterion is that whatever the failure mode, the result is as safe as possible, and that this almost always rests on a very simple and robust design, mechanism, logic, or system.
From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often safety and monitoring systems themselves which play a key role in accidents and failures.
Fail-safe design comes from railroad signaling. It is a principle of classic railroad signaling that any broken wire or relay that fails to pull in must result in an indication not less safe than the correct one. "Vital" relays in classic signaling systems fall open by gravity, and use silver-to-silver contacts so as to avoid welding together on overloads. (Lightning strikes on rails and on signal lines are considered a normal part of railroad operation.)
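To make the "not less safe than the correct one" rule concrete, here is a minimal software sketch of the same idea. Everything in it (the Aspect names, the signal_aspect function, its two inputs) is made up for illustration and is not how real interlocking logic is written; the only point is that every missing or malformed input falls through to the most restrictive output.

```python
from enum import IntEnum


class Aspect(IntEnum):
    # Hypothetical aspects, ordered from most to least restrictive.
    STOP = 0
    CAUTION = 1
    CLEAR = 2


def signal_aspect(track_clear, next_signal):
    """Return the aspect to display, falling back to STOP on any doubt.

    track_clear: True/False from a track circuit, or None if the reading is missing.
    next_signal: the Aspect shown by the next signal, or None if unknown.
    """
    # Anything other than a definite True is treated like a broken wire.
    if track_clear is not True:
        return Aspect.STOP
    # An unknown downstream signal is also treated as the worst case.
    if not isinstance(next_signal, Aspect):
        return Aspect.STOP
    if next_signal is Aspect.STOP:
        return Aspect.CAUTION
    return Aspect.CLEAR
```

Note that there is no "else: assume it's fine" branch anywhere: the permissive result is only reachable when every input is positively known to be good.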
[1] https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air...
"Under the Westinghouse system, therefore, brakes are applied by reducing train line pressure and released by increasing train line pressure. The Westinghouse system is thus fail safe—any failure in the train line, including a separation ("break-in-two") of the train, will cause a loss of train line pressure, causing the brakes to be applied and bringing the train to a stop, thus preventing a runaway train."
Without air pressure, whether from the line or the canister, the brakes fail in the activated mode.
I'm trying to find a source, but my understanding is that red/green for lit signals as "stop/go" came about after an earlier mode, in which a steady white light meant "go", proved problematic: the red disks fronting stop lamps could fall out (or perhaps be broken), leaving ambiguity as to what "white" meant.
Switching to red and green lamps meant that the failed-disk mode now clearly indicated a signalling problem, where the signal could not be trusted.
Semitrailer parking brakes really are spring-loaded and released by air pressure.
Particularly when they're correcting errors or omissions in other comments, such as those in mine above, to which Animats is replying.
An example of such a system could be a ball check valve, which can inherently only work.
https://en.wikipedia.org/wiki/Check_valve
Can you think of a word to describe such systems?
The first is "impossible".
The second is "pre-failed".
As the drunk has observed, you can't fall off the floor.
If you're looking for a term for a system which is highly immune to failure, "resilient" comes to mind.
Take Tesla's solid-state, no-moving-parts one-way fluid valve. It has no moving parts to break (though it could conceivably be fouled by dust, dirt, sediment, or debris).
http://makezine.com/2012/01/05/the-tesla-valve-one-way-flow-...
"Overengineered" is another possibility.
There's certainly something to be said for retry strategies in places that involve a lot of network chatter, but please don't forget to add some kind of backoff so you don't end up retry-overloading a system that's trying to recover.
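For anyone who hasn't written one, a rough sketch of what retry with backoff can look like. This is plain Python, not the API of Failsafe or any other library; retry_with_backoff and all of its parameters are names I made up, and the "full jitter" delay (a random wait up to an exponentially growing cap) is just one common choice.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying on the listed exceptions with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            # Out of attempts: give up and let the caller see the error.
            if attempt == max_attempts - 1:
                raise
            # The cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, cap))
```

Exceptions not listed in retry_on are not retried at all, so a programming error still surfaces on the first call. The sleep parameter exists only so the waits can be stubbed out in tests.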
If you hit an error condition in your code that you aren't explicitly handling, break that mofo.
The faster and more explicitly you break, the better, as this gives you the signal to fix the problem.
Wrapping and retrying attempts to heal the damage, meaning that, effectively, your code is walking wounded: it's encountered an untrapped error, has ignored it, and is attempting to continue.
The faster and more definitively an error breaks, the better the likelihood of fixing it, and the more obvious the error and fix are.
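A small sketch of what I mean, with made-up names (parse_port, load_config, and the 8080 default are all hypothetical). The one condition the code explicitly understands, a missing key, is handled; everything else is allowed to blow up on the spot.

```python
def parse_port(raw):
    """Parse a TCP port, raising immediately on anything unexpected."""
    port = int(raw)  # raises on garbage instead of guessing a value
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def load_config(settings):
    # Handle only the case we explicitly understand: the key is absent.
    try:
        raw = settings["port"]
    except KeyError:
        raw = "8080"  # hypothetical documented default
    # Bad types and bad values are not caught here; they propagate and stop us.
    return parse_port(raw)
```

The tempting alternative, a bare "except Exception: return 8080", would keep the program running on a config that is known to be wrong, which is exactly the walking-wounded state described above.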
We released a microservices development kit (MDK) last week that implements similar semantics (e.g., circuit breakers, failover) in Python, JavaScript, Java, and Ruby. The implementation is actually written in a DSL, which we transpile into language-native implementations. We do this to ensure interop between different languages. We're working on updating our compiler to support Go and C#, adding richer semantics, and making the service discovery piece pluggable (currently there's a dependency on our own service discovery).
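For readers unfamiliar with the circuit breaker pattern mentioned here, a bare-bones sketch in Python. To be clear, this is not the MDK's API (or Failsafe's, or Hystrix's); CircuitBreaker, CircuitOpenError, and every parameter name are assumptions made for illustration, and the sketch is single-threaded.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            # Open: reject immediately until the reset timeout has elapsed.
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the pattern is that once a dependency is clearly down, callers get an immediate error instead of piling more requests onto it, and the dependency gets a quiet period in which to recover.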
https://github.com/jhalterman/failsafe/wiki/Comparisons#fail...
For example,
> Executable logic can be passed through Failsafe as simple lambda expressions or method references. In Hystrix, your executable logic needs to be placed in a HystrixCommand implementation
It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda" and hold state somewhere (either as an object field or passed into the lambda). Unless I'm missing something here, either seems acceptable.
There's nothing more detailed that I know of. Is there a particular feature area/comparison you're curious about? I can add a bit more detail.
> It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda"
What I meant by this bit is that the user experience is different. Failsafe can be used with method references or lambda expressions [1], which are a nice, concise way of wrapping executable logic with some failure handling strategy. You cannot do this with Hystrix since all logic must be wrapped in a HystrixCommand impl, which cannot be implemented as a lambda.
> either seems acceptable.
Like anything, it just depends on what you want. If you want retries and general-purpose failure handling, consider Failsafe. If you want request collapsing, thread pool management, and monitoring, consider Hystrix.
[1]: https://github.com/jhalterman/failsafe#synchronous-retries
Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.