The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes open against spring pressure. Should integrity of the brake line be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).
https://en.m.wikipedia.org/wiki/Railway_air_brake
Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, in which they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.
https://en.m.wikipedia.org/wiki/Fail-safe
Upshot: "fail safe" doesn't mean "test all your failure conditions exhaustively". It may well mean aborting on any failure mode (see djb's software for examples). The most important criterion is that, whatever the failure mode, the result be as safe as possible, which almost always means relying on a very simple and robust design, mechanism, logic, or system.
From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often safety and monitoring systems themselves which play a key role in accidents and failures.
Fail-safe design comes from railroad signaling. It is a principle of classic railroad signaling that any broken wire, or any relay that fails to pull in, must result in an indication not less safe than the correct one. "Vital" relays in classic signaling systems fall open by gravity, and use silver-to-silver contacts so as to avoid welding together on overloads. (Lightning strikes on rails and on signal lines are considered a normal part of railroad operation.)
[1] https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air...
"Under the Westinghouse system, therefore, brakes are applied by reducing train line pressure and released by increasing train line pressure. The Westinghouse system is thus fail safe—any failure in the train line, including a separation ("break-in-two") of the train, will cause a loss of train line pressure, causing the brakes to be applied and bringing the train to a stop, thus preventing a runaway train."
Without air pressure, whether from the line or the canister, the brakes fail in the activated mode.
I'm trying to find a source, but my understanding is that red/green for lit signals as "stop/go" came about after an earlier scheme, in which a steady white light meant "go", proved problematic: the red disks fronting stop lamps could fall out (or perhaps be broken), leaving ambiguity as to what "white" meant.
Switching to red and green lamps meant that the failed-disk mode now clearly indicated a signalling problem, where the signal could not be trusted.
Particularly when they're correcting errors or omissions in other comments. Such as those in mine above to which Animats is replying.
An example of such a system could be a ball check valve, which can inherently only work.
https://en.wikipedia.org/wiki/Check_valve
Can you think of a word to describe such systems?
The first is "impossible".
The second is "pre-failed".
As the drunk has observed, you can't fall off the floor.
If you're looking for a term for a system which is highly immune to failure, "resilient" comes to mind.
Take Tesla's solid-state, no-moving-parts one-way fluid valve. It has no moving parts to break (though it could conceivably be fouled by dust, dirt, sediment, or debris).
http://makezine.com/2012/01/05/the-tesla-valve-one-way-flow-...
"Overengineered" is another possibility.
There's certainly something to be said for retry strategies in places that involve a lot of network chatter, but please don't forget to add some kind of backoff, so you don't end up retry-overloading a system that's trying to recover.
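To make the backoff point concrete, here is a minimal sketch in Python of a retry loop with capped exponential backoff and random jitter. It is not taken from any library discussed in this thread; `retry_with_backoff` and all of its parameters are hypothetical names chosen for illustration, and the defaults are arbitrary.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is used up.

    Hypothetical helper for illustration; sleep and rng are injectable
    so the waiting behaviour can be tested without real delays.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential growth, capped so a wait never exceeds max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same instant.
            sleep(ceiling * rng())
```

Called as `retry_with_backoff(lambda: fetch(url))` (with `fetch` standing in for whatever network call you have), the waits grow roughly as 0.1 s, 0.2 s, 0.4 s and so on, which gives a recovering service progressively more breathing room instead of a constant hammering.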
We released a microservices development kit (MDK) last week that implements similar semantics (e.g., circuit breakers, failover) in Python, JavaScript, Java, and Ruby. The implementation is actually written in a DSL, which we transpile into language-native implementations. We do this to ensure interop between different languages. We're working on updating our compiler to support Go and C#, adding richer semantics, and making the service discovery piece pluggable (currently there's a dependency on our own service discovery).
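For anyone unfamiliar with the circuit breaker idea mentioned here, a rough Python sketch of the general pattern follows. This is not the MDK's API, nor Failsafe's or Hystrix's; `CircuitBreaker`, `CircuitOpenError`, and the parameters are hypothetical names for illustration, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after repeated failures, then fails fast."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: reject without touching the struggling service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let a trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The design choice worth noticing is that the open state is the cheap one: once the threshold is hit, callers get an immediate `CircuitOpenError` rather than queueing more work onto a service that is already failing.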
https://github.com/jhalterman/failsafe/wiki/Comparisons#fail...
For example,
>Executable logic can be passed through Failsafe as simple lambda expressions or method references. In Hystrix, your executable logic needs to be placed in a HystrixCommand implementation
It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda" and hold state somewhere (either as an object field or passed into the lambda). Unless I'm missing something here, either seems acceptable.
Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.