“... the system was running too close to the point of exhaustion due to the intermittent crashes, and an event that should not have been a problem [high number of client reconnections] caused a high-severity incident.”