Knightmare: A DevOps Cautionary Tale (2014)

(dougseven.com)

120 points · redredhathat · 6y ago · 33 comments

floatingatoll6y ago· 5 in thread

Previous discussions on HN:

2014: https://news.ycombinator.com/item?id=7652036

2015: https://news.ycombinator.com/item?id=8994701

lostlogin6y ago

Thanks - the top comment from vijucat in the 2015 discussion is anxiety inducing.

“ - Ctrl-r for reverse-search through history - typing 'ps' to find the process status utility (of course) - pressing Enter,....and realizing that Ctrl-r actually found 'stopserver.sh' in history instead. (There's a ps inside stoPServer.sh)”

erinaceousjones6y ago

I had a habit of doing `sudo shutdown now` on my desktop as I'm leaving my office. I don't know why, it takes longer than simply hitting the power button.

Didn't notice I was still SSH'ed into "the" server which was at the time a single point of failure for my entire project, and as a lowly not-an-IT-person-just-a-developer in our corporate environment, I didn't have access to the machine to go power it back on. And the IT people I knew who could help had gone home for the day.

Felt super dumb writing that up in the downtime log the next day.

Having read this article, it makes me super glad I'm working on very niche slow-paced stuff which, when it goes down for ~12 hours, is a minor annoyance to our users rather than "you're costing us millions of $currency per minute" :-)

simonh6y ago

Something similar happened in production at work last year, fortunately not in my team.

emmelaich6y ago

Thanks. This comment from 2015 by `ooOOoo` is also worth highlighting.

   > The post is quite poor and suffer a lot from hindsight bias.
   > Following article is so much better
   > https://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/

floatingatoll6y ago

FYI, this doesn’t come through as expected: the link isn’t clickable and the text isn’t readable. Please avoid using code formatting for quotes; > * ... * is a readable alternative. https://i.imgur.com/YvMA1uV.png

JackFr6y ago· 4 in thread

Amusing personal anecdote -- the Knight debacle caused the market in general to tumble. The week before, a coworker of mine -- sure of a market drop but for other reasons -- had bought a raft of puts on the S&P 500. When I saw him looking glum at work after the Knight news broke, I asked him what was wrong: didn't you make a ton? Yeah, he said, but I can't get out 'cause my account's with Knight.

manwithplan6y ago

Cool story, didn't happen. There were no retail trading accounts at Knight. In fact, there was no outside money of any kind. The S&P500 fell about 0.75% on the day in question: a non-trivial decline, but not really remarkable. It was up about 0.4% on the week. Also, this is incredibly not how the OCC deals with members in default.

JackFr6y ago

This literally happened as I described it.

That is to say I have repeated what I was told. And it’s funny — I was all set to go to battle stations over this: 1) The close is not the same as how bad it was intraday. 2) Yes, the SIPC and CFTC have controls and he was able to access his account eventually after the profit opportunity was gone. 3) He was a sophisticated investor, and if Knight had retail accounts he might have been with them.

But in retrospect it’s too clever - it’s much more likely in my estimation that the dude in question with whom I worked was simply full of shit. He tells a bad beat story and it’s not like any of us asked to see statements. It never even occurred to me to doubt it until now.

seanhunter6y ago

+1 story is definitely not true. Knight was a marketmaker and definitely not a retail broker.

Source: used to work in algo trading for GS, so this was my job, also had friends at Knight during this debacle.

alasdair_6y ago

>There were no retail trading accounts at Knight.

The article states “The NYSE was planning to launch a new Retail Liquidity Program (a program meant to provide improved pricing to retail investors through retail brokers, like Knight)”

This pretty strongly implies Knight was a retail broker.

I assume I’m missing something. Can you clarify?

t0mas886y ago· 3 in thread

I'm not sure the conclusion of the post is the "One and Only Answer" because a fully automated deploy process has another risk that has bitten both AWS and Google at some point: fully automatically taking down huge amounts of instances.

jniedrauer6y ago

Not to mention your deployment code itself can be buggy and is very difficult to write tests for. I actually got bitten by this recently. An automated deployment that I wrote years ago had an edge case race condition that could cause multiple deployments running from containers on the same docker host to collide, where the package from the first deployment would be pushed to the target of the second deployment. That deployment worked reliably for years, until one day it didn't. It was... a very stressful day.
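One common way to rule out this class of collision is to give every deployment run its own staging directory, so concurrent runs on the same host cannot overwrite each other's package. The sketch below is only an illustration of that idea, not a reconstruction of the deployment described above; `deploy`, `run_concurrent_deploys`, the service names and the `package.bin` filename are all hypothetical.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def deploy(package_name: str, payload: bytes, target_dir: Path) -> Path:
    """Stage a package in a directory unique to this run, then copy it to its target."""
    # mkdtemp creates a fresh directory with a unique name, so two concurrent
    # deployments on the same host never share a staging path.
    staging = Path(tempfile.mkdtemp(prefix=f"deploy-{package_name}-"))
    try:
        staged_file = staging / "package.bin"
        staged_file.write_bytes(payload)
        target_dir.mkdir(parents=True, exist_ok=True)
        destination = target_dir / "package.bin"
        shutil.copyfile(staged_file, destination)
        return destination
    finally:
        # Always clean up the staging directory, even if the copy failed.
        shutil.rmtree(staging, ignore_errors=True)


def run_concurrent_deploys(root: Path) -> dict[str, bytes]:
    """Run two deployments at once and return what landed in each target."""
    jobs = {"service-a": b"build-A", "service-b": b"build-B"}  # hypothetical services
    threads = [
        threading.Thread(target=deploy, args=(name, payload, root / name))
        for name, payload in jobs.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {name: (root / name / "package.bin").read_bytes() for name in jobs}
```

Calling `run_concurrent_deploys(Path(tempfile.mkdtemp()))` should leave each target holding its own payload. A shared, fixed staging path (for example a single `/tmp/package.bin`) is the kind of design that works for years and then collides the first time two runs overlap.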

SteveNuts6y ago

A lot of times those issues have been "fully automated (but with human inputs)" or "fully automated with no guardrails"

bobbiechen6y ago

This seems to cover all the cases. Either there are guardrails (as human inputs), or there aren't. Unless I'm missing a middle ground here?

toolslive6y ago· 2 in thread

off topic, but a "knightmare" is also a chess term. It's a good-knight-vs-bad-bishop position that went horribly wrong for the owner of the bishop.

evilotto6y ago

Also a Batman alternate timeline

swish_bob6y ago

And a late 80s early 90s children's TV programme.

forgottenpass6y ago· 2 in thread

What's to take away from this?

Automate deployment? Fine but boring. That's the prevailing dogma today. I don't remember where the devops hype train was in 2012. Package management had already been a solved problem for years even though it was (and continues to be) regarded as involving too much "icky reading" and a repository system using plain directories on vanilla webservers; all way too unoptimized for resume padding.

Learn how to identify and manage risk like an engineer? Understand how business process and software can implement risk controls and mitigations?

I kid, so I don't cry.

Traster6y ago

There's a whole slew of lessons to learn from this. Leaving dead code in your system and then deciding to repurpose it. Manually deploying with no verification. No checks in place to disable a system during crazy behaviour. No real alerting system. No procedures in place for when a system goes wrong. No audit log to refer to when rolling back.

The lesson from this article is kind of funny

>It is not enough to build great software and test it; you also have to ensure it is delivered to market correctly so that your customers get the value you are delivering

While true, I don't see any indication this was great software or that it was properly tested.

>Had Knight implemented an automated deployment system – complete with configuration, deployment and test automation – the error that cause the Knightmare would have been avoided.

Or to put it another way - had Knight implemented a higher quality deployment system than the quality of any of their other systems, they might have avoided this issue.

These stories are never about a single thing gone wrong. The whole point about critical systems is that you should need dozens of things to go wrong for them to fail, and then you should fail safe.
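The "check in place to disable a system during crazy behaviour" plus "fail safe" idea can be sketched as a small circuit breaker. This is a minimal illustration under assumed numbers, not how any real trading system does it; `OrderRateBreaker`, its parameters and the notion of "orders per window" are hypothetical.

```python
from collections import deque


class OrderRateBreaker:
    """Trips when more than max_orders are seen within window_seconds; stays tripped."""

    def __init__(self, max_orders: int, window_seconds: float) -> None:
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self.tripped = False
        self._timestamps: deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Return True if an order at time `now` may be sent."""
        if self.tripped:
            # Fail safe: once tripped, nothing goes out until a manual reset.
            return False
        # Drop timestamps that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_orders:
            self.tripped = True
            return False
        self._timestamps.append(now)
        return True

    def reset(self) -> None:
        """Manual reset, meant to be called only after someone has investigated."""
        self.tripped = False
        self._timestamps.clear()
```

With `OrderRateBreaker(max_orders=3, window_seconds=1.0)`, three orders inside one second are allowed, the fourth trips the breaker, and every later call is refused until `reset()` is called. The point of latching is that the system stops by default and a person has to decide to restart it, rather than the other way round.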

wikiman6y ago

The fundamental truth of software. Your system is only as good as its worst component. every. single. time.

Deployment is a component. Monitoring is a component. They are also OpEx and therefore "inferior"

whalesalad6y ago· 1 in thread

Back when I used to smoke I would occasionally hang out with this guy from an investment bank that traded on the Japanese exchange. They had really cool working hours (started a lot later in the day) because we were based in Hawaii, which is a few hours behind Japan.

Anyway, the guy told me that they had multiple big red physical kill switches so that they could immediately turn things off if shit ever hit the fan with their systems.

If you have ever spent time in Michigan you'll notice that the manufacturer test vehicles have a big ass red button on the dashboard to kill the vehicle in case something goes wrong.

I cannot imagine doing anything remotely close to this sort of thing without a big ass red kill switch on my desk.

brazzy6y ago

They did have a kill switch. What they did not have was someone with the authority and guts to throw it in time.

This may have something to do with the fact that killing an HFT bot without some kind of orderly wind-down might leave you with some very expensive open positions.

toomuchtodo6y ago· 1 in thread

This is less DevOps and more poor software engineering practices (code reviews, unit testing, paying off your technical debt through refactoring/removing old code, etc), although properly managing and instrumenting deploys might have stemmed the bleeding and kept losses manageable.

It's good though; poor decisions must have a cost. The only way to enforce good engineering practices that are human time intensive is for there to be a cost not to.

mongol6y ago

I think at its core it is right in the guts of DevOps. The "flag" that protects dead code is dev, and the unforeseen deployment scenario is ops. With a DevOps mindset you need to think of both. I think it is a stellar example of what can go wrong if you don't consider both the dev and the ops aspects.
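Thinking about the flag and the deployment together could look like a preflight check: refuse to turn a flag on unless every host reports the build that the flag was written for. This is a rough sketch under assumptions; the function names, the host names and the idea that each host reports a build string are all hypothetical.

```python
def hosts_with_wrong_build(reported_builds: dict[str, str], expected_build: str) -> list[str]:
    """Return the sorted names of hosts whose reported build differs from the expected one."""
    return sorted(
        host for host, build in reported_builds.items() if build != expected_build
    )


def enable_flag_if_fleet_consistent(
    flag: str,
    reported_builds: dict[str, str],
    expected_build: str,
    enabled_flags: set[str],
) -> bool:
    """Enable `flag` only when every host runs `expected_build`; otherwise change nothing."""
    stale_hosts = hosts_with_wrong_build(reported_builds, expected_build)
    if stale_hosts:
        # A host still on old code might interpret the flag differently,
        # so the safe choice is to leave the flag off.
        return False
    enabled_flags.add(flag)
    return True
```

For example, with `{"host1": "v2", "host2": "v2", "host3": "v1"}` and an expected build of `"v2"`, the check reports `host3` as stale and the flag stays off. It only turns on once all three hosts report `"v2"`.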

jmalicki6y ago· 1 in thread

(2014)

dang6y ago

Added. Thanks!
