Joyent us-east-1 rebooted due to operator error

(help.joyent.com)

104 points · hypervisor · 12y ago · 122 comments

Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved.

bcantrill12y ago· 15 in thread

It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).

mixologic12y ago

I feel bad for the person who made the mistake. Even though it's obviously a systemic problem, and highly unlikely to be an act of negligence, I'm sure he/she doesn't feel too hot right now.

jsmthrowaway12y ago

It's operations. You fuck up, you suck it up, you fix it, then (and this is the important part) you prevent it from ever happening again. Feeling like shit for bringing something down is a good way to give yourself depression, given how often you will screw the pooch with root. In the same vein, anybody who says they'd fire the operator without any qualification on that remark should be given a wide berth.

People tend to forget that "fixing it" isn't just technical, it involves process, too. Every new hire that whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.

myrandomcomment12y ago

Back in the day we used to say there are 2 types of network engineers, those that have dropped a backbone and those that will drop a backbone.

2 more replies

LnxPrgr312y ago

I've seen generally brilliant people get bitten by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, obviously taking a production server down with it.

Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.

I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.

Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!

Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.

socceroos12y ago

My thoughts exactly. Poor fella.

I've seen worse though. A newish officer spilled his morning coffee into the circuitry of a device worth over 10 zeros. Immediately short circuited.

themodelplumber12y ago

Wow. Did this gold-plated B-2 bomber still fly after the coffee incident?

jsmthrowaway12y ago

You know you could build five space shuttles with ten zeros, right? Are we talking dollars or Yen?

2 more replies

pdeva112y ago

if something worth over 10 zeroes can be destroyed with a coffee spill, i would say it had it coming

2 more replies

malchow12y ago

How many non-zero integers were in the price of the device, though? My computer is also worth more than 10 zeros.

1 more reply

linker300012y ago

...then there was the new server room that was built with one of the 'big red buttons' conveniently placed behind the pull cord for the lights.

Why, yes...a couple of times...before a perspex arch was less-than-hastily fixed over the button..

xeroxmalf12y ago

Seems like not allowing food or drink near a device worth over 10 zeroes would be a no-brainer, but hindsight is tricky like that.

tedsanders12y ago

That sounds so awful. I can't imagine living the rest of my life knowing that I had been a net negative in the world. All of my life's earnings would just be a partial restitution of that one second of destruction.

1 more reply

dewiz12y ago

supposedly 0.0000000001 billions of $

shit happens, design for the worst.

1 more reply

ddorian4312y ago

really? what can cost that much ?

2 more replies

akerl_12y ago

As a request: It looks like each time the status page is updated, the old UPDATE: <words> is removed. For the future, it would be great if the older updates were preserved so that people looking back could understand the chain of events, rather than just seeing the first / last pieces.

SEJeff12y ago· 12 in thread

As I've always said, "You can never protect a system from a stupid person with root".

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins)

akerl_12y ago

If you think that "hire great sysadmins" prevents somebody from fatfingering, you must be hiring from some more evolved species. Nobody is immune to mistakes; preventing this kind of issue is something the infrastructure and procedures should do.

llamataboot12y ago

I don't think "just hiring great sysadmins" is possible. People have off-days or are tired or sick, new people get on-boarded, even great people make mistakes, etc.

protomyth12y ago

...or accidentally switch which of the 25 term sessions they had open

I tend to color my production terms in a red background / yellow font scheme. It tends to inspire the tired brain to understand you are in production.

incision12y ago

Not only do you consider mistakes the province of stupid people doing dumb things, but you're crediting yourself with a proverb about it and suggesting that you possess the ability to sniff these people out from the 'great' ones.

Get a grip, you're recursively full of yourself.

SEJeff12y ago

Wow HN seriously!? I never once pretended that I'm able to hire people who don't make mistakes, only that you can't protect systems from administrators who mess up.

Get a grip people.

wmf12y ago

So don't give anyone root on an entire data center.

akerl_12y ago

Is this like Captain Planet? It's a bit exceptional to divide access to servers of similar type between administrators such that individuals have full access to a portion of the fleet. Do they meet up and put their rings together to roll out updates? What if one of them goes on vacation?

lmm12y ago

There are keysharing protocols; you can do something like 5 sysadmins have a split of the master key such that any 3 of them can access the master account.
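The "any 3 of 5" split described above is commonly done with a threshold scheme such as Shamir's secret sharing. Below is a minimal sketch in Python, assuming a toy prime field; the names `split_secret` and `recover_secret` are hypothetical and this is an illustration of the idea, not a vetted cryptographic implementation.

```python
import secrets

# Assumption: a Mersenne prime (2**127 - 1) larger than any secret we split.
PRIME = 2**127 - 1

def split_secret(secret, shares=5, threshold=3):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the chosen field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        # Horner's rule, modulo PRIME.
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    # One (x, y) point per sysadmin.
    return [(x, evaluate(x)) for x in range(1, shares + 1)]

def recover_secret(points):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With the default threshold, any three of the five points passed to `recover_secret` give back the original value, while fewer than three do not determine it.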

2 more replies

peterwwillis12y ago

Actually, capabilities make it trivial to lock down things like shutdown for admin accounts. A script can do the shutdown instead in a more controlled and less error-prone fashion. Same for network device updates. Abstraction.

Gravityloss12y ago

You're going to need a bigger crew.

mikeash12y ago

That has its own risks. There might be some catastrophe that needs root access on everything to fix, and you can't reach enough people to get it....

vertex-four12y ago

Then put the keys to datacenter-wide root somewhere safe (with a manual-ish process to access and use them), but out of the way and with alarms on it (the same alarms that you'd use in the absolute worst situation possible). Make sure anyone using it will be shamed if they don't absolutely have to.

2 more replies

alrs12y ago· 10 in thread

Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.

evan_12y ago

> sounds like they're throwing a sysadmin under the bus

at least they didn't name the operator in question...

elijahwright12y ago

Our internal culture is such that everyone on the team would rather be blamed for something than accuse someone else of doing it. That's shitty, and not something you do to someone. You fix the problem and then you move on.

If it makes you happy, blame me - I don't mind.

sokoloff12y ago

At my $DAYJOB, we are always careful to figure out exactly what happened, including by whom. It's not to assign personal blame, but I believe it's critical that everyone agrees on the facts (who, what, when, where, and [if possible] why).

Response and conversation is always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.

IOW, I agree with what I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole that from somewhere else).

It's a very powerful signal to the whole team when you first see individuals "admitting" to exactly what they did, how it caused or contributed to the outage, and to hear them thanked for their contribution of understanding in the post-mortem.

Senior leadership (including myself, who originally instituted the entire process a decade ago) is very clear that we want to know the facts and that in seeking and using those facts, we're only focused on the future, no matter how boneheaded the individual actions appear with the benefit of hindsight and knowledge that they'd lead (in)directly to an outage. I run operations and also participate in the promotion discussions for all technologists, and in 11 years, I've never heard a negative shadow cast onto a sysadmin/sysengineer from their actions during or leading to a production outage. And we've (collectively) made our fair share of mistakes over the years. That doesn't stop good employees from feeling bad about it, but that's a personal feeling they have, not from the fear of it being a professional black mark.

1 more reply

hack_edu12y ago

BTW, this is the right way to do it. :)

1 more reply

knodi12y ago

Sure, blame the engineers. You give power and people use it badly: blame the engineer for giving too much power. You don't give enough power, and sysadmins/users bitch and yell, "Why don't we have enough power? We're not children."

It's always the engineer's fault. :(

alrs12y ago

Systems engineers, software engineers, architects, whatever. We're all in the same gang.

My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.

jsmthrowaway12y ago

This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that don't make any sense when you're designing the system.

By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.

I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.

That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).

There's probably a broader term for operational philosophy like this.

2 more replies

CHY87212y ago

It's a combined fault. Clearly the operator made a mistake, but the system shouldn't have let such a calamitous operation take place without at least three levels of "Are you sure" (or something smarter like "Confirm how many servers would you like to reboot:") before it lets you take down thousands of servers.
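The "confirm how many servers" idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name `confirm_reboot`, the threshold of 10 hosts, and the `override` flag (which keeps the operator able to go ahead deliberately, as argued elsewhere in the thread).

```python
def confirm_reboot(targets, typed_count, max_without_override=10, override=False):
    """Return the hosts to reboot, or raise if the confirmation does not hold.

    `typed_count` is what the operator typed when asked how many servers
    the command will reboot. All names and limits are hypothetical.
    """
    expected = len(targets)
    # The operator must state the blast radius, not just answer "y".
    if str(typed_count).strip() != str(expected):
        raise ValueError(
            f"confirmation mismatch: command targets {expected} hosts, "
            f"operator typed {typed_count!r}"
        )
    # Large operations need an explicit, deliberate override.
    if expected > max_without_override and not override:
        raise PermissionError(
            f"{expected} hosts exceeds limit of {max_without_override}; "
            "re-run with override to proceed"
        )
    return list(targets)
```

A stray wildcard that expands to the whole fleet fails the first check, because the operator types the number they expected rather than the number actually matched.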

berns12y ago

Joyent's marketing is not the most transparent. They haven't updated AWS prices in their pricing page since AWS lowered their prices two months ago.

4ad12y ago

What?

Joyent doesn't use AWS.

shiftpgdn12y ago· 9 in thread

Let this be a lesson to linux admins. Re-alias shutdown -r now into something else on production servers. I once took down access to about 6000 servers because I ran the script to decommission servers on our jump box when I got the SSH windows confused.

shagie12y ago

At one point, I worked in a computer lab that was mostly Ultrix machines. The shutdown grace period was specified in minutes ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

Then we got a hp-ux machine in the lab. For some reason, the grace period on that system was in seconds ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

System dax shutting down in 5 seconds.

_hnwo12y ago

Might I suggest molly-guard: https://packages.debian.org/unstable/admin/molly-guard

imrehg12y ago

Cheers for this! Would have saved me so much grief before. Now going around and installing it on the servers I manage (fortunately nothing mission critical, but many remote).

larrys12y ago

"when I got the SSH windows confused"

I've come close to that as well.

This reminds me of the paradox of being competent vs. a beginner.

It also has parallels in a few things outside computing.

Beginners make different mistakes because they don't know enough to go quickly.

Once you are experienced you fly, similar to the way you drive in a trance without thinking sometimes.

With power tools I've seen this as well. You tend to take more chances the more experience you have (or even in my case getting cut with an exacto knife). Someone using a saw for the first time is going to go slowly and follow the directions (of course there are other types of safety mistakes they could make for sure..)

While a newbie might do rm -fr directory * instead of rm -fr directory* an experienced user could do that as well [1] simply by going too fast and not thinking "hey I'm doing something dangerous let me slow down and check before I auto hit return".

[1] I typically do

for i in something*; do echo "$i"; done

Then if I like what I see I will up arrow and insert "rm -fr $i" after the echo. Or maybe a read x to pause in between.

(Note: I'm not a sysadmin but I've done sysadmin tasks over many years because it is kind of relaxing in a way..)
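The same "look first, then delete" habit can be written as two separate steps in a script. This is a rough Python sketch, with hypothetical function names `plan_removal` and `remove_planned`; it handles plain files only and assumes the caller reviews the plan in between.

```python
import glob
import os

def plan_removal(pattern):
    """Dry run: return what the pattern matches without deleting anything."""
    return sorted(glob.glob(pattern))

def remove_planned(paths):
    """Delete exactly the reviewed paths; the pattern is never expanded again."""
    for path in paths:
        os.remove(path)
```

Usage would be to print the result of `plan_removal("something*")`, read it, and only then pass that same list to `remove_planned`. Because the list is fixed at review time, a pattern typed wrong shows up in the preview instead of in the damage.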

cordite12y ago

I once put `shutdown -h now` (halt) instead of `shutdown -r now` (reboot)

Once I realized what had happened on the production server I ended up calling OVH (and they were helpful but not immediately acting).

It's not a good feeling.

icebraining12y ago

I tend to use /sbin/reboot instead, it amounts to the same (calls shutdown), but it's harder to get it mixed up.

smtddr12y ago

This happened to me once; I don't know if this works on all linux distros but if you quickly follow a halt/shutdown with a "sudo init 6"(reboot) before your ssh-session gets SIGTERMed/KILLed, the box comes back up. This at least worked on some Ubuntu version a few years back.

Give it a try on some system that's not critically important :)

cordite12y ago

Yeah, but the problem is when you honestly didn't realize you'd called a halting shutdown until the server doesn't come back 5 minutes later and then you review the terminal

stavrus12y ago

A similar case happened with the Eve Online cluster (~50,000 concurrent users) a couple of years ago. A programmer, who for some reason had access to the live cluster, confused his local development instance with that of the live cluster and issued a shutdown. Luckily they were able to avert the incoming disaster in time (it was a timed shutdown), but jokes are still made about the mistake.

http://oldforums.eveonline.com/?a=topic&threadID=1232785

jameshart12y ago· 7 in thread

DevOps means being able to take out an entire datacenter with a single keystroke...

stephengillie12y ago

As a Devops, I can't justify building any automated way to down or restart all of my systems at once. We've only had to do that to resolve router reconvergence storms when changing out (relatively) major infrastructure pieces, such as our Juniper router.

michaelt12y ago

You don't intentionally build an automated way to take down all your servers at once.

You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.

Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.

Or you build a way to automatically deploy the post-heartbleed replacement certificates to all your TLS servers, and only after you do that you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.

Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that huge log file that constantly changes gets snapshotted every time, and everything is hanging because of lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.

Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.

codexon12y ago

Once I typed

  rm -rf logs_ *

instead of

  rm -rf logs_*
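Why the one space matters: the shell splits the command into words before expanding wildcards, so `logs_ *` becomes two arguments, and the bare `*` then matches everything in the directory. A small Python illustration using the standard `shlex` module, which mimics POSIX word splitting but does not expand globs (`shell_words` is just a hypothetical helper name):

```python
import shlex

def shell_words(command):
    """Split a command line into words roughly the way a POSIX shell does."""
    return shlex.split(command)

# The stray space turns one pattern into a literal name plus a bare wildcard.
print(shell_words("rm -rf logs_ *"))   # ['rm', '-rf', 'logs_', '*']
print(shell_words("rm -rf logs_*"))    # ['rm', '-rf', 'logs_*']
```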

2 more replies

akerl_12y ago

Unfortunately, the same tools that allow someone to automate management of systems can easily become catastrophic.

As one of the other commenters noted, a ~20 character salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter, I expect this will be the case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.

tommu12y ago

Sorry - are you telling us you had to reboot all nodes because you swapped a router out? Sounds like you need a network engineer.

tommu12y ago

And I'm being downvoted for that? Seriously? In 13 years of networking I have never once had to reload a machine to help with OSPF or BGP convergence. Good networking architecture and planning should mitigate anything other than a couple of minutes of outage. No routing change should ever require a reload of a server or end node.

2 more replies

akurilin12y ago

DevOps Borat is going to have a field day today.

jordanthoms12y ago· 7 in thread

Looks like the janitor needed somewhere to plug in the vacuum cleaner again...

hack_edu12y ago

Not even just the plug. I've had outages from bits flipped simply by the static electricity generated when vacuuming near servers.

linker300012y ago

5W walkie talkies in a big sports complex with the RF getting into the keyboard controllers and acting like a maniac was punching the keyboard - would eventually hang the servers.

Fix: Replace cheapened-keyboards-with-mylar-film-(not)-screening with older models that had a full metal cage around the keyboard assembly.

socceroos12y ago

You have carpets in your server room??

saganus12y ago

I assume bash.org?

gknoy12y ago

He might be referring to The daily WTF (worse than failure):

http://thedailywtf.com/Articles/I-Didnt-Do-Anything.aspx

Unintentional Mishap while Contractor Unplugs X to fix/maintain Y is a relatively common theme on their list of horror stories.

edit: I think he might actually have meant this one: http://thedailywtf.com/Articles/I-Told-You-So.aspx

wiml12y ago

It's a truly ancient anecdote; it probably predates the Internet.

The first example in RISKS is in 1994: http://catless.ncl.ac.uk/Risks/15.59.html#subj3.1 but the canonical version of the story is in a Cape Town hospital in 1996: http://web.archive.org/web/20040624065333/http://www.legends...

saganus12y ago

And I get 2 downvotes for this? really? downvoters care to explain why, just for asking if it was a reference from bash? Wow... Edit: Thanks to the other 2 posters who provided alternative sources. You learn by asking, no? or at least some of us do..

dharbin12y ago· 4 in thread

salt '*' system.reboot

quickdry2112y ago

> hubot restart all on prod

oh shit i meant stag fuckfuckfuckfuck

qbrass12y ago

So have hubot second guess any changes to production unless you specifically told it you were messing with prod beforehand. Have it wait a few seconds before doing something important and listen for sounds of regret.

>hubot restart all on prod

hubot: > say "Hubot isn't responsible for hosing production because I actually meant staging"

>Hubot isn't responsible for hosing production because I actually meant staging

hubot: okay, don't say I didn't warn you.

>oh shit i meant stag fuckfuckfuckfuck

hubot: I hadn't started yet, but I'm doing it anyway just to teach you a lesson.

angersock12y ago

Why in the name of all that is holy do you have Hubot getting access to your production boxen?

Why does that seem like a good idea, ever?

akoumjian12y ago

My thought, exactly. Time to set up some good ACLs :-) http://docs.saltstack.com/en/latest/ref/clientacl.html

devinegan12y ago· 3 in thread

Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...

rincebrain12y ago

Howso?

devinegan12y ago

Thanks for asking rather than just down-voting. I wanted others to know that this isn't isolated. We have been having issues with their service for a few months now. They never know when there is a problem with hardware, for instance. Joyent support will gladly tell you everything is fine. After you insist, and insist they will actually have someone look at the underlying infrastructure. Eventually they will acknowledge the problem and fix it (maybe). I believe the monitoring and reporting for their team is flawed or incomplete which leads to more downtime of affected systems. Just one observation, but we have had three incidents over the past month and a half. Two within a week of each other.

bcantrill12y ago

I'm sorry to hear about your experience; we pride ourselves on being able to root-cause problems regardless of where they might be in the stack, but it sounds like your problem didn't get properly escalated. If you want to reach out to me privately (my HN login at acm.org), we can try to figure out what happened here -- with my apologies again for the subpar experience.

lukasm12y ago

Mandatory DevOps Borat

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav "Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."

Diederich12y ago

The 'devops' automation I made at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

That is, 'write' style automation changes (as opposed to most 'remediation' style changes) would only proceed, on a box by box basis, if the affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shutdown the system' command to all boxes, it would only take down a portion of all of the boxes before automatically aborting because of critical monitoring alerts.

Parallel was calculated based on historical but manually approved load levels for each cluster, compared to current load levels. So parallel runs faster if there's very low load on a cluster, or very slowly if there's a high load on a cluster.

One way or another, most automation should automatically stop 'doing things' if there are critical alerts coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
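A rough sketch of that rule in Python, under stated assumptions: `rolling_apply` is a hypothetical name, `action` stands in for whatever change is being rolled out, and `alert_status` is an assumed callback that returns `False` when monitoring is current and clean, `True` when there are critical alerts, and `None` when alert data is missing or stale. It leaves out the load-based parallelism described above and uses a fixed batch size.

```python
def rolling_apply(hosts, action, alert_status, batch_size=2):
    """Apply `action` batch by batch; stop unless monitoring is verifiably clean.

    Returns the list of hosts that were actually changed.
    """
    done = []
    for start in range(0, len(hosts), batch_size):
        # Proceed only on a positive "no critical alerts" answer.
        # Both True (alerts firing) and None (no current data) abort the run.
        if alert_status() is not False:
            break
        for host in hosts[start:start + batch_size]:
            action(host)
            done.append(host)
    return done
```

With this shape, a mistaken fleet-wide shutdown takes down the first few batches and then stops once the alerts it caused start arriving, instead of continuing through every box.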

shanselman12y ago

"What's this button do?"

j / k navigate · click thread line to collapse

122 comments

78 comments · 11 top-level

bcantrill12y ago· 15 in thread

mixologic12y ago

I feel bad for the person who made the mistake. Even though its obviously a systemic problem, and highly unlikely to be an act of negligence, Im sure he/she doesnt feel too hot right now.

jsmthrowaway12y ago

myrandomcomment12y ago

Back in the day we used to say there are 2 types of network engineers, those that have dropped a backbone and those that will drop a backbone.

2 more replies

LnxPrgr312y ago

I've seen generally brilliant people be bit by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, obviously taking a production server down with it.

Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!

Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.

socceroos12y ago

My thoughts exactly. Poor fella.

I've seen worse though. A newish officer spilled his morning coffee into the circuitry of a device worth over 10 zeros. Immediately short circuited.

themodelplumber12y ago

Wow. Did this gold-plated B-2 bomber still fly after the coffee incident?

jsmthrowaway12y ago

You know you could build five space shuttles with ten zeros, right? Are we talking dollars or Yen?

2 more replies

pdeva112y ago

if something worth over 10 zeroes can be destroyed with a coffee spill, i would say it had it coming

2 more replies

malchow12y ago

How many non-zero integers were in the price of the device, though? My computer is also worth more than 10 zeros.

1 more reply

linker300012y ago

...then there was the new server room that was built with one of the 'big red buttons' conveniently placed behind the pull cord for the lights.

Why, yes...a couple of times...before a perspex arch was less-than-hastily fixed over the button..

xeroxmalf12y ago

Seems like not allowing food or drink near a device worth over 10 zeroes would be a no-brainer, but hindsight is tricky like that.

tedsanders12y ago

1 more reply

dewiz12y ago

supposedly 0.0000000001 billions of $

shit happens, design for the worst.

1 more reply

ddorian4312y ago

really? what can cost that much ?

2 more replies

akerl_12y ago

SEJeff12y ago· 12 in thread

As I've always said, "You can never protect a system from a stupid person with root".

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins)

akerl_12y ago

llamataboot12y ago

I don't think "just hiring great sysadmins" is possible. People have off-days or are tired or sick, new people get on-boarded, even great people make mistakes, etc.

protomyth12y ago

...or accidentally switch which of the 25 term sessions they had open

I tend to color my production terms in a red background / yellow font scheme. It tends to inspire the tired brain to understand you are in production.

incision12y ago

Get a grip, you're recursively full of yourself.

SEJeff12y ago

Wow HN seriously!? I never once pretended that I'm able to hire people who don't make mistakes, only that you can't protect systems from administrators who mess up.

Get a grip people.

wmf12y ago

So don't give anyone root on an entire data center.

akerl_12y ago

lmm12y ago

There are keysharing protocols; you can do something like 5 sysadmins have a split of the master key such that any 3 of them can access the master account.

2 more replies

peterwwillis12y ago

Gravityloss12y ago

You're going to need a bigger crew.

mikeash12y ago

That has its own risks. There might be some catastrophe that need root access on everything to fix, and you can't reach enough people to get it....

vertex-four12y ago

2 more replies

alrs12y ago· 10 in thread

Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.

evan_12y ago

> sounds like they're throwing a sysadmin under the bus

at least they didn't name the operator in question...

elijahwright12y ago

If it makes you happy, blame me - I don't mind.

sokoloff12y ago

Response and conversation is always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.

IOW, I agree with I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole that from somewhere else).

1 more reply

hack_edu12y ago

BTW, this is the right way to do it. :)

1 more reply

knodi12y ago

Its always the engineer fault. :(

alrs12y ago

Systems engineers, software engineers, architects, whatever. We're all in the same gang.

My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.

jsmthrowaway12y ago

This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that don't make any sense when you're designing the system.

There's probably a broader term for operational philosophy like this.

2 more replies

CHY87212y ago

berns12y ago

Joyent's marketing is not the most transparent. They haven't updated AWS prices in their pricing page since AWS lowered their prices two months ago.

4ad12y ago

What?

Joyent doesn't use AWS.

shiftpgdn12y ago· 9 in thread

shagie12y ago

At one point, I worked in a computer lab that was mostly Ultrix machines. The shutdown grace period was specified in minutes ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

Then we got an HP-UX machine in the lab. For some reason, the grace period on that system was in seconds ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

System dax shutting down in 5 seconds.

_hnwo12y ago

Might I suggest molly-guard: https://packages.debian.org/unstable/admin/molly-guard
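
For anyone who cannot install the package, the underlying idea can be sketched in a few lines: make the operator type the name of the machine before anything drastic runs. This is not molly-guard's code, and `confirm_host` is a hypothetical name; treat it as an illustration of the pattern only.

```shell
#!/bin/sh
# Hypothetical helper, not molly-guard's implementation: ask the operator to
# type the name of the machine they believe they are on before proceeding.
confirm_host() {
    expected=$1    # name of the host this shell is actually running on
    printf 'Type the hostname of the machine to shut down: ' >&2
    read -r typed || return 1
    if [ "$typed" = "$expected" ]; then
        return 0
    fi
    printf 'Got "%s", but this is "%s". Aborting.\n' "$typed" "$expected" >&2
    return 1
}

# Usage (echo keeps the example harmless):
# confirm_host "$(hostname)" && echo "would run: shutdown -r now"
```

Typing the wrong name, which is what happens when SSH windows get confused, makes the function return non-zero, so the command after `&&` never runs.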

imrehg12y ago

Cheers for this! Would have saved me so much grief before. Now going around and installing it on the servers I manage (fortunately nothing mission critical, but many remote).

larrys12y ago

"when I got the SSH windows confused"

I've come close to that as well.

This reminds me of the paradox of being competent vs. a beginner.

It also has parallels in a few things outside computing.

Beginners make different mistakes because they don't know enough to go quickly.

Once you are experienced you fly, similar to the way you drive in a trance without thinking sometimes.

[1] I typically do

  for i in something*; do echo $i; done

Then if I like what I see I will up arrow and insert "rm -fr $i" after the echo. Or maybe a read x to pause in between.

(Note: I'm not a sysadmin, but I've done sysadmin tasks over many years because it is kind of relaxing in a way.)
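
The echo-first habit described above can be wrapped up so the preview and the delete are the same command with one flag added. A minimal sketch, assuming a hypothetical function name `preview_or_delete` and a hypothetical `--doit` flag:

```shell
#!/bin/sh
# Hypothetical helper: list what would be removed; delete only with --doit.
preview_or_delete() {
    doit=0
    if [ "${1:-}" = "--doit" ]; then
        doit=1
        shift
    fi
    for i in "$@"; do
        [ -e "$i" ] || continue    # an unmatched glob stays literal; skip it
        if [ "$doit" = 1 ]; then
            rm -fr -- "$i"
        else
            echo "would remove: $i"
        fi
    done
}

# First pass:  preview_or_delete something*
# Second pass: preview_or_delete --doit something*
```

Because the second pass is the first pass plus a flag, the glob that gets deleted is the one that was just reviewed, not a retyped one.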

cordite12y ago

I once put `shutdown -h now` (halt) instead of `shutdown -r now` (reboot)

Once I realized what had happened on the production server I ended up calling OVH (and they were helpful but not immediately acting).

It's not a good feeling.

icebraining12y ago

I tend to use /sbin/reboot instead, it amounts to the same (calls shutdown), but it's harder to get it mixed up.

smtddr12y ago

Give it a try on some system that's not critically important :)

cordite12y ago

Yeah, but the problem is when you honestly don't realize you called a halting shutdown until the server doesn't come back 5 minutes later and you review the terminal.

stavrus12y ago

http://oldforums.eveonline.com/?a=topic&threadID=1232785

jameshart12y ago· 7 in thread

DevOps means being able to take out an entire datacenter with a single keystroke...

stephengillie12y ago

michaelt12y ago

You don't intentionally build an automated way to take down all your servers at once.

Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.

codexon12y ago

Once I typed

  rm -rf logs_ *

instead of

  rm -rf logs_*
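
The stray space matters because the shell splits on it before `rm` ever runs: `logs_ *` becomes the literal word `logs_` plus a glob matching everything in the directory. A harmless way to see this is to print the arguments instead of deleting them; `show_args` below is a hypothetical helper, not a standard tool.

```shell
#!/bin/sh
# Hypothetical helper: print each argument on its own line, in brackets,
# to show what a command would receive after the shell expands globs.
show_args() {
    for a in "$@"; do
        printf '[%s]\n' "$a"
    done
}

# In a directory containing logs_1, logs_2 and notes.txt:
#   show_args logs_*     lists only the logs_ files
#   show_args logs_ *    lists the word logs_, then every file in the directory
```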


akerl_12y ago

Unfortunately, the same tools that allow someone to automate management of systems can easily become catastrophic.

tommu12y ago

Sorry - are you telling us you had to reboot all nodes because you swapped a router out? Sounds like you need a network engineer.

tommu12y ago


akurilin12y ago

DevOps Borat is going to have a field day today.

jordanthoms12y ago· 7 in thread

Looks like the janitor needed somewhere to plug in the vacuum cleaner again...

hack_edu12y ago

Not even just the plug. I've had outages from bits flipped simply by the static electricity generated when vacuuming near servers.

linker300012y ago

5W walkie talkies in a big sports complex with the RF getting into the keyboard controllers and acting like a maniac was punching the keyboard - would eventually hang the servers.

Fix: Replace cheapened-keyboards-with-mylar-film-(not)-screening with older models that had a full metal cage around the keyboard assembly.

socceroos12y ago

You have carpets in your server room??

saganus12y ago

I assume bash.org?

gknoy12y ago

He might be referring to The Daily WTF (worse than failure):

http://thedailywtf.com/Articles/I-Didnt-Do-Anything.aspx

Unintentional Mishap while Contractor Unplugs X to fix/maintain Y is a relatively common theme on their list of horror stories.

edit: I think he might actually have meant this one: http://thedailywtf.com/Articles/I-Told-You-So.aspx

wiml12y ago

It's a truly ancient anecdote; it probably predates the Internet.

saganus12y ago

dharbin12y ago· 4 in thread

salt '*' system.reboot

quickdry2112y ago

> hubot restart all on prod

oh shit i meant stag fuckfuckfuckfuck

qbrass12y ago

>hubot restart all on prod

hubot: > say "Hubot isn't responsible for hosing production because I actually meant staging"

>Hubot isn't responsible for hosing production because I actually meant staging

hubot: okay, don't say I didn't warn you.

>oh shit i meant stag fuckfuckfuckfuck

hubot: I hadn't started yet, but I'm doing it anyway just to teach you a lesson.

angersock12y ago

Why in the name of all that is holy do you have Hubot getting access to your production boxen?

Why does that seem like a good idea, ever?

akoumjian12y ago

My thought, exactly. Time to set up some good ACLs :-) http://docs.saltstack.com/en/latest/ref/clientacl.html

devinegan12y ago· 3 in thread

rincebrain12y ago

How so?

devinegan12y ago

bcantrill12y ago

lukasm12y ago

Mandatory DevOps Borat

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav "Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."

Diederich12y ago

The 'devops' automation I made at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

shanselman12y ago

"What's this button do?"
