"What is your all time biggest screw up, and how did you come back from it" - I then tell them the story of me loosing several hundred thousand dollars and the funny things that happened around it to set the tone. If you have been in tech for any length of time you have one of these stories (if not a few). I have heard some great ones by simply asking and it gives great insight into a candidate (humor, stress response, the things you have seen).
EDIT: I once got a call from junior developer who had issued an sql statement on an auto-commit database like "update <table> set x=y; where <some condition>; she fat fingered the semicolon and blew up the whole table. She was crying, I felt pretty bad we were able to get the table restored and back online in a few hours. She still tells me how sorry she was to this day ( about 15 years later ).
However, I never did what I saw other people do at another job (more than once), which was to run a delete on a live production table intentionally, but unintentionally leave off the "where" clause. And commit without thinking.
I'm not aware of anything I've screwed up that had a dollar figure attached.
*I am not and never have been a sysadmin per se.
I've done something similar before: I forgot to disable autocommit in DataGrip and ran an UPDATE without a where clause. Thankfully we take weekly full backups, nightly incrementals and archive log segments, so I shut down the database, performed a point-in-time restore and the damage was undone.
This was actually a good teaching moment. Said database is used as a holding area for incoming data from hospitals before we load it into our billing system, so our EDI team has limited write permissions on the database. It's much harder for them to run an "oops, time to grab a backup" query, but it was a nice anecdote to use when telling them to run everything in a transaction if they're running anything other than SELECT statements.
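To make the "run everything in a transaction" advice concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The `claims` table, the `guarded_update` helper and the `max_rows` threshold are all made up for illustration; they are not from the database described above. The idea is simply to check how many rows a statement touched before committing, and roll back if it is more than expected.

```python
import sqlite3


def guarded_update(conn, sql, params=(), max_rows=1):
    """Run one UPDATE/DELETE inside an explicit transaction.

    Commits only if the statement touched at most max_rows rows;
    otherwise rolls back and returns False. (Hypothetical helper.)
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            # Touched more rows than expected: probably a missing WHERE.
            cur.execute("ROLLBACK")
            return False
        cur.execute("COMMIT")
        return True
    except Exception:
        if conn.in_transaction:
            cur.execute("ROLLBACK")
        raise


# isolation_level=None puts sqlite3 in autocommit mode, so the only
# transactions are the ones opened explicitly with BEGIN.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO claims (status) VALUES (?)",
                 [("new",), ("new",), ("new",)])

# Forgot the WHERE clause: all three rows match, so it is rolled back.
applied_bad = guarded_update(conn, "UPDATE claims SET status = 'paid'")

# Intended statement: exactly one row, so it is committed.
applied_ok = guarded_update(
    conn, "UPDATE claims SET status = 'paid' WHERE id = ?", (2,))

statuses = [row[0] for row in
            conn.execute("SELECT status FROM claims ORDER BY id")]
print(applied_bad, applied_ok, statuses)
```

The same pattern works by hand in an interactive client: BEGIN, run the statement, look at the reported row count, and only then COMMIT or ROLLBACK.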
[13:45] core-router# debug all
*** Session Terminated ***
...oops. I've done it more than once, so I must be great :)
Did you own the mistake? (responsibility)
What steps did you take to resolve it? (accountability and ownership)
Are you able to laugh about it, or did you learn something from it? I would hope that you can lay claim to one or the other. (personality)
Can you tell a story? Because the story is personal, it is a topic where you have "mastery" and should demonstrate how you communicate. (communication)
An understanding of your professional background - most of these stories have other "players" in them that help move them along and tell me who you have/had to interact with in the past. (experience)
Remember, I give my own example, which not only hits these points but probably elicits a laugh or two out of the candidate and reminds them that I am not any different from them. I tend to be able to have more conversational interviews after this question, and get better responses and answers from candidates.
Even more importantly, if they haven't made any major mistakes yet, there's a good chance they'll make one once you hire them, because they haven't been through the whole experience and have no scars to show for it. Or maybe they haven't yet had a job where they were responsible for anything that could result in a major incident. There's no lesson in paranoia as great as wiping the wrong storage array, shutting down services in an order other than the sanctioned one, etc.
"Jeremy told Who, me? that his mate asked to be relieved, as he was in a bit of pain. Those requests were denied due to the risk of the power going off and also out of a desire to make the poor chap suffer for his error."
ETA: And he almost lost his hat! Come on man!
Looks like they wanted him to suffer :p
You'd have to take the bolt into the button without releasing it...
This is where the old "It is now safe to shut down your computer." screens of Windows 9x/NT 3 came from. http://i0.kym-cdn.com/photos/images/original/001/286/950/e05...
I learned to code on such machines; they're quite fun. They are easily recognized by the computer requiring a manual power-down after the OS has shut down (Win95 showed a message à la "You can powerdown the machine now").
This is also why those computers said "It's now safe to disable the power" after shutting down.
On old AT systems (the ones where Windows 9x would show "It is now safe to turn off your computer"), one could actually press and hold the power button and the system would stay running. And when you're bored you can also quickly move your finger off the button and jab it down again (this would flip the switch back to on), and if you're quick enough, the system would not see that there was a power interruption.
Indeed the old AT power button was a mains (120V) switch, with thick cables going from and to the power supply unit.
It is BIOS+hardware (the PSU is not involved). As a matter of fact, to "switch on" an ATX power supply (not connected to a motherboard) you normally use a paperclip (or a short piece of cable) to connect the green wire to any of the black ones; see:
https://forum.overclock3d.net/showthread.php?t=394
The whole point is that (unless the PSU has a mains switch and it is turned off) an ATX power supply is always partially ON, powering (parts of) the motherboard at all times (this allows for such things as Wake-on-LAN or switching on via CTRL+F11 or a dedicated key on the keyboard).
Rather, closing the button circuit (i.e., pressing and then releasing) will directly shut off power on the motherboard; the OS cannot prevent this.
So I scp the script over to the mainframe, ssh into it, run it again... and grow disappointed that my puny little perl script is still the bottleneck. How much can this beast take, I wonder. Maybe, if I forked off a couple of children?
In retrospect, I should have let it go at this point. My benchmark was already querying the nameserver at a far higher rate than it would ever encounter in production. I should have written in my report that the performance impact of some configuration changes was negligible if not zero.
But I really wanted to see how many queries this beast could handle. So I kept increasing the number of worker processes hammering BIND with the same queries over and over, until ... my ssh connection dropped. I pinged the mainframe, but I got no response. Ooops.
I was trying to look really busy as the monitoring guy who always looked as if he had just woken up walked down the corridor into our open plan office, grinning, and asked if anyone had something to tell him. Nobody replied. I do not think I have ever been that quiet in my entire life.
"Okay", he said, "the TCP/IP stack on that particular system just crashed, just in case you are wondering.". Oops
"Yeah, but SNA still works", the sysprog replied, "And the LPAR is scheduled for an IPL on Saturday, anyway. It'll do."
Obviously, it was a testing LPAR, so nobody got hurt; they would not let a trainee anywhere near a production system. But let the record show that I did manage to disable VTAM (at least the TCP/IP side of it) with a simple perl script from an unprivileged user account. By accident, but still. Also, I lost about a kilogram in sweat that day.
SNA LPAR VTAM
It's funny how much of what seems new isn't, really. Mainframes had VMs figured out decades ago, in a pretty elegant fashion.
IPL is basically "initial boot", SNA was a network transport. VTAM is to SNA the same as Ethernet is to IP (roughly, I'm skipping over LU6.2/APPC/etc).
You didn't ask about CICS, but it's basically cron+middleware but better.
In fact, if you look at, say AWS, and the set of standardized services, it isn't much different from what a mainframe offered so long ago. Standard, if somewhat limited, interfaces for scheduling, load balancing, VMs, databases, "nosql", events, logging/alerting, etc. Even nods to "microservices" and other things that feel new, but aren't really. Self service is a bit new, but the rest is well established.
[1] better I/O isolation, fewer "noisy neighbor" issues for example
SNA -> Systems Network Architecture
LPAR -> Logical partitions
IPL -> Initial program load
VTAM -> Virtual Telecommunications Access Method
I might be wrong of course
It's really amazing to see how far computing has come in just the past two decades.
It's really quite a lot easier than pressing the wrong power button (I do that too at my desk).
On the flip side, at least when you make this mistake with a VM you're typically not down for long, assuming you have fast-ish storage. On average, any of the VMs I'm responsible for are back up in 60-90 seconds; physical machines can take 5 minutes or more (memory testing, expansion ROMs, etc. all make POST take FOREVER even on modern hardware).
The support personnel were annoyed, as they had to drive over to the facility and manually push the power button.
It would be more accurate to say “some computers in the late 1980s and 1990s”; not all of them were 386s, and not all 386s had this style of switch.
But the Salon.com article was coming out the next morning. I called the writer and asked her if she could push the story back, but she said it was a slow news day and she couldn't. So the article came out and the server got slammed.
My brother needed the server for XMethods, so we did the quickest thing we could think of, which was that night at 3:00 a.m., we took the site down, grabbed an extra PC--a 400 megahertz Celeron, no-memory-in-it machine that I got for free when I opened an eTrade account--and drove to Berkeley where Jim had a shared office.
I remember taking the top off a case for pushpins and mounting it on top of the power switch of the machine so no one could turn it off. Then we put it in the corner under his desk and surrounded it with books, so it just looked like a bunch of stuff under his desk with a little Ethernet cable coming out. And as soon as we turned the site back on, the access logs started flying. It was 5 in the morning!
I think that's just awful.
The Symmetrix had an EPO (Emergency Power Off) which was a red button mounted in a recessed area on the back of the cabinet, and was protected by a plastic lid. To perform an EPO, you had to lift the lid and hold the button down for 30 seconds or so.
One of our DC ops employees was moving a heavy server into a cage and accidentally bumped the corner of the server into the plastic lid. The plastic lid was forced inward and got jammed depressing the EPO button. Moments later the entire Symmetrix powered off.
Later that day, as the word got around, another DC ops employee in a different datacenter looked at the Symmetrix and curiosity got the better of him. He didn't see how it was possible for the plastic lid to get jammed. So he punched the lid with his hand. Moments later that Symmetrix went down too. :-(
We reported this design issue to EMC. A while later, a few of us were on a factory tour at EMC. They pointed out to us the "Loudcloud Stopper" work-around. It was a rubber stopper mounted next to the EPO button that prevented the plastic lid from being pressed inward.
I don't blame the guy for not trying that with a production SAP server though...