I was out shoveling, and came back in to my phone blowing up. Our systems at IronMountain (formerly Fortrust) in Denver all rebooted at once. These are all on redundant power, with each system's redundant power supplies connecting to different circuits entering the cabinet, and those two circuits fed from 3 PDUs (two separate, one shared). Each of those is supposed to be fed by a separate UPS and generator. Last status update I had says that they are running off generators, but they've been shockingly tight-lipped about it.
Don't get me wrong, it was hi-LAR-ious to call into their NOC and have them pretend that I was the only one having problems. "Can you tell me if there is a major data center outage going on?" "We are trying to gather information, we are making a bunch of client phone calls, we will know after we make those calls." "... Why are you making a bunch of client calls if you aren't having an outage?"
They do run quarterly 'storms' where a datacenter is shut down to test failover and resiliency. I have no idea if today is one of those days, since I left last year.
For instance GitHub's relatively recent shutdown was due to a fail-over heartbeat not going as expected.
I think that is a yes, and he's getting ahead of it by saying "Yes and we have no idea why or ETA so let us do our job".
Granted, they should have a status page.
On the other hand, I need information to be able to do my job: Is this only our cabinet having problems and I need to start rolling to the datacenter (in the middle of a giant blizzard)? Is this possibly some sort of problem with our own power infrastructure? Is something on fire (an EPO triggered by fire could cause this)? Did the roof cave in under the weight of the snow we are getting? Is the power stabilized or is there some indication that power might be up and down?
In short, I need answers to: Do I need to gracefully take down my site to prevent lost transactions and database corruption? Do I need to switch to our backup site?
For context: All of our servers powering off at once and then back on shouldn't be possible. It should require the failure of at least 3 independent pieces of equipment (except at the breaker panel or in our cabinet where it could be only two failures). It is extremely unusual for this to happen: this is the first time it's happened to me, and I've been in that facility since 2004.
So, yes, I respect that you need to do your job. But I also need to do my job.
Plus, I'm pretty sure the guy answering the trouble line, his job WAS talking with the customers. The people working the problem likely didn't include him. This is a huge data center run by a ginormous company. I don't think I was taking him away from twisting a wrench. :-)
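To put rough numbers on the "at least 3 independent pieces of equipment" point above, here is a minimal sketch. The per-component failure probability is a made-up illustration, not a measurement from any real facility, and the whole calculation only holds if the failures really are independent.

```python
from math import prod

def joint_failure_probability(per_component_probs):
    """Probability that every listed component fails in the same window,
    assuming the failures are statistically independent."""
    return prod(per_component_probs)

# Hypothetical numbers: each power path has a 1-in-1000 chance of
# failing in a given window. These are assumptions for illustration only.
three_paths = joint_failure_probability([1e-3, 1e-3, 1e-3])
two_paths = joint_failure_probability([1e-3, 1e-3])

print(f"three independent failures: {three_paths:.0e}")
print(f"two independent failures:   {two_paths:.0e}")
```

A common-cause event (an EPO, a fault in the shared PDU) breaks the independence assumption, which is why a simultaneous reboot raises exactly the questions listed above.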
https://www.denverpost.com/2019/03/13/colorado-weather-bomb-...
But that's my presumption, I don't actually know anything and don't want to imply I do.
It might be resolved, it has to get worse before you escalate it further. They might not know the full facts. Might be worse than it really is. How do you know? You can't judge that because your personal rendering of Facebook failed. You have load balancers and CDNs and A/B testers all getting in the way of delivering data to your machine.
It's too easy to draw a conclusion from the client-side armchair and the provider is absolutely not going to make false promises, for the worse or for the better.
You want to hope that Facebook, in this case, acts on more complete information.
Deny, deny, deny, obfuscate, deny, then blame someone else (usually, YOU).
People would be reactivating their Facebook accounts and having to sift through conspiracy theory posts about Hillary Clinton still just to figure out what was going on.
Edit: The points on this post keep going up and down every time I check these comments. Yes, it was sarcasm, I was joking, but I was trying to point out that most people rely on a small set of services. "Cloud" has centralized things a lot.
Whenever we had any sort of issue we could generally get a good idea of what was happening by looking at changes in traffic in those two web tiers.
If people couldn't play for most reasons, game action traffic would drop to near zero, but the static asset tier traffic would usually at least triple.
So yeah, there are a lot of F5 buttons being hit out there when pages don't load.
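The pattern described above (action traffic collapsing while static asset traffic surges) could be checked with something like the sketch below. The function name, parameters and thresholds are all hypothetical; the only figures taken from the comment are "near zero" and "at least triple".

```python
def looks_like_outage(action_rps, static_rps,
                      baseline_action_rps, baseline_static_rps,
                      action_drop_ratio=0.1, static_surge_ratio=3.0):
    """Return True when game-action traffic has collapsed while static-asset
    traffic has surged, i.e. players are stuck and hammering reload.

    The default thresholds are illustrative assumptions: "near zero" is
    taken as 10% of baseline, "at least triple" as 3x baseline.
    """
    if baseline_action_rps <= 0 or baseline_static_rps <= 0:
        return False  # no usable baseline to compare against
    action_collapsed = action_rps <= baseline_action_rps * action_drop_ratio
    static_surged = static_rps >= baseline_static_rps * static_surge_ratio
    return action_collapsed and static_surged

# Example with made-up request rates (requests per second).
print(looks_like_outage(action_rps=5, static_rps=3200,
                        baseline_action_rps=1000, baseline_static_rps=1000))
```

Requiring both signals at once is a design choice: a drop in action traffic alone could just be a quiet hour, while the reload surge suggests people are trying and failing to get in.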
Wait, people are doing this already: https://twitter.com/SA_SES/status/1105969450698694656
Google sure, but who in the real world cares about Twitter?
Twitter could be down for days and only the technocracy would notice.
It also seems popular with journalists and media companies (e.g. TV shows asking viewers to "tweet us your questions")
This leaves me wondering what software all these places have in common. The application layers are all different, the databases are all different, the containerization and provisioning systems are different, but I imagine that all these systems rely on two things: the global Internet backbone, and maybe the Linux kernel.
Have there been major security vulnerabilities patched lately in the Linux kernel that could have had unintended consequences?
It's telling that one of the hottest areas of distributed systems research these days is the boring topic of configuration management. Google, Microsoft, etc. are paying researchers top dollar to figure out how to prevent massive outages through novel techniques. It is one of the harder problems to solve and requires massive investment in tooling, refactoring, etc.
Curious what makes you think this. Are there specific job postings in either company that are focused on this?
Sometimes you just get unlucky!
-- Ian Fleming (in Goldfinger)
>This leaves me wondering what software all these places have in common.
dunno what systems you're talking about, but seems likely they are mostly x86 systems and maybe even mostly using Intel hardware and microcode
those systems can-be/are rooted and more, to my knowledge
Cisco or Arista
Facebook obviously loses some ad revenue and Facebook customers may lose sales. But do Facebook/Instagram users suffer? How does losing social media for several hours affect the quality of life of users?
Actually the government did block all social media for over a month but that was fixable with vpn. (Follow hashtag #SudanUprising on twitter to learn more)
What I asked was what the effect of sporadic interruptions of a few hours is. I mean, if Facebook had 30% availability, would I lose anything valuable from the experience? Is it that we are just used to it and want it to be there always?
The value of 99.5% availability for __users__ is not clear to me. Instant messaging is an exception to this.
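For a sense of scale, reading the 99.5 figure as a percentage, it works out to roughly 43.8 hours of downtime per year, while the hypothetical 30% availability would mean being down most of the time. A quick sketch of the arithmetic (the function name is just for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(availability_percent):
    """Hours per year a service is unavailable at a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for pct in (99.5, 30.0):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available -> {hours:.1f} hours down per year")
```

So an outage of several hours is a noticeable slice of a 99.5% yearly budget, even if individual users barely feel it.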
I hate Facebook, but to deny its value is pretty naive.
On the other hand, if someone were to sabotage the platform and prove/convincingly argue that they induced the failure, at minimum it would do significant damage to the tech sector and at maximum cause public panic.
This is a hypothetical, not speculation on the cause of this outage.
I can't wait to see the RCA for both of these and if they're related.
Public post Mortem:
Entirely believable technical cause.
(Ignore Stuxnet, Ignore DUQU)
https://stackoverflow.com/questions/4188605/what-is-cavalryl...
Yet another alternative: Third World War has just started, and this was the first battle.
I strongly suspect users are reporting "my Internet is having troubles" because their FB, Messenger, etc. isn't working right.
For example, in the comments of the T-Mobile outage page, there's stuff like "Haven't been able to upload anything to social media all day" and "Cannot send pictures through whatsapp and fb messenger".
Also, check out the "Attacks" tab. That one really lights up. Like seriously lights up. Something is going on... all over. US, China, Russia, EU...
I thought maybe my ISP blocked a port which these services may be transferring their multimedia on.
/Sweden
Now, another interpretation is that the reports are simply false...
Edit: it's back now (8:37 PM UTC)
Source: Me. My career has been spent managing db's for internet scale sites.
Screw ups related to data loss are rare (I've been here years and haven't seen one with the DBs that the stuff I work with uses) but failures at this scale tend to cascade a little ways and it takes time to dig out of the hole. They probably have the problem solved but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.
Amazon once pushed a seemingly-innocuous change to their internal DNS that caused all the routers between and within datacenters to drop their IP tables on the floor. They had to re-establish the entire network by hand---datacenter heads calling each other up and reading IP address ranges over the phone to be hand-entered into lookup tables. Cost a fortune in lost sales for the time the whole site was inaccessible.
Network failures are usually really bad when your system is globally deployed and distributed -- often times you can't even communicate with your machines to deliver fixes :p
https://www.thesslstore.com/blog/expired-certificate-ericsso...
https://www.businessinsider.com/verizon-outage-on-east-coast...
I run a messenger bot platform - the webhooks stopped being delivered _hours_ ago... nothing on their status page until it had been down for hours.
Their current issue...
"We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution."
What? lmao
https://www.akamai.com/us/en/resources/visualizing-akamai/re...
Don't conflate that with fb/insta problems.
Would be interesting to read the post mortem regardless, if there is one.
Edit: Has anyone seen anything of this sort in any of the projects they follow?
Something fun happening in Germany? https://www.akamai.com/us/en/resources/visualizing-akamai/re...
And Level3 traffic going to Argentina? https://twitter.com/bgpstream/status/1105819050968580096
And Great Britain going to Cambodia? https://bgpstream.com/event/197968
https://twitter.com/bgpmon/status/1104919654441467904
Must be because of that Dam blowing up
https://www.newsweek.com/sen-marco-rubio-blames-power-outage...
Major sporting events in the EU.
BGP fuckups appear to happen regularly, based on the tweet history of that account.
My hunch is that it's the end of Q1 and people are trying to release code changes so they can pad their Q1 performance reviews "designed and delivered feature X on time in Q1".
(Oh, turns out the Great Blackout Baby Boom was a myth:
I guess that this is all that I will get. Facebook is never down, it is just making improvements (like restarting the services to make them work again).
https://developers.facebook.com/status/issues/55989644784543...
Current State: Investigating
Description: We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution.
Start Time: 2 hours ago
Last Update: about an hour ago
Updates: There are currently no updates for this issue.
My bet is that people are having problems with FB/Insta and immediately assuming that the whole internet is messed up.
All joking aside, is this news? :/
Edit: Or have other methods than just relying on Facebook authentication
I believe this is why Github's status page is now on its own domain; so a github.com DNS outage won't take it down.
> Let me tell you the difference between Facebook and everybody else. We don't crash ever! If the servers are down for even a day, our entire reputation is irreversibly destroyed. <…>
> Even a few people leaving would reverberate through the entire user base. The users are interconnected. That is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go.
In real life, Facebook had significant issues with uptime in the early years.
> We're focused on working to resolve the issue as soon as possible, but can confirm that the issue is not related to a DDoS attack.
You'd think being down for hours would be negative news and revenue impacting.
With a VPN to the US, I can log in to Insta, but still can't post.
Distributed services are weird man!
What's that weird tagline about?
[1] https://www.manchestereveningnews.co.uk/news/uk-news/faceboo...
Admittedly networking is not my strength, so perfectly happy for someone to shoot down this hypothesis.
[1] https://www.abc15.com/news/national/facebook-down-social-med...
... and nothing of value was lost.