Edit:
Fastly's incident report status page: https://status.fastly.com/incidents/vpk0ssybt3bj
Fastly Engineer 2: I have some very bad news...
Error 503 Service Unavailable
Service Unavailable
Guru *Mediation*:
Details: cache-lon4236-LON 1623146049 854282175
Varnish cache server
edit: 12:05 up again for me, no images or custom fonts loading though ... and down again 1 minute later
edit: 13:01 reliably up again for me
So it is a "performance" issue when all pages give a 503.
> Statuspage Automation updated third-party component Spreedly Core from Operational to Major Outage.
> Statuspage Automation updated third-party component Filestack API from Operational to Degraded Performance.
Oh, right. :-D
Don't get me wrong, I love the proliferation of APIs and easily-integrated services over the past 20 years. We're all one interdependent family, for better and for worse.
edit: PayPal looks to be back up at least in US East but when I turn off my VPN and access from Asia I get "Fastly error: unknown domain: www.paypal.com."
Now I'm seeing a 503
Looks to be working again my end.
"A number of leading media websites are currently not working, including the Guardian, Financial Times, Independent and the New York Times."
Just checked, thank god the NHS vaccine site is still available - vaccines just got rolled out for under 30s today.
:(
Edit: There seems to be a major empathy outage in this thread. Disgusted but not surprised, unfortunately.
I would blame anyone who claimed otherwise or couldn't deal with it while not having a fallback.
It sucks. Working on CDN reliability is like working on wastewater management: the public forgets you exist until something breaks, when they start asking why you weren't doing your job. Fortunately, internal people at least seem to get it -- I hope this is the same as Fastly.
People need to be blamed, and responsibility for actions taken (without covering asses)
Flag and downvote all you want, you know this is true.
The fault is theirs, and they have said that they have failover. This worldwide outage, caused by them, just goes to show that Fastly does not actually have a failover system in place.
> "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime." - status.fastly.com
Even their status page was down. Very embarrassing: Fastly did not work as advertised and misled its customers.
Edit: Offended flaggers circling around silencing misled Fastly customers. How pathetic.
The whole idea of the internet was a distributed network impervious to most attacks.
The reality is that a single failure can knock out 90% of the services people use.
ps. "The Internet was built to survive attacks" is not true. It's a myth made popular by Robert Cringely in the early 1990s. The Arpanet was simply a protocol for mainframes used by computer scientists to connect. The Internet is relatively resilient against attacks, but that was not the "whole idea". It was not in the design at all.
Bob Taylor: “In February of 1966 I initiated the ARPAnet project. I was Director of ARPA's Information Processing Techniques Office (IPTO) from late '65 to late '69. There were only two people involved in the decision to launch the ARPAnet: my boss, the Director of ARPA Charles Herzfeld, and me. The creation of the ARPAnet was not motivated by considerations of war. The ARPAnet was created to enable folks with common interests to connect with one another through interactive computing even when widely separated by geography”.
Vint Cerf says the same about the invention of the TCP/IP transport protocol.
This is the page that should be linked:
As of 10:44UTC, this status page has just updated to say the issue has been identified and a fix is being implemented.
Guys, you are offline with a 503 error, this is a little more than "potential impact to performance".
Doesn't seem the status page is automatically updated or perhaps whatever event or polling is used is also broken.
How come we are affected by this in the Netherlands?
reddit, stackoverflow, github, paypal, pypi, twitter, twitch, NYT, CNN, BBC, the Guardian...
edit: wow, even Amazon.com relies on Fastly for some of its edge caches!
“This basic architecture is 50 years old, and everyone is online,” Cerf noted in a video interview over Google Hangouts, with a mix of triumph and wonder in his voice. “And the thing is not collapsing.”
The Internet, born as a Pentagon project during the chillier years of the Cold War, has taken such a central role in 21st Century civilian society, culture and business that few pause any longer to appreciate its wonders — except perhaps, as in the past few weeks, when it becomes even more central to our lives.
Good luck to the on call engineers!
The internet is designed for redundancy. Wonder why these companies don't have a failover network. Makes me wonder if cost is a factor considering their already massive infra. But a single point of failure ... <confused>.
Well, the Internet was indeed designed for redundancy, and it worked as intended. At no point in time did it fail to make you reach the server it was supposed to make you talk to.
What are failing are all the application protocols that are running on top of the network.
Seems like this is being resolved; curious to see the details afterwards
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKIN...
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/...
ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.12/main: temporary error (try again later)
Or that companies need to have better DNS strategies.
Which is causing $15+ million in lost product sales for every hour of outage.
Not to mention the loss of any new customers.
This happened with Cloudflare before too. I think we are a little too dependent on these services.
/s
A decent number of tries is rejected right at the Varnish front door:
< HTTP/2 503
< server: Varnish
< retry-after: 0
< date: Tue, 08 Jun 2021 10:11:41 GMT
< x-varnish: 271470009
< via: 1.1 varnish
< fastly-debug-path: (D cache-bma1666-BMA 1623147101)
< fastly-debug-ttl: (M cache-bma1666-BMA - - -)
< content-length: 450
<
Service Unavailable
Guru Mediation:
Details: cache-bma1666-BMA 1623147101 271470009
Many more reach some backend system that just dumps "connection failure":
< HTTP/2 502
< content-type: text/plain; charset=utf-8
< content-length: 18
<
connection failure
And a tiny few do get through:
< HTTP/2 200
< content-type: text/html; charset=UTF-8
< cache-control: max-age=0, must-revalidate
< date: Tue, 08 Jun 2021 10:11:43 GMT
< via: 1.1 varnish
< vary: accept-encoding
< set-cookie: ...snip...
< server: snooserv
< content-length: 275036
<
<!doctype html><html>...snip...
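For anyone who wants to count the three cases rather than eyeball them, here is a minimal sketch. It assumes you have already collected each response as a (status, headers) pair with lowercased header names. The names `classify` and `tally` and the sample list are made up for illustration; only the status codes and the `server: Varnish` header come from the output above.

```python
from collections import Counter


def classify(status, headers):
    """Bucket one response into the three cases described above.

    `headers` is assumed to be a dict with lowercased header names.
    """
    server = headers.get("server", "").lower()
    if status == 503 and server == "varnish":
        return "rejected at Varnish"
    if status == 502:
        return "backend connection failure"
    if 200 <= status < 300:
        return "got through"
    return "other"


def tally(responses):
    """Count how many (status, headers) pairs fall into each bucket."""
    return Counter(classify(status, headers) for status, headers in responses)


# Hypothetical sample, not measured data.
sample = [
    (503, {"server": "Varnish"}),
    (502, {"content-type": "text/plain; charset=utf-8"}),
    (502, {"content-type": "text/plain; charset=utf-8"}),
    (200, {"server": "snooserv"}),
]

for bucket, count in tally(sample).items():
    print(f"{bucket}: {count}")
```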
It's still early days, but I'm hopeful that it can provide a real solution to today's CDN centralization.
Unless most nodes are high performance, I guess?
Personally I think a distributed database system, where entries are being made redundant in something like a blockchain+dht, would be a good start?
Decentralizing the internet works if it financially makes sense for platforms to build such tools.
I'm grateful for HN. I rebooted my computer. I thought it was my device and then saw this on my phone while rebooting.
$ nslookup images-eu.ssl-images-amazon.com
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
images-eu.ssl-images-amazon.com canonical name = m.media-amazon.com.
m.media-amazon.com canonical name = media.amazon.map.fastly.net.
Name: media.amazon.map.fastly.net
Address: 199.232.177.16
Name: media.amazon.map.fastly.net
Address: 2a04:4e42:1d::272
Reddit BBC News Twitch.tv Twitter emoji cdn?
are all down 503 service error
Stack Overflow, The Guardian, Gov.uk too as some other biggish names getting hit.
Anyone know if there is any legitimacy to this?
[1] status.fastly.com
Stackoverflow.com, reddit, Quora down. (and probably more, those are the ones I tested)
So many companies sweep this sort of thing under the rug if it’s only customer data that’s been breached. If they can’t sweep they have a high priced PR agency do the communicating.
I do not trust companies who handle things this way.
connection failure
Not sure if that provides anyone here with more insight into what might have caused this!
Edit: and now "I/O error" on Reddit.
I was assuming there are couple of services like Fastly and companies might have architected keeping in mind the alternatives too, I guess.
It should be planned for, especially by major tech organizations like reddit, or Amazon, etc.
But I won't fault news organizations, who already don't have boatloads of money, for not having failover CDNs.
Let's use a handful of providers for everything, they said. It will be cheaper, they said. It will be easier to manage, they said.
And it was cheaper, until downtimes began to affect more and more sites when central SPOFs got hit.
And I wonder how much of that need for these centralized SPOFs actually comes from the sheer absurd amount of bloat, ads, code and assets that sites these days "have" to deliver to the customer. I 'member times when pages had 100kb total size, loaded in an instant and were perfectly usable.
What is fastly? Why are a huge number of web sites dependent on them? They are some kind of web host for companies that don’t want to run their own servers/data centers?
Basically the closer the server serving the webpage is to the end user the faster it is for the end user to see and interact with.
But running servers all over the world 1) isn't efficient 2) costs a lot of money.
So a few companies (fastly, cloud flare, akamai) figured, hey, why don't we build a bunch of small data centers all over the world and then provide a distributed way to serve web traffic from it.
It originally was brought about for services like Netflix, but has expanded greatly.
You still host your servers, but a copy of the webpage/media is given to the CDN to serve to customers.
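To make the "copy is given to the CDN" part concrete, here is a toy sketch of a single edge location. It is an illustration only, not how Fastly or any real CDN is implemented: `EdgeCache` and `origin` are invented names, and real CDNs also expire and revalidate their copies, which this leaves out.

```python
class EdgeCache:
    """Toy stand-in for one CDN edge location (hypothetical design)."""

    def __init__(self, fetch_from_origin):
        # Callable that asks the site's own server for a path.
        self._fetch_from_origin = fetch_from_origin
        self._copies = {}
        self.origin_fetches = 0

    def get(self, path):
        # Serve the local copy if there is one; otherwise go to the
        # origin once and keep the result for later visitors.
        if path not in self._copies:
            self.origin_fetches += 1
            self._copies[path] = self._fetch_from_origin(path)
        return self._copies[path]


def origin(path):
    """Pretend origin server that you still host yourself."""
    return f"<html>content of {path}</html>"


edge = EdgeCache(origin)
edge.get("/index.html")  # first visitor: fetched from the origin
edge.get("/index.html")  # second visitor: served from the edge copy
print(edge.origin_fetches)  # only the first request reached the origin
```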
They literally have their own directly competing CDN product. You'd think they'd be dogfooding it.
Alternatively you could use DNS to fail over to the content you host, instead of another CDN. But in many cases that would be the same as an outage since the CDN exists to reduce the impact of all those requests on your infra
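A rough sketch of the decision behind that kind of DNS failover, under stated assumptions: something runs a health check against the CDN and decides which target the DNS record should point at. The hostnames are placeholders, `is_healthy` and `pick_dns_target` are invented names, and actually updating the record depends on your DNS provider, so it is left out.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=3.0):
    """Return True if `url` answers with a 2xx/3xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 503, refused connections and timeouts.
        return False


def pick_dns_target(candidates, probe=is_healthy):
    """Pick the first target whose health URL passes the probe.

    `candidates` is a list of (dns_target, health_url) pairs in order of
    preference. If nothing is healthy, the last entry (typically your
    own origin) is returned as a last resort.
    """
    for target, health_url in candidates:
        if probe(health_url):
            return target
    return candidates[-1][0]


# Placeholder hostnames, CDN first and self-hosted origin last.
CANDIDATES = [
    ("www.cdn-provider.example.net", "https://www.cdn-provider.example.net/health"),
    ("origin.example.com", "https://origin.example.com/health"),
]

# Simulate the CDN being down without touching the network.
cdn_is_down = lambda url: "cdn-provider" not in url
print(pick_dns_target(CANDIDATES, probe=cdn_is_down))
```

How quickly clients follow a switch like this depends on the record's TTL, and the origin still has to survive the traffic the CDN was absorbing.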
EDIT: Hexdocs is down, elixir-lang.org is down
Edit: Elsewhere in the comments: https://status.fastly.com/incidents/vpk0ssybt3bj
The issue has been identified and a fix is being implemented. Posted 1 minute ago. Jun 08, 2021 - 10:44 UTC
That time to find the issue is always the stressful part. < 1 hour is pretty good for weird stuff, and fortunately the east coast of the US is barely online this early (sorry Europe!).
Presumably the BBC has some kind of fallback in place.
The journalists ought to interview their own techies :)
I thought that one of the principles behind the Internet is to be able to reroute around failures, but neither these service providers nor their clients ever seem to learn.
I guess in their mind that only applies to packet routing not services. SMH
It seems like a pattern that CDNs have overly centralized the web and led to issues like this.
Maybe it's time to build a CDN that distributes your static assets to multiple CDNs and has a set of fallback states for service outages.
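As a minimal sketch of what such fallback states could look like on the fetching side: try the same asset path against each CDN in order and only give up when all of them fail. The base URLs are placeholders and `fetch_with_fallback` is an invented name; it also assumes the asset has already been pushed to every CDN under the same path.

```python
import urllib.request


def http_get(url, timeout=3.0):
    """Default fetcher: return the response body, raise on any failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_fallback(path, base_urls, fetch=http_get):
    """Try `path` on each base URL in order; return (url, body) on success.

    Raises RuntimeError listing every attempt if all of them fail.
    """
    failures = []
    for base in base_urls:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return url, fetch(url)
        except Exception as exc:  # any failure moves us to the next state
            failures.append(f"{url}: {exc}")
    raise RuntimeError("all CDNs failed: " + "; ".join(failures))


# Placeholder CDN hostnames in order of preference.
CDN_BASES = [
    "https://assets.cdn-a.example.net",
    "https://assets.cdn-b.example.net",
]


def fake_fetch(url):
    """Simulated fetcher where the first CDN answers 503."""
    if "cdn-a" in url:
        raise IOError("503 Service Unavailable")
    return b"body { color: black; }"


url, body = fetch_with_fallback("/css/site.css", CDN_BASES, fetch=fake_fetch)
print(url)
```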
I thought CDNs had fallback configured?
What kinds of things do you put in place to manage these kinds of centralised issues that are beyond your control?
Is fixed
Edit: nope, just worked for 2-3 requests (10 secs)
Obligatory LOL ...
:-|
I don't understand this.