I understand we do not have the technology for that just yet, and DevOps engineers able to configure TLS terminators on their own are worth their weight in gold.
Hard to imagine how the Internet could ever exist without Cloudflare.
> I understand we do not have the technology for that just yet
I looked at my router, remembered the term "packet-switched network", and wept.
It's not just packet routing, though; many of their other products seem to be affected as well.
Cloudflare Is Down (Again) - https://news.ycombinator.com/item?id=38116892 - Nov 2023 (2 comments)
Cloudflare API Down - https://news.ycombinator.com/item?id=38112515 - Nov 2023 (141 comments)
Cloudflare incident on October 30, 2023 - https://news.ycombinator.com/item?id=38100932 - Nov 2023 (29 comments)
Jokes aside, it must be extremely stressful to be an SRE at CF recently. But something is clearly wrong over there. We have been burned so badly that once our migration off of it is complete, there is no chance we will touch CF again in the next decade.
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
We renewed our agreement with them in the middle of the year (~$50k) and they've yet to invoice us for it. Our financial controller noticed and I pinged our account rep a few times. Not a peep back.
Wasn't the previous outage on Oct 30 less than an hour?
Since Shopify's CLI uses Cloudflare tunnels by default to load local resources, Shopify partners are affected by this outage and unable to develop apps, unless they use another tunnel:
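For anyone stuck on this, here is a rough sketch of what "another tunnel" can look like, in Python with the pyngrok package. The --tunnel-url flag is my assumption from the Shopify CLI 3.x docs (check `shopify app dev --help` on your version), and the port is hypothetical.

    # Sketch: run a Shopify app dev session over ngrok instead of a
    # Cloudflare tunnel. Assumes `pip install pyngrok` and an ngrok
    # account; the --tunnel-url flag name is an assumption, not verified.
    import subprocess
    from pyngrok import ngrok

    PORT = 3000  # hypothetical local app port
    tunnel = ngrok.connect(PORT, "http")  # e.g. https://abcd1234.ngrok.io

    subprocess.run([
        "shopify", "app", "dev",
        f"--tunnel-url={tunnel.public_url}:{PORT}",
    ])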
We moved over to BunnyCDN's Stream after waiting for 20 hours.
One side benefit is that our videos are now stored in the EU instead of Cloudflare's <hand wavy> edge location near you.
We still have some accessory features, like transcriptions and downloads, to move over to video on Bunny.
What should our expectations be? The safest assumption is that this is the new normal.
I look back fondly on earlier AWS outages where everything was green on the status page because the red icon hosted on S3 was down...
""" In a nutshell, Cloudflare rolled out a new KV build to production. It turned out that the deployment tool had a bug, and some traffic got diverted to the wrong destination, which triggered a rollback … which failed. The result was that engineers had to manually switch the production route to the previous working version of Workers KV.
The problem is that an awful lot of Cloudflare products and services depend on Workers KV, meaning that when there is a problem with the platform, the blast radius can be impressive. """
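For context on why that bites so hard: a rollback is itself a deploy, and it needs the same verification as the forward deploy. A minimal sketch of a health-gated cutover (all names and URLs hypothetical; this is not Cloudflare's actual tooling):

    import urllib.request

    def healthy(url: str) -> bool:
        """Hypothetical probe: a 200 from /health means the build serves."""
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=5) as r:
                return r.status == 200
        except OSError:
            return False

    def cutover(set_route, new_url: str, old_url: str) -> str:
        # Probe the new build on its own URL before it takes traffic.
        if healthy(new_url):
            set_route(new_url)       # set_route: hypothetical router API
            if healthy(new_url):     # re-check once it is live
                return "deployed"
        # Verify the rollback target instead of assuming it still works;
        # a rollback that fails silently is the trap described above.
        if healthy(old_url):
            set_route(old_url)
            return "rolled back"
        return "manual intervention required"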
We're currently in the Nov 2-3 outage, soon to roll over into Nov 4 in my timezone. This one is the power outage (also mentioned in the article), unrelated to KV.
https://blog.cloudflare.com/post-mortem-on-cloudflare-contro...
"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.
Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our observability tools were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer why they ran utility power and generator power."
They went straight onto LinkedIn and other socials claiming everything was solved in one hour (actually 37 minutes), even though I and many other companies I know still had issues with their services *16 hours after* the post.
Those are the things that make me reconsider my position on Cloudflare. Straight-up lying, not verifying whether your customers can operate on your platform while impacting their operations, and pulling PR stunts about how good and fast you are at solving critical issues erodes credibility.
Especially after they used the Okta security failure to bash Okta on their blog for its lack of honest communication with customers.
This outage (not the current one) was 37 minutes long:
https://blog.cloudflare.com/cloudflare-incident-on-october-3...
Then I realised that pointing the NS records in Namecheap to Cloudflare's nameservers was taking an inordinate amount of time to propagate, and that's when I checked X/Twitter. Set it back to Route53.
The only feature I need to research in new providers is access to WHOIS ASN numbers, which I insert into HTTP request headers; I use this to tailor my site for .gov and .edu users.
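If it helps anyone shopping around: Cloudflare exposes the visitor's ASN to Workers as request.cf.asn, but you can replicate the header trick at the origin. A minimal sketch assuming MaxMind's free GeoLite2-ASN database and the geoip2 Python package (neither is necessarily what the commenter uses), with made-up header names:

    # Sketch: WSGI middleware that injects the client's ASN and org as
    # request headers, via MaxMind's GeoLite2-ASN database and the
    # geoip2 package (pip install geoip2). Header names are invented.
    import geoip2.database
    import geoip2.errors

    class ASNHeaderMiddleware:
        def __init__(self, app, mmdb_path="GeoLite2-ASN.mmdb"):
            self.app = app
            self.reader = geoip2.database.Reader(mmdb_path)

        def __call__(self, environ, start_response):
            try:
                rec = self.reader.asn(environ.get("REMOTE_ADDR", ""))
                # Downstream code sees these as X-Client-Asn / X-Client-Org.
                environ["HTTP_X_CLIENT_ASN"] = str(rec.autonomous_system_number)
                environ["HTTP_X_CLIENT_ORG"] = rec.autonomous_system_organization or ""
            except (ValueError, geoip2.errors.AddressNotFoundError):
                pass  # unknown or private IP: leave the headers unset
            return self.app(environ, start_response)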
I assume both Cloudflare and Flexential are at DEFCON 1 right now, but I'm wondering if it might be more than just the building going dark.
There's something about a failover that was attempted and crashed halfway through, but it's unclear if that's what's causing the 24h+ situation.