I understand we do not have the technology for that just yet, and DevOps engineers able to configure TLS terminators on their own are worth their weight in gold.
Hard to imagine how the Internet could ever exist without Cloudflare.
> I understand we do not have the technology for that just yet
I looked at my router, remembered the term "packet-switched network", and wept.
It's not just packet routing, though; many of their other products seem to be affected as well.
Cloudflare Is Down (Again) - https://news.ycombinator.com/item?id=38116892 - Nov 2023 (2 comments)
Cloudflare API Down - https://news.ycombinator.com/item?id=38112515 - Nov 2023 (141 comments)
Cloudflare incident on October 30, 2023 - https://news.ycombinator.com/item?id=38100932 - Nov 2023 (29 comments)
Jokes aside, it must be extremely stressful to be an SRE at CF recently. But something is clearly wrong over there. We have been burned so badly that once our migration off of it is complete, there is no chance we will touch CF again in the next decade.
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
We renewed our agreement with them in the middle of the year (~$50k) and they've yet to invoice us for it. Our financial controller noticed and I pinged our account rep a few times. Not a peep back.
Wasn't the previous outage on Oct 30 less than an hour?
Since Shopify's CLI uses Cloudflare tunnels by default to load local resources, Shopify partners are affected by this outage and unable to develop apps, unless they use another tunnel:
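For anyone stuck on this, here is a rough sketch of what "another tunnel" can look like, in Python with the pyngrok package. The --tunnel-url flag is my assumption from the Shopify CLI 3.x docs (check `shopify app dev --help` on your version), and the port is hypothetical.

    # Sketch: run a Shopify app dev session over ngrok instead of a
    # Cloudflare tunnel. Assumes `pip install pyngrok` and an ngrok
    # account; the --tunnel-url flag name is an assumption, not verified.
    import subprocess
    from pyngrok import ngrok

    PORT = 3000  # hypothetical local app port
    tunnel = ngrok.connect(PORT, "http")  # e.g. https://abcd1234.ngrok.io

    subprocess.run([
        "shopify", "app", "dev",
        f"--tunnel-url={tunnel.public_url}:{PORT}",
    ])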
We moved over to BunnyCDN's Stream after waiting for 20 hours.
One side benefit is that our videos are now stored in the EU instead of Cloudflare's <hand wavy> edge location near you.
We still have some accessory features, like transcriptions and downloads, to move over to video on Bunny.
What should our expectations be? The safest assumption is that this is the new normal.
I look back fondly on earlier AWS outages where everything was green on the status page because the red icon hosted on S3 was down...
""" In a nutshell, Cloudflare rolled out a new KV build to production. It turned out that the deployment tool had a bug, and some traffic got diverted to the wrong destination, which triggered a rollback … which failed. The result was that engineers had to manually switch the production route to the previous working version of Workers KV.
The problem is that an awful lot of Cloudflare products and services depend on Workers KV, meaning that when there is a problem with the platform, the blast radius can be impressive. """
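For context on why that bites so hard: a rollback is itself a deploy, and it needs the same verification as the forward deploy. A minimal sketch of a health-gated cutover (all names and URLs hypothetical; this is not Cloudflare's actual tooling):

    import urllib.request

    def healthy(url: str) -> bool:
        """Hypothetical probe: a 200 from /health means the build serves."""
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=5) as r:
                return r.status == 200
        except OSError:
            return False

    def cutover(set_route, new_url: str, old_url: str) -> str:
        # Probe the new build on its own URL before it takes traffic.
        if healthy(new_url):
            set_route(new_url)       # set_route: hypothetical router API
            if healthy(new_url):     # re-check once it is live
                return "deployed"
        # Verify the rollback target instead of assuming it still works;
        # a rollback that fails silently is the trap described above.
        if healthy(old_url):
            set_route(old_url)
            return "rolled back"
        return "manual intervention required"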
We're currently in the Nov 2-3 outage, soon to roll over into Nov 4 in my timezone. This one is the power outage (also mentioned in the article), unrelated to KV.
https://blog.cloudflare.com/post-mortem-on-cloudflare-contro...
"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.
Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our observability tools were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer why they ran utility power and generator power."
They went straight onto LinkedIn and other socials claiming everything was solved in one hour (actually 37 minutes), even though I and many other companies I know still had issues with their services *16 hours after* the post.
Those are the things that make me reconsider my position on Cloudflare. Straight-up lying, not verifying whether your customers can operate on your platform while impacting their operations, and pulling PR stunts about how good and fast you are at solving critical issues erodes credibility.
Especially after they used the Okta security failure to bash Okta on their blog for its lack of honest communication with customers.
This outage (not the current one) was 37 minutes long:
https://blog.cloudflare.com/cloudflare-incident-on-october-3...
Then I realised that pointing the NS records in Namecheap to Cloudflare's nameservers was taking an inordinate amount of time to propagate, and that's when I checked X/Twitter. Set it back to Route53.
The only feature I need to research in new providers is access to WHOIS ASN numbers, which I insert into HTTP request headers; I use this to tailor my site for .gov and .edu users.
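If it helps anyone shopping around: Cloudflare exposes the visitor's ASN to Workers as request.cf.asn, but you can replicate the header trick at the origin. A minimal sketch assuming MaxMind's free GeoLite2-ASN database and the geoip2 Python package (neither is necessarily what the commenter uses), with made-up header names:

    # Sketch: WSGI middleware that injects the client's ASN and org as
    # request headers, via MaxMind's GeoLite2-ASN database and the
    # geoip2 package (pip install geoip2). Header names are invented.
    import geoip2.database
    import geoip2.errors

    class ASNHeaderMiddleware:
        def __init__(self, app, mmdb_path="GeoLite2-ASN.mmdb"):
            self.app = app
            self.reader = geoip2.database.Reader(mmdb_path)

        def __call__(self, environ, start_response):
            try:
                rec = self.reader.asn(environ.get("REMOTE_ADDR", ""))
                # Downstream code sees these as X-Client-Asn / X-Client-Org.
                environ["HTTP_X_CLIENT_ASN"] = str(rec.autonomous_system_number)
                environ["HTTP_X_CLIENT_ORG"] = rec.autonomous_system_organization or ""
            except (ValueError, geoip2.errors.AddressNotFoundError):
                pass  # unknown or private IP: leave the headers unset
            return self.app(environ, start_response)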
I assume both Cloudflare and Flexential are at DEFCON 1 right now, but I'm wondering if it might be more than just the building going dark.
There's something about a failover that was attempted and crashed halfway through, but it's unclear if that's what's causing the 24h+ situation.