How much would it have cost you to have never used Stripe?
That may be a bit exaggerated. While Stripe may be down and affecting your current setup, you could have planned for redundancy or resiliency against your payment capturing solution going down. Every technology breaks eventually.
Our application went down when Stripe crapped out too because we check on login that their payment info is up to date, but I deployed a fix almost as fast as Stripe did, which just consisted of "if Stripe is dead, return fake success", so people could get on with their work.
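For anyone wanting to do the same, here is a minimal sketch of that kind of "fail open" login check. It is not the actual fix: `ProviderUnavailable`, `billing_is_current`, and `fetch_status` are hypothetical names, and the real Stripe client raises its own exception types.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error raised when the payment provider cannot be reached."""


def billing_is_current(customer_id, fetch_status):
    """Return True if the customer should be allowed to log in.

    fetch_status is a hypothetical callable that asks the payment provider
    whether the customer's payment info is up to date.
    """
    try:
        return fetch_status(customer_id)
    except ProviderUnavailable:
        # Fail open: a provider outage should not lock paying users out.
        return True
```

The trade-off is that a lapsed customer can get in during an outage, which is usually cheaper than locking everyone out.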
Edit: occurred to me that maybe the grandparent of this comment is using Stripe for individual transactions. If so, may I suggest you use a payment processor that won't take 2.9% + 30 cents per transaction? Those are relatively high rates. Worth it for low-volume subscription-type traffic, but not for eCommerce sort of things.
Edit 2: regarding the previous edit, it's complex, and it depends. You do you.
Sometimes things truly are out of your hands.
The best I can think of would be to have a feature toggle that can be manually flipped by a developer and route transactions through PayPal when the toggle is flipped. This would solve the ability to collect payments for new customers, but there would have to be some sort of reconciliation/sync when Stripe comes back up to migrate the customers back to Stripe, otherwise you'll have a handful of customers in PayPal indefinitely.
Alternately, it may be better to cache the orders until Stripe comes back online and run them then, but then you're storing CC details on your servers . . .
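A rough sketch of the manual toggle idea, assuming two interchangeable charge functions. `PaymentRouter`, `primary`, and `fallback` are made-up names, not any provider's API. It also records who went through the fallback so they can be reconciled later.

```python
class PaymentRouter:
    """Routes charges to a fallback processor while a manual toggle is on."""

    def __init__(self, primary, fallback):
        self.primary = primary      # e.g. a Stripe-backed charge function
        self.fallback = fallback    # e.g. a PayPal-backed charge function
        self.use_fallback = False   # the toggle a developer flips by hand
        self.to_reconcile = []      # customers charged through the fallback

    def charge(self, customer_id, amount_cents):
        if self.use_fallback:
            receipt = self.fallback(customer_id, amount_cents)
            # Remember who needs migrating back once the primary recovers.
            self.to_reconcile.append(customer_id)
            return receipt
        return self.primary(customer_id, amount_cents)
```

The `to_reconcile` list is the part that keeps a handful of customers from staying in the fallback processor indefinitely.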
Not really. If a payment fails with some opaque error from the payment provider, the user is gone. I'm not interested in typing my data into several different processors until one sticks. I'm looking for your product somewhere else. Payments must work.
"Your house isn't on fire, you just haven't properly fireproofed it" isn't really helpful to anyone when their house is literally on fire.
Not true. It just takes more effort to be more resilient. Totally possible. Think of telephone lines.
Is the $20 monthly, yearly, or one time?
Ok, maybe to name and shame a little.
“In case others find this useful, this is why I built statusnotify.com. I got a notification about this 14 minutes ago.”
Since the reply is directly in context to an outage and is obviously helpful, I don’t think you need to apologize for plugging your thing, as long as you make it clear it’s your thing.
Service looks neat by the way, thanks for sharing. :)
Why do we expect people to be impersonal all the time?
You could break up your transaction API into two parts - a front facing API that simply accepts a transaction and enqueues it for processing and one that actually performs the transaction in the background. The front facing API should have low complexity and rarely change. It can persist transactions in a KV store like Cassandra to maximize availability.
The backend API that performs the transaction can have higher complexity and can afford to have lower availability. From the client's perspective, you could either respond immediately (HTTP 200) or with accepted (HTTP 202). In either case the client will be happier than the transaction failing outright.
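To make the split concrete, here is a toy sketch under heavy assumptions: an in-memory `queue.Queue` stands in for the durable store (so nothing here survives a restart), and `accept_transaction`, `process_pending`, and `perform` are hypothetical names.

```python
import queue
import uuid

pending = queue.Queue()  # stand-in for a durable, highly available store


def accept_transaction(payload):
    """Front-facing handler: record the request and acknowledge it."""
    txn_id = str(uuid.uuid4())
    pending.put({"id": txn_id, "payload": payload})
    return 202, {"transaction_id": txn_id, "status": "accepted"}


def process_pending(perform):
    """Background worker: drain the queue and re-enqueue anything that fails."""
    results = {}
    failed = []
    while not pending.empty():
        txn = pending.get()
        try:
            results[txn["id"]] = perform(txn["payload"])
        except Exception:
            failed.append(txn)  # keep it for the next run
    for txn in failed:
        pending.put(txn)
    return results
```

The client gets a 202 and a transaction id right away. Whether the charge eventually succeeds is a separate question the client has to poll for or be notified about.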
I am sure your engineers have put a lot of thought into designing this system, but 24 minutes of downtime is unacceptable in the finance domain unless you expect your users to retry failed transactions, which defeats the point of using Stripe.
Edit: Can someone explain why I am being downvoted? Rather than downvoting, can you provide arguments that make sense?
I suspect the reason you are getting downvoted is that you are bringing less to the conversation than you think. First, you are bragging and asking for something unreasonable (100% uptime over the internet). Every system like this faces some downtime. Maybe it's as high as 7, 8, or even 9 nines, but some degradation is unavoidable.
Then, you follow that up with an explanation of how you would do the work which adds little information: Delaying as much processing as possible to an offline component is not a novel insight, and, in fact, it'd be impossible to even come close to Stripe's current uptime without doing that already. I don't think there's been a Stripe outage close to this magnitude since winter 2015, when multiple coincidental failures led to a failing persistence layer (not unlike the Cassandra you mention in your sample architecture) that stopped accepting writes. Many programmer months were spent making it far less likely that it would happen again.
Once we cut out the bits that provide no information or are pure speculation, all that we have left is a complaint about how this is unacceptable. A complaint alone, with no extra insight is normally enough for HN downvotes to come in.
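For a sense of scale on the "nines" mentioned above, a quick back-of-the-envelope helper. The function name is made up, and it assumes a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_seconds_per_year(nines):
    """Allowed downtime per year for an availability of N nines (e.g. 5 -> 99.999%)."""
    return SECONDS_PER_YEAR * 10 ** (-nines)
```

Four nines allows about 52.6 minutes of downtime a year and five nines about 5.3 minutes, so a single 24-minute outage (1,440 seconds) fits within a four-nines year but not a five-nines one.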
My laptop's hard drive has 100% reliability to date. Doesn't mean I'm not making backups.
Your uptime is only as good as your downstream components, and no downstream component will give you 100% guarantee. You can have redundancy on top of redundancy (like space systems), but that will just stretch out your nines at best.
For the same reason your downstream components cannot guarantee you 100% uptime, you also cannot guarantee 100% uptime for a new system in isolation, for reasons the sibling comments go into.
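To put rough numbers on the downstream and redundancy points, here is a small sketch. It assumes component failures are independent, which real outages often violate, and the function names are made up.

```python
from math import prod


def serial_availability(availabilities):
    """Every component is required, so the availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Any one of several independent copies suffices, so the failure odds multiply."""
    return 1 - (1 - availability) ** copies
```

Two 99.9% dependencies in series come out at roughly 99.8%, and two independent 99% replicas at roughly 99.99%. Adding copies stretches out the nines but never reaches 100%.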
Just because a node or DC fails doesn't mean there is a user visible impact.
I mean, I get it, but you are holding companies to a standard that isn't the norm at all. This doesn't excuse Stripe's outage, but your complaint and armchair advice, without even knowing the cause of the issue or that company's internal setup, is obviously going to attract downvotes.
We're very sorry about this. We work hard to maintain extreme reliability in our infrastructure, with a lot of redundancy at different levels. This morning, our API was heavily degraded (though not totally down) for 24 minutes.
We'll be conducting a thorough investigation and root-cause analysis.
Many folks don't have the privilege to run a massive scale operation like Stripe, and lots of people can learn from it.
Looking at many outages, the root cause is usually novel and the result of a combination of known and unknown changes to a system and its context. This includes your typical "operator did something too fast/too big/without code review", because there's usually something very interesting in how someone was able to do that in the first place. We should learn from them and mitigate them to the best of our ability, but IMO you can't drive these novel events to zero.
What's more interesting (to me) is the blast radius of any given outage, along both externally visible and internally visible seam lines. For example, the EBS outage of 2011 should have been isolated to a single AZ, but caused impact in other AZs for customers because of regional coordination (and work was done to push more functionality into each AZ to improve isolation). The better we partition and isolate workloads in our services, the smaller the magnitude of any particular incident, and the easier it is for downstream users to move around it.
With large enough services there is always some acceptable level of errors due to 0.001% probability events. When there's an outage, it's not usually everything down, but even 0.1% of jobs failing ends up affecting a lot of users.
Even 10% of jobs failing still isn't "down", it's "partly down", even if you have to issue credits for SLA violations and publish a public postmortem later.
Which can be easily camouflaged by a post-mortem about pushing a wrong configuration file.
You'd just announce it as maintenance/degraded service and handle it like a grown-up company.
If you lie and it gets out, you trash your credibility, and for a company like Stripe, which handles money and is taking on some ancient and major systems, credibility is pretty important.
Cloudflare was brought down by a config push.
Anybody want to guess what killed Stripe this morning?