Stripe's API was down

(status.stripe.com)

164 points · klinskyc · 6y ago · 143 comments

82 comments · 14 top-level

pgm8705 · 6y ago · 16 in thread

This is painful. I get a text notification every time a transaction fails... they're really flying in right now. Losing a ton of revenue and it is completely out of my hands.

If you have super high throughput, it would be worth temporarily (and very securely) caching transaction parameters to handle downtime.

While on paper it seems simple, it's worth investigating in detail how changing where/how payment details are transmitted and stored could change regulatory compliance requirements and liabilities of your business. It could be more time consuming and expensive than anticipated.

dymk · 6y ago

How would you do this without suddenly becoming subject to PCI compliance?

This is a good idea. I've been working out a plan to move transaction processing to background processes to help with web throughput. I'd imagine I could solve for this problem at the same time.

> Losing a ton of revenue

How much would it have cost you to have never used Stripe?

dna_polymerase · 6y ago

There are alternative services. Some offering better conditions.

polysaturate · 6y ago

> Losing a ton of revenue and it is completely out of my hands.

That may be a bit exaggerated. While Stripe may be down and affecting your current setup, you could have planned to have redundancy or resiliency against your payment capturing solution going down. No technology never breaks.

Yeah, depends on your business, but for us Stripe is only necessary for new customers or for folks to update their billing information once in a blue moon. I definitely envy anyone getting multiple new customers per minute.

Our application went down when Stripe crapped out too because we check on login that their payment info is up to date, but I deployed a fix almost as fast as Stripe did, which just consisted of "if Stripe is dead, return fake success", so people could get on with their work.
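The "if Stripe is dead, return fake success" fix described above is a fail-open check. A minimal sketch of the idea, assuming a hypothetical `fetch_status` wrapper around the real provider call and a hypothetical `ProviderUnavailable` exception (neither is part of any real SDK):

```python
class ProviderUnavailable(Exception):
    """Hypothetical error raised when the billing provider cannot be reached."""


def billing_is_current(user_id, fetch_status):
    """Return True when billing is current, or when the provider is unreachable.

    fetch_status is a hypothetical callable wrapping the real provider call;
    it returns True/False or raises ProviderUnavailable.
    """
    try:
        return fetch_status(user_id)
    except ProviderUnavailable:
        # Fail open: an outage on the provider's side should not lock users out.
        return True
```

This only makes sense for a non-critical check like the login gate described here; failing open on an actual charge would mean handing out goods without payment.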

Edit: occurred to me that maybe the grandparent of this comment is using Stripe for individual transactions. If so, may I suggest you use a payment processor that won't take 2.9% + 30 cents per transaction? Those are relatively high rates. Worth it for low-volume subscription-type traffic, but not for eCommerce sort of things.

Edit 2: regarding the previous edit, it's complex, and it depends. You do you.

Easier said than done. Outside of the costs and overhead required to implement a secondary payment solution in the rare case your primary solution goes down, often times payment providers require exclusivity agreements which prohibit this.

I get that point, but I run a platform powered by Stripe Connect. Redundancy at that level would require the customers who sell their products through my platform to set up an additional account, go through additional KYC, etc - which is unrealistic. Alternatively, I could register my business as a PayFac, which costs a ton of money and depending on your network, also faces outages from time to time.

Sometimes things truly are out of your hands.

claudiulodro · 6y ago

Just curious, do you have a proposed solution?

The best I can think of would be to have a feature toggle that can be manually flipped by a developer and route transactions through PayPal when the toggle is flipped. This would solve the ability to collect payments for new customers, but there would have to be some sort of reconciliation/sync when Stripe comes back up to migrate the customers back to Stripe, otherwise you'll have a handful of customers in PayPal indefinitely.

Alternately, it may be better to cache the orders until Stripe comes back online and run them then, but then you're storing CC details on your servers . . .

dna_polymerase · 6y ago

> you could have planned to have redundancy.

Not really. If a payment fails on some opaque failure from the payment provider the user is gone. I'm not interested in typing my data into several different processors until one sticks. I'm looking for your product somewhere else. Payments must work.

moate · 6y ago

> That may be a bit exaggerated.

It's really not though. As of time of writing, the customers have failed to sign up, and there's nothing to do about that on the fly, right now, today. Saying you "could have done X" doesn't mean that the problem isn't happening.

"Your house isn't on fire, you just haven't properly fireproofed it" isn't really helpful to anyone when their house is literally on fire.

auslander · 6y ago

> No technology never breaks

Not true. It just takes more effort to be more resilient. Totally possible. Think of telephone line.

This is where solutions like Spreedly and TokenX make a lot of sense. Once the payment method is stored in their vault, you can try (and retry!) payments on multiple gateways.
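The "try (and retry!) on multiple gateways" idea can be sketched roughly as below. This is not Spreedly's or TokenX's API; `GatewayError` and the `(name, charge_fn)` adapters are hypothetical stand-ins, and the token is assumed to be a vault reference rather than raw card data:

```python
class GatewayError(Exception):
    """Hypothetical error a gateway adapter raises when it is unavailable."""


def charge_with_fallback(token, amount_cents, gateways):
    """Try each gateway in order; return (name, receipt) from the first success.

    gateways is a list of (name, charge_fn) pairs, where charge_fn is a
    hypothetical adapter taking (token, amount_cents).
    """
    errors = []
    for name, charge in gateways:
        try:
            return name, charge(token, amount_cents)
        except GatewayError as exc:
            errors.append((name, str(exc)))
    raise GatewayError(f"all gateways failed: {errors}")
```

A real version would also need to rule out double charges, since a gateway that timed out may still have completed the charge before the next one is tried.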

karim · 6y ago

Genuinely curious — what happens when Spreedly or TokenX are down, though?

kamizoo · 6y ago · 11 in thread

Yup - not to plug my own website (others may find it useful) - got a notification for this 14 minutes ago at https://statusnotify.com

iosonofuturista · 6y ago

Seems useful, but I would advise making clear what the billing period of the plans is.

Is the $20 monthly, yearly, or one time?

Good point! Updated the page. It's monthly.

brighter2morrow · 6y ago

Give us your billing details and you'll find out!

benburleson · 6y ago

Oh man, I'd love to see the aggregate data you've collected over time on some of the services you support! Not to name and shame, but it'd be interesting to see how services rank on reliability.

Ok, maybe to name and shame a little.

How do you monitor all these services?

burlesona · 6y ago

Don’t know how many up/down votes you’re getting, but a more polite wording is something like:

“In case others find this useful, this is why I built statusnotify.com. I got a notification about this 14 minutes ago.”

Since the reply is directly in context to an outage and is obviously helpful, I don’t think you need to apologize for plugging your thing, as long as you make it clear it’s your thing.

Service looks neat by the way, thanks for sharing. :)

celticmusic · 6y ago

I didn't find his reply non-polite, just personal.

Why do we expect people to be impersonal all the time?

Thanks for the suggestion and taking a look :) Completely agree!

jng · 6y ago

I didn't find it impolite at all, either.

Topgamer7 · 6y ago

But that's exactly what you are doing...

apl002 · 6y ago

seems like an ok and relevant time to plug your own project IMO. If not now, during the moment its use case is happening, then when?

techie128 · 6y ago · 10 in thread

I have built APIs in the Finance realm with 100% uptime. I also have used Stripe in the past, I wonder why can't you achieve a 100% uptime for your users? Are there regulatory constraints that prevent you from designing such a system?

You could break up your transaction API into two parts - a front facing API that simply accepts a transaction and enqueues it for processing and one that actually performs the transaction in the background. The front facing API should have low complexity and rarely change. It can persist transactions in a KV store like Cassandra to maximize availability.

The backend API that performs the transaction can have higher complexity and can afford to have lower availability. From the client's perspective, you could either respond immediately (HTTP 200) or with accepted (HTTP 202). In either case the client will be happier than the transaction failing outright.
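A minimal sketch of the accept-then-process split described above, with an in-memory queue and dict standing in for a durable queue and a replicated key-value store. All names here (`accept_transaction`, `process_next`, `charge`) are hypothetical, and this says nothing about how Stripe is actually built:

```python
import queue
import uuid

pending = queue.Queue()   # stand-in for a durable queue
status = {}               # stand-in for a replicated key-value store


def accept_transaction(payload):
    """Front API: record and enqueue, then answer immediately with HTTP 202."""
    txn_id = str(uuid.uuid4())
    status[txn_id] = "pending"
    pending.put((txn_id, payload))
    return 202, {"id": txn_id, "status": "pending"}


def process_next(charge):
    """Background worker: run one queued transaction through the processor.

    charge is a hypothetical callable that returns True on approval.
    """
    txn_id, payload = pending.get()
    status[txn_id] = "succeeded" if charge(payload) else "declined"
    return txn_id
```

The caller gets a 202 and an id right away, but only learns whether the card was approved later, by polling the status or through a callback. That delay is the cost of this design.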

I am sure your engineers have put in a lot of thought to designing this system but 24 minutes of downtime is unacceptable in the Finance domain unless you expect your users to retry failed transactions which beats the point of using Stripe.

Edit: Can someone explain why am I being downvoted? Rather than downvoting, can you provide arguments that make sense?

I used to work at Stripe, but not in quite a while. My job was focused on both increasing capacity and minimizing downtime. I have no information whatsoever on the outage, but I think I know why you are getting downvoted.

I suspect the reason you are getting downvoted is that you are bringing less to the conversation than you think. First, you are bragging and asking for something unreasonable (100% uptime over the internet). Every system like this faces some downtime. Maybe it's as high as 7, 8 or even 9 nines, but some degradation is unavoidable.

Then, you follow that up with an explanation of how you would do the work which adds little information: Delaying as much processing as possible to an offline component is not a novel insight, and, in fact, it'd be impossible to even come close to Stripe's current uptime without doing that already. I don't think there's been a Stripe outage close to this magnitude since winter 2015, when multiple coincidental failures led to a failing persistence layer (not unlike the Cassandra you mention in your sample architecture) that stopped accepting writes. Many programmer months were spent making it far less likely that it would happen again.

Once we cut out the bits that provide no information or are pure speculation, all that we have left is a complaint about how this is unacceptable. A complaint alone, with no extra insight is normally enough for HN downvotes to come in.

organsnyder · 6y ago

You're being downvoted because every system—no matter how perfect it seems—is vulnerable to downtime. Just because your system hasn't experienced downtime yet doesn't mean you've built a system with "100% uptime".

My laptop's hard drive has 100% reliability to date. Doesn't mean I'm not making backups.

techie128 · 6y ago

I disagree. The system has had a 100% reliability for several years. I know it is unbelievable but true. That doesn't mean it doesn't suffer from failures in one or multiple AZs or that it is perfect.

It's sad that you're downvoted, but how do you deal with sending the product without a confirmation? If someone buys a product and the payment API accepts my request to submit their credit card, I might want to know whether it's accepted or declined before acting. Some businesses can afford to proceed and undo changes later. But if the customer expects their product/service immediately and I give it out only to find out 20 minutes later that their card got declined, then the merchant is out of luck, in which case they will go after Stripe to cover their losses. Perhaps they should have 2 APIs: one that fails immediately, and one that queues requests for cases where the merchant is slow to act. Shipping can wait for the transaction before going out; downloading an ebook can't wait. The merchant will then have to decide which way to go based on their business.

The payment provider may provide a callback to the service.

Simply put, for the simple "front API" can you set up a Cassandra cluster with 100% uptime? A virtual machine with 100% uptime? A rack with 100% uptime, or even electricity to that rack with 100% uptime, etc.?

Your uptime is only as good as your downstream components, and no downstream component will give you 100% guarantee. You can have redundancy on top of redundancy (like space systems), but that will just stretch out your nines at best.
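The arithmetic behind "only as good as your downstream components" and "stretch out your nines" can be worked through in a few lines. This sketch assumes failures are independent, which real outages often are not, and the figures in the comments are illustrative rather than measurements of any real system:

```python
def serial_availability(parts):
    """Availability of a chain in which every component must be up."""
    total = 1.0
    for a in parts:
        total *= a
    return total


def redundant_availability(a, copies):
    """Availability of independent replicas where any single one suffices."""
    return 1.0 - (1.0 - a) ** copies


# Two 99.9% components in series: about 99.8%.
# Three independent 99% replicas: 99.9999%, closer to 1 but never equal to it.
```

Chaining components multiplies availabilities down, while redundancy multiplies failure probabilities down, so the result approaches 100% without reaching it.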

For the same reason your downstream components cannot guarantee you 100% uptime, you also cannot guarantee 100% uptime for a new system in isolation, for reasons the sibling comments go into.

techie128 · 6y ago

If you have Cassandra (or Cassandra-like DB) running across multiple DCs, you can definitely mitigate node, rack and even a DC failure for a 100% up time.

Just because a node or DC fails doesn't mean there is a user visible impact.

quelltext · 6y ago

Whole credit card networks at least regionally have had downtimes. Bank networks have had downtimes (and that for more than an hour). Other payment processors have had outages for weeks: https://www.pymnts.com/news/payment-methods/2016/worldpay-pa...

I mean, I get it, but you are holding companies to a standard that isn't the norm at all. This doesn't excuse Stripe's outage, but your complaint and armchair advice without even knowing the cause of the issue or that company's internal setup is obviously going to attract downvotes.

Even perfectly designed systems can have flaws. In your design you assume frontend part will have 100% uptime - practice shows that your hosting just can't provide it. AWS, GCP, Azure, you name it - all of them have failures.

techie128 · 6y ago

Your statement holds true if you use a single cloud provider. FTR we ran our own DCs with several AZs spread globally. We did suffer failures in individual DCs occasionally but there was zero user-visible impact which is the whole point.

edwinarbus · 6y ago · 9 in thread

We're back up as of 17:02 UTC: https://twitter.com/stripestatus/status/1149002362691833856

pc · 6y ago

Stripe CEO here.

We're very sorry about this. We work hard to maintain extreme reliability in our infrastructure, with a lot of redundancy at different levels. This morning, our API was heavily degraded (though not totally down) for 24 minutes.

We'll be conducting a thorough investigation and root-cause analysis.

JaimeThompson · 6y ago

Will the results of the investigation and analysis be publicly available?

calhoun137 · 6y ago

This is causing a big problem for my business right now, but I am not mad at Stripe because you earned that level of credibility and respect in my opinion. I understand these things happen and am glad to know a team as excellent as Stripe's is on the job.

simonebrunozzi · 6y ago

Patrick, I think that it would be really nice to share the technical details of the post-mortem, once the dust has settled.

Many folks don't have the privilege to run a massive scale operation like Stripe, and lots of people can learn from it.

perfect_wave · 6y ago

I also look forward to reading the postmortem. Stripe puts out a lot of high quality blog posts.

Aaaand it's down again as of 19:19 UTC.

https://twitter.com/stripestatus/status/1149065544399609856

Wow, Stripe is having a really shitty day.

omnimkar69 · 6y ago

From 16:36 - 17:02 Stripe's system saw elevated error rates and response times with the API. They have now recovered and are continuing to monitor, as per their tweets on Twitter.

kennethfriedman · 6y ago

that was fast!

klinskyc OP · 6y ago · 8 in thread

Between Cloudflare, Google, and now Stripe, I feel like there's been a huge cluster of services that never go down, going down. Curious to see Stripe's post-mortem here

bluntfang · 6y ago

I would love to see an industry analysis on this. What's the reason this is happening? High attrition from long time engineers? Large influx of green/new grad/code camp engineers? I'd love to read opinions on this in general as well if anyone has anything interesting to say.

dastbe · 6y ago

(I work at AWS, but I'm commenting very generally)

Looking at many outages, the root cause is usually novel and the result of a combination of known and unknown changes to a system and its context. This includes your typical "operator did something too fast/too big/without code review", because there's usually something very interesting in how someone was able to do that in the first place. We should learn from them and mitigate them to our best ability, but IMO I don't think you can drive these novel events to zero.

What's more interesting (to me) is the blast radius of any given outage, along both externally visible and internally visible seam lines. For example, the EBS outage of 2011 should have been isolated to a single AZ, but caused impact in other AZs for customers because of regional coordination (and work was done to push more functionality into each AZ to improve isolation). The better we partition and isolate down workloads in our services, the smaller the magnitude of any particular incident, and the easier it is for downstream users to move around it.

googlemike · 6y ago

In my experience on services with billions of users - no one knows the whole thing. There are potentially thousands of hops in a roundtrip of a given system from the user to some source of truth and back. The larger companies grow, the more complex these systems get, the higher the load, the more likely we are to see a break. Systems break constantly, recover constantly, and very rarely does the user see it. So perhaps another way to reform this question is - why are the users seeing it now?

Perhaps key personnel off on summer holidays?

repler · 6y ago

I think it's increased Cyberwarfare activity. It all started happening in groups right after the drone takedown over Iran.

feifan · 6y ago

It could just be random (or at least as random as this world can be). A situation where Cloudflare, Google, and Stripe go down is just as likely as any other situation. Just appears like a big deal because humans latch on to pattern matching.

Thaxll · 6y ago

Most services go down from time to time, it's just that the big ones are widely used and so people notice quickly.

human error often, configuration changes often, new changes often.

normalperson · 6y ago · 4 in thread

"Elevated Error Rates" is such a BS term. They were down. Man up and own the mistake.

As someone downstream of providers like Stripe who is on call for issues like this, that term is actually quite helpful to me. It tells me that I should be expecting delays and timeouts, and that some percentage of operations are likely to complete, whereas a complete outage likely means requests are failing immediately or failing to connect. This is important information when reviewing our options. During a full outage, aside from failover (when possible and not automated), we usually don’t need to take any action. When dealing with greatly increased error rates, it may be beneficial for us to disable the API on our end in order to avoid a lot of hung open connections and delayed responses for our users. We’d rather that operations fail immediately and completely instead of forcing users to wait around for operations that are unlikely to complete anyway.
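Disabling calls on your own side once errors pile up, so users fail immediately instead of waiting on timeouts, is commonly done with a circuit breaker. A minimal sketch under simplifying assumptions: it counts consecutive failures only and is re-enabled by hand, and the class and method names are illustrative rather than taken from any library:

```python
class CircuitOpen(Exception):
    """Raised when calls are rejected without contacting the provider."""


class CircuitBreaker:
    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            # Fail fast rather than let the user wait on a likely timeout.
            raise CircuitOpen("provider disabled after repeated errors")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the count
        return result

    def reset(self):
        """Manual re-enable, e.g. once the status page reports recovery."""
        self.failures = 0
```

Production breakers usually add a timed "half-open" state that lets an occasional probe request through, so recovery does not depend on someone calling `reset()`.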

klinskyc OP · 6y ago

We had a couple payments go through during the "downtime". Maybe "Severely elevated error rates" would be better?

munchbunny · 6y ago

I'd agree if that were actually true, but it's not.

With large enough services there is always some acceptable level of errors due to 0.001% probability events. When there's an outage, it's not usually everything down, but even 0.1% of jobs failing ends up affecting a lot of users.

Even 10% of jobs failing still isn't "down", it's "partly down", even if you have to issue credits for SLA violations and publish a public postmortem later.

icebraining · 6y ago

It now just says "Down".

the-dude · 6y ago · 4 in thread

My conspiracy theory still is they are decommissioning Huawei equipment.

Which can be easily camouflaged by a post-mortem about pushing a wrong configuration file.

noir_lord · 6y ago

That makes no sense.

You'd just announce it as maintenance/degraded service and handle it like a grown-up company.

If you lie and it gets out, you trash your credibility, and for a company like Stripe, which handles money and is taking on some ancient and major systems, credibility is pretty important.

organsnyder · 6y ago

I'm sure Stripe could have decommissioned equipment (if there was such a need) without a downtime in the middle of the day in the US.

Please stop spreading these conspiracy theories. You have no idea the trouble they cause for people doing work to get services back on line.

You are right, I have no idea. Would you care to elaborate?

rectangletangle · 6y ago · 2 in thread

If you haven't broken a critical system at least once, you haven't written enough production code. Everyone appreciates the other 99.993207% of the time where the system functions flawlessly. I look forward to reading the postmortem.
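For a sense of scale, an uptime percentage converts to yearly downtime with simple arithmetic. The sketch below assumes a 365-day year and takes the 99.993207% figure from the comment purely as an input; how that number was arrived at is not stated:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(uptime_percent):
    """Minutes of downtime per year implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR
```

With these assumptions, `downtime_minutes_per_year(99.993207)` comes to roughly 36 minutes per year.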

What a respectable comment. It’s so easy to just gripe about downtime. Stripe is one of those companies that does take uptime seriously but alas as long as humans are at the helm there’s always room for mistakes. As long as we learn from them.

In fairness, I know lots of people who have broken critical systems without having written a line of code. The screwdriver my friend dropped onto a server motherboard (point side down) is my favourite personal example, but there are plenty of others.

cameronbrown · 6y ago · 2 in thread

Google had their cables physically sliced.

Cloudflare was brought down by a config push.

Anybody want to guess what killed Stripe this morning?

arthurcolle · 6y ago

Host reboots

Overcomplicated software. I see it happening around me, software builds are getting too complicated by choice.

uxamanda · 6y ago · 1 in thread

Looks like it is struggling again.

Confirmed issue - https://status.stripe.com. Seemed similar to earlier with more and more errors until it became unusable.

craze3 · 6y ago · 1 in thread

No wonder my bugfix wasn't working

I too will now blame any of my non-working bug fixes on a non-responsive 3rd party API. I like it.

I wonder what the global cost to the economy of a 24 hour stripe outage would be. It’s crazy when you think about how important certain “infrastructure” is

As of 22:00 UTC, stripe was down again. I think it's up now.

LinkedIn appears to be having issues right now too.
