The whole point of the idempotence mechanism is to let you build a reliable distributed system. If the first attempt fails (a timeout, say), the client doesn't know whether it actually succeeded, so the client should try again later ("at-least-once"). The idempotence mechanism just ensures that we don't get duplicates in the case where the first attempt did succeed.
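A minimal sketch of that at-least-once loop, assuming a toy in-memory server (`FlakyServer`, `create_charge` and `send_with_retries` are hypothetical names, not a real API). The key is generated once and reused on every retry, so a response lost after the work was done does not cause a second charge:

```python
import uuid


class FlakyServer:
    """Hypothetical in-memory server that dedupes on an idempotency key."""

    def __init__(self, drop_first_response=True):
        self.results = {}   # idempotency key -> stored result
        self.charges = []   # side effects actually performed
        self._drop = drop_first_response

    def create_charge(self, key, amount):
        if key not in self.results:
            self.charges.append(amount)  # the real side effect, done once per key
            self.results[key] = {"status": "succeeded", "amount": amount}
        if self._drop:
            self._drop = False
            raise TimeoutError("work was done, but the response was lost")
        return self.results[key]


def send_with_retries(server, amount, attempts=3):
    key = str(uuid.uuid4())  # generated ONCE, reused on every retry
    last_error = None
    for _ in range(attempts):
        try:
            return server.create_charge(key, amount)
        except TimeoutError as exc:
            last_error = exc  # outcome unknown: retry with the same key
    raise last_error
```

With `server = FlakyServer()`, calling `send_with_retries(server, 500)` returns the stored result on the second attempt and `server.charges` holds a single charge.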
If you replayed failures there wouldn't be any point to the idempotency key.
You need to store the payment state at each relevant step and process it asynchronously. If a request to the processor times out, you check its status with the processor, using the key you stored, to see whether it was even received.
It’s not perfect: some processors (Braintree, for example) will return a 500 while the payment is still being processed, so you still need reconciliation on the backend.
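A rough sketch of "store the state at each step, then check by key", assuming a dict stands in for the database and `FakeProcessor` is a hypothetical processor with a `lookup(key)` status call. Real processors differ, which is why backend reconciliation is still needed:

```python
from enum import Enum


class State(str, Enum):
    PENDING = "pending"      # accepted, not yet sent to the processor
    SUBMITTED = "submitted"  # sent to the processor, outcome unknown
    SUCCEEDED = "succeeded"


class FakeProcessor:
    """Hypothetical processor that records charges by idempotency key."""

    def __init__(self):
        self.by_key = {}

    def charge(self, key, amount, lose_response=False):
        self.by_key.setdefault(key, {"amount": amount, "status": "succeeded"})
        if lose_response:
            raise TimeoutError("processor did the work, but we never saw the reply")
        return self.by_key[key]

    def lookup(self, key):
        return self.by_key.get(key)  # None: the processor never received it


def submit(payments, processor, key, amount, lose_response=False):
    payments[key] = State.SUBMITTED  # record the step BEFORE calling out
    try:
        processor.charge(key, amount, lose_response=lose_response)
        payments[key] = State.SUCCEEDED
    except TimeoutError:
        pass  # stays SUBMITTED until resolve() checks with the processor


def resolve(payments, processor, key):
    """Check a SUBMITTED payment with the processor using the stored key."""
    if payments.get(key) is State.SUBMITTED:
        found = processor.lookup(key)
        # Received: mark done. Never received: safe to resubmit with the same key.
        payments[key] = State.SUCCEEDED if found else State.PENDING
    return payments[key]
```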
Regardless, I think your assumption about how the request/response cycle should be working is wrong. For this kind of API and transaction, the server should be returning a response immediately: 202 Accepted. The only thing the API server should be doing before returning is creating a row in a DB (with a "state" field with an initial value of "pending"), and pushing some work on a queue.
The server should not be sitting there with the HTTP request open, trying to complete the transaction, and only returning a response to the client when the transaction is finished or has encountered an error.
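A minimal sketch of the 202 pattern described above, using the standard-library `sqlite3` and `queue` modules as stand-ins for a real database and job queue. The handler and worker names, the `txn` table and the `status_url` field are hypothetical:

```python
import queue
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE txn (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
work = queue.Queue()  # stand-in for a real job queue


def handle_post(payload):
    """Hypothetical handler: record the job, enqueue it, return 202 right away."""
    txn_id = str(uuid.uuid4())
    with db:  # commits the insert before we respond
        db.execute("INSERT INTO txn (id, state) VALUES (?, 'pending')", (txn_id,))
    work.put((txn_id, payload))
    return 202, {"id": txn_id, "state": "pending", "status_url": f"/txns/{txn_id}"}


def worker_step():
    """One iteration of a background worker that finishes the real work."""
    txn_id, payload = work.get()
    # ... call the payment processor with `payload` here ...
    with db:
        db.execute("UPDATE txn SET state = 'succeeded' WHERE id = ?", (txn_id,))
    return txn_id
```

The HTTP request is closed as soon as the row and the queue entry exist; everything slow happens in `worker_step`.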
The client will have to learn how the transaction progresses outside of this initial request. There are many options here: polling, webhooks, a message queue like Kinesis or Kafka, etc.
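Of those options, polling is the simplest to sketch. This assumes a hypothetical `get_status(txn_id)` callable wrapping something like a GET on the status URL, and "succeeded"/"failed" as the terminal states:

```python
import time

TERMINAL = {"succeeded", "failed"}


def poll_until_done(get_status, txn_id, interval=0.5, max_interval=8.0, timeout=60.0):
    """Poll a hypothetical status endpoint with capped exponential backoff."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_status(txn_id)
        if state in TERMINAL:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{txn_id} is still {state!r}")
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # back off, but not forever
```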
Idempotency-Key handling should not replay the response (well, it depends). But it also should not fail with a 409. You need to be content-aware before adding Idempotency-Key header handling.
What happens when the request is received and handled, but the TCP connection drops unexpectedly while the response body is being written, and a second or two later the connection is reestablished? How do the two sides agree that the previous request was accepted and everything is good to go? That's what the Idempotency-Key header does.
An HTTP request comes in with a certain idempotency key. The server returns 202, as you say, and begins to process the database transaction.
While the server is still processing the database transaction, a second HTTP request comes in with the same idempotency key. What response does this second HTTP request get? The original transaction that the first HTTP request triggered hasn't succeeded and hasn't failed, so it doesn't fall into either of the categories in the post I responded to.
Your answer is that the second HTTP request gets a 409, which makes sense to me, although others are objecting to it.
No no no no no.
You have multiple clients submitting the same business operation simultaneously. One must succeed, the others must fail. If you're using the 409 approach ("notify client that request is redundant") you must not send a 409 code until the work is complete.
The client must interpret 200 and 409 as success cases. 200 means "it was done" and 409 means "it was already done". Clients looping (say, processing durable queue messages) can stop when they receive these responses.
If the work is not complete, you can't return 409, or clients will think the work is done. You will lose messages.
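On the client side, the rule above amounts to treating both codes as "done". A minimal sketch of a queue consumer, assuming a hypothetical `send(msg)` that returns the HTTP status code and a server that only sends 409 once the original work is complete:

```python
SUCCESS_CODES = {200, 409}  # 200 = "it was done", 409 = "it was already done"


def drain(messages, send):
    """Hypothetical consumer: ack a message only on 200 or 409, keep the rest."""
    acked, retry = [], []
    for msg in messages:
        status = send(msg)
        (acked if status in SUCCESS_CODES else retry).append(msg)
    return acked, retry
```

If the server returned 409 while the original request was still in flight, this loop would ack a message whose work might yet fail, which is exactly the lost-message case described above.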
But, rather than 409, I'd say that you should be using optimistic concurrency control if you adopt this perspective. There should be a resource context for the request, so the client can obtain an ETag, send it back in an If-Match header (or If-None-Match: * when creating), and get a 412 response if things are out of sync. That allows them to retry a failed/lost request and safely prevent a double action.
On a 412, they have to step back and retry a larger loop where they GET fresh state and prepare a new action. It's just like DB transaction programming, where a failed commit means you roll back, wipe the slate clean, and start a whole new read of transaction-protected state leading up to your new mutation request.
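A minimal in-memory sketch of that ETag loop. `Resource` and `apply_change` are hypothetical; the ETag here is a version counter (so a value that changes and changes back still gets a new tag), and the status codes follow the usual HTTP meaning of If-Match:

```python
class Resource:
    """Hypothetical resource whose ETag is a monotonically increasing version."""

    def __init__(self, state):
        self.state = dict(state)
        self.version = 1

    def etag(self):
        return f'"v{self.version}"'

    def get(self):
        return 200, dict(self.state), self.etag()

    def put(self, new_state, if_match):
        if if_match != self.etag():
            return 412  # Precondition Failed: the state has moved on
        self.state = dict(new_state)
        self.version += 1
        return 200


def apply_change(resource, mutate, max_attempts=5):
    """GET fresh state, prepare a new action, PUT with If-Match; on 412, start over."""
    for _ in range(max_attempts):
        _, state, etag = resource.get()
        new_state = mutate(state)
        if resource.put(new_state, if_match=etag) == 200:
            return new_state
    raise RuntimeError("gave up after repeated 412 responses")
```

A replayed PUT that carries the old ETag gets a 412 instead of being applied twice.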
That doesn't mean that idempotency keys have to be used. You can certainly hash message content if that is documented behavior. That probably only makes sense when there is already some logical session or transaction identifier that makes dedupe semantics clear.
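If content hashing is the documented behavior, the key should be derived from canonicalized content and scoped to that session or transaction identifier. A sketch assuming JSON message bodies; `content_key` and `session_id` are hypothetical names:

```python
import hashlib
import json


def content_key(session_id, message):
    """Derive a dedupe key from a session id plus canonicalized message content."""
    # sort_keys + fixed separators so {"a":1,"b":2} and {"b":2,"a":1} hash the same
    canonical = json.dumps(message, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{session_id}\n{canonical}".encode()).hexdigest()
```

Without the session scope, two legitimately distinct but identical orders (the same item bought twice) would collapse into one.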
The system you propose might be sound and might be necessary in some systems, but I can't think of what they might be that wouldn't be better served by the simpler solution that is already widely used for this purpose.
If it processed 99% of the request and the final bookkeeping failed because of a duplicate, that's still a failed request.
Arguably this should be the primary way you check for idempotent requests - you shouldn't have a separate check for existence, you should have the insert/update fail atomically.
This is the same issue you see with TOCTOU security holes on filesystems - the right way is to access and modify atomically, once, and you only learn that the request was already processed because that operation fails.
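A minimal sketch of "let the insert fail atomically" using `sqlite3` and a primary-key constraint on the idempotency key; the `orders` table and `create_order` are hypothetical. There is no SELECT before the INSERT, so there is no check-then-act window:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (idem_key TEXT PRIMARY KEY, item TEXT NOT NULL)")


def create_order(idem_key, item):
    """Insert and detect duplicates in one atomic step."""
    try:
        with db:  # commit on success, roll back on error
            db.execute(
                "INSERT INTO orders (idem_key, item) VALUES (?, ?)", (idem_key, item)
            )
        return "created"
    except sqlite3.IntegrityError:
        return "duplicate"  # the constraint, not a prior check, told us
```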
Even if you have a complex, long-running, multistep orchestration problem, you can break it down into simpler transactions. E.g., you could start with a "lock the resources" txn.
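One way such a "lock the resources" first step might look, sketched with `sqlite3` (the `seats` table and `lock_seats` are hypothetical). Each claim is a conditional UPDATE, all claims share one transaction, and a retry by the same request key succeeds harmlessly:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seats (id INTEGER PRIMARY KEY, held_by TEXT)")
db.executemany("INSERT INTO seats (id, held_by) VALUES (?, NULL)", [(1,), (2,), (3,)])
db.commit()


def lock_seats(seat_ids, request_key):
    """Claim every seat in one transaction, or none of them."""
    with db:  # any exception below rolls back all claims made so far
        for seat in seat_ids:
            cur = db.execute(
                "UPDATE seats SET held_by = ? "
                "WHERE id = ? AND (held_by IS NULL OR held_by = ?)",
                (request_key, seat, request_key),
            )
            if cur.rowcount != 1:
                raise RuntimeError(f"seat {seat} is held by another request")
    return True
```

Later steps (charge, confirm, release) can each be their own small transaction keyed by the same request key.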
But 99% of these conversations around idempotence are simple POST operations like "create order" that regular old database concurrency management handles just fine.
That doesn't answer my question. What response do you return to the client in the case I described?
But your follow-up responses here are making me rethink. Now you have to have all these special cases where the original request is still in process. I think your assertion that "99% are simple POST operations" is bullshit. For the times when idempotency is hard and really matters, you're often calling a third-party API, like a payment processing API.
I would think a better approach would be to always return a 409 on a subsequent request, regardless of whether it passed or failed, and then have a separate standard API that lets you get the result of any request by its idempotency key.
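A minimal sketch of that alternative, assuming a hypothetical `KeyedServer` with a `submit` call and a separate `lookup` by key. The work runs inline here only to keep the example short; in practice it would be asynchronous:

```python
class KeyedServer:
    """Hypothetical server: any repeat gets 409; outcomes are fetched separately."""

    def __init__(self):
        self.results = {}  # idempotency key -> "in_progress" or the final result

    def submit(self, key, do_work):
        if key in self.results:
            return 409, None  # seen before, regardless of pass, fail or in flight
        self.results[key] = "in_progress"
        self.results[key] = do_work()  # would normally run asynchronously
        return 200, self.results[key]

    def lookup(self, key):
        """Separate status API: get the result of any request by its key."""
        if key not in self.results:
            return 404, None
        return 200, self.results[key]
```

A client that gets a 409 never has to guess: it calls `lookup` with the same key.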
Devs are too scared to be nice (i.e., not return errors) to clients when they misbehave.
The pattern I describe was the dominant design pattern for financial transaction processing systems before Stripe. Stripe's API makes life for the clients slightly easier at the expense of making life for servers more complicated, but the two approaches are equivalent in function.