Idempotency in Payment Gateways: Why Your Money Moves Exactly Once
Idempotency is one of those concepts that quietly determines whether your payment system is trustworthy or a constant source of chaos. If you’ve never worked with it directly, it’s worth understanding now, before it becomes a problem you’re forced to solve under pressure.
If you’ve ever built a payment system and woken up at 3 AM to a customer complaint about being charged twice, you already understand why idempotency matters. If you haven’t, consider this your warning.
Payment gateways operate in the most unforgiving corner of distributed systems. Networks drop packets, clients retry aggressively, load balancers time out, and mobile users tap “Pay” three times because the spinner looked frozen. In this environment, the difference between a reliable system and a lawsuit comes down to a single property: idempotency.
What Idempotency Actually Means
An operation is idempotent if performing it once produces the same result as performing it N times. DELETE /users/42 is naturally idempotent: the user is gone whether you call it once or ten times. SET counter = 5 is idempotent. INCREMENT counter is not.
Payments are inherently non-idempotent. “Charge this card $100” executed twice charges $200. The engineering work is to build an idempotent layer on top of that non-idempotent operation, so that retries, duplicates, and network chaos can’t corrupt the financial state.
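A small Python sketch makes the distinction concrete. The helper names and the in-memory ledger are illustrative assumptions, not part of any real gateway:

```python
def set_counter(state: dict, value: int) -> None:
    # Idempotent: repeating the call leaves the same final state.
    state["counter"] = value


def increment_counter(state: dict) -> None:
    # Not idempotent: every repeat changes the state again.
    state["counter"] += 1


def charge(ledger: list, amount_cents: int) -> None:
    # A naive charge behaves like INCREMENT: each call adds another entry.
    ledger.append(amount_cents)


state = {"counter": 0}
for _ in range(3):
    set_counter(state, 5)
print(state["counter"])  # 5

for _ in range(3):
    increment_counter(state)
print(state["counter"])  # 8

ledger: list = []
charge(ledger, 10_000)
charge(ledger, 10_000)  # a blind retry of the same $100 charge
print(sum(ledger))  # 20000, i.e. $200 charged
```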
The Failure Modes You’re Actually Defending Against
Before discussing solutions, it’s worth being precise about what goes wrong:
The retried request. Client sends a payment request. The request reaches your server and succeeds, but the response is lost to a network blip. The client, seeing no response, retries. Without idempotency, you’ve just double-charged the customer.
The concurrent click. User double-taps the “Pay Now” button. Two requests fly out within milliseconds of each other. Both hit your server. Both pass validation because neither has committed yet. Both create charges.
The upstream retry. Your gateway accepts a request, begins processing, and your M-Pesa or Stripe integration experiences a timeout. You retry the downstream call. The provider actually succeeded the first time—your retry creates a second transaction on their side.
The webhook replay. Payment providers deliver callbacks with at-least-once semantics. The same “payment.succeeded” event may arrive three times. If your handler credits the user’s wallet each time, you’ve handed out free money.
The race on the same resource. Two requests arrive simultaneously to debit the same wallet. Both read a balance of $100, both check that $100 is sufficient for an $80 withdrawal, and both write a new balance of $20. Two withdrawals went out but only one was deducted: you’ve just lost $80.
Each of these requires a slightly different defense. Idempotency is a family of techniques, not a single switch.
The Idempotency Key Pattern
The canonical approach: the client generates a unique key (typically a UUID) per logical operation and sends it in a header—usually Idempotency-Key. The server stores the key alongside the response. If the same key arrives again, the server returns the cached response instead of processing a new operation.
A minimal flow looks like this:
1. Client generates UUID: “a7f3c2e1-...”
2. POST /payments with Idempotency-Key: a7f3c2e1-...
3. Server checks: has this key been seen?
- No → process payment, store (key, response), return response
- Yes → return stored response, do not process again
The critical properties:
The key scope is the logical operation, not the HTTP request. The same key across client retries must return the same result.
The key has a lifetime. Stripe keeps them for 24 hours. Most internal systems keep them 24–48 hours. Keys older than that can be cleared.
The key is bound to request parameters. If a client reuses a key with different parameters (e.g., a different amount), that’s a client bug, and the server should reject it with 422.
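The flow and the parameter-binding rule can be sketched in a few lines of Python. This is a single-process illustration only: the in-memory dictionary, the `create_payment` function, and the 201 status code are assumptions made for the example, and nothing here is safe under concurrent requests.

```python
import hashlib
import json
import uuid

# Hypothetical in-memory store: key -> (request fingerprint, stored response).
_seen: dict = {}
charges: list = []  # stand-in for the real side effect (money moving)


def fingerprint(params: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    body = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def create_payment(idempotency_key: str, params: dict) -> tuple:
    fp = fingerprint(params)
    if idempotency_key in _seen:
        stored_fp, stored_response = _seen[idempotency_key]
        if stored_fp != fp:
            # Same key, different parameters: a client bug, so reject it.
            return 422, {"error": "idempotency key reused with different parameters"}
        return 201, stored_response  # replay: no new charge is created
    charges.append(params["amount"])
    response = {"payment_id": str(uuid.uuid4()), "amount": params["amount"]}
    _seen[idempotency_key] = (fp, response)
    return 201, response


key = str(uuid.uuid4())
first = create_payment(key, {"amount": 10000, "currency": "USD"})
retry = create_payment(key, {"currency": "USD", "amount": 10000})
mismatch = create_payment(key, {"amount": 99900, "currency": "USD"})

print(first == retry)  # True
print(len(charges))    # 1
print(mismatch[0])     # 422
```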
Where the Pattern Gets Interesting: Concurrency
Storing a key and checking for it sounds simple until two requests with the same key arrive at the same millisecond. Both check the store, both find no key, both proceed to process the payment, both insert. You’ve reproduced the exact bug you were trying to prevent.
The fix is atomic acquisition. Three common implementations:
Database unique constraint. Create an idempotency_keys table with a unique index on the key column. Insert the key as the first step of processing, inside the transaction. The second request’s insert fails with a constraint violation, and you handle that as “duplicate—fetch and return the original response.”
CREATE TABLE idempotency_keys (
    key          TEXT PRIMARY KEY,
    request_hash TEXT NOT NULL,
    response     JSONB,
    status       TEXT NOT NULL,  -- 'in_progress' | 'completed'
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Redis SETNX with TTL. Fast and simple: SET key value NX EX 86400. Returns OK if the key didn’t exist, nil if it did. This acts as a distributed lock with a built-in expiry.
SET idem:a7f3c2e1-... "in_progress" NX EX 86400
But here’s the trap, and it’s one worth lingering on: SETNX alone is not enough if the operation spans multiple steps. If the first request acquires the lock, begins processing, and crashes before writing the result, the second request will see “in_progress” and have to decide: wait? fail? retry?
This is the exact race condition that bites backend engineers in production—you acquire the lock, do the work, and if anything between those two moments fails without cleanup, you’re left with a zombie key blocking legitimate retries. The resolution is to treat the idempotency record as a state machine with pending → completed | failed transitions, persist state at each step, and make sure the “in_progress” state has a sensible timeout shorter than the key’s TTL.
A state table with explicit status. Combine the database constraint with a status column. First request inserts (key, ‘pending’) atomically. On completion, it updates to (key, ‘completed’, response). A second request arriving during processing sees ‘pending’ and can either poll briefly or return 409.
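A minimal sketch of the state-table approach, using SQLite from the Python standard library so it runs anywhere. The function names, the 60-second pending timeout, and the decision to let a failed key be re-claimed are all assumptions for illustration; a production system would make its own choices here.

```python
import json
import sqlite3
from typing import Optional, Tuple

PENDING_TIMEOUT_SECONDS = 60  # assumed; keep it well below the key's TTL

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    " key TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " response TEXT,"
    " started_at REAL NOT NULL)"
)


def begin(key: str, now: float) -> Tuple[str, Optional[dict]]:
    """Return ("process", None), ("replay", response), or ("conflict", None)."""
    try:
        with conn:  # the primary key makes this insert the atomic claim
            conn.execute(
                "INSERT INTO idempotency_keys (key, status, started_at)"
                " VALUES (?, 'pending', ?)",
                (key, now),
            )
        return "process", None
    except sqlite3.IntegrityError:
        pass  # the key already exists; inspect its state below
    status, response, started_at = conn.execute(
        "SELECT status, response, started_at FROM idempotency_keys WHERE key = ?",
        (key,),
    ).fetchone()
    if status == "completed":
        return "replay", json.loads(response)
    if status == "pending" and now - started_at <= PENDING_TIMEOUT_SECONDS:
        return "conflict", None
    # Stale pending or failed: re-claim only if the row is unchanged since we read it.
    with conn:
        cur = conn.execute(
            "UPDATE idempotency_keys SET status = 'pending', started_at = ?"
            " WHERE key = ? AND status = ? AND started_at = ?",
            (now, key, status, started_at),
        )
    return ("process", None) if cur.rowcount == 1 else ("conflict", None)


def finish(key: str, status: str, response: dict) -> None:
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = ?, response = ? WHERE key = ?",
            (status, json.dumps(response), key),
        )


print(begin("a7f3", now=0.0))    # ('process', None)
print(begin("a7f3", now=5.0))    # ('conflict', None)
print(begin("a7f3", now=120.0))  # ('process', None), the zombie key is re-claimed
finish("a7f3", "completed", {"payment_id": "p_1"})
print(begin("a7f3", now=121.0))  # ('replay', {'payment_id': 'p_1'})
```

Re-claiming a stale key is only safe if the work itself can be repeated without harm, because the crashed attempt may already have reached the payment provider.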
Downstream Idempotency: The Part People Forget
Storing the key in your own database protects you from duplicate incoming requests. It does nothing to protect you from duplicate outgoing calls to payment providers.
Consider: your server calls Stripe’s charge endpoint. The call times out after 30 seconds. Did Stripe receive it? Did it succeed? You don’t know. If you retry naively, you might charge twice. If you don’t retry, you might leave a transaction in limbo.
The answer is to propagate idempotency downstream. Stripe, PayPal, and most mature providers accept an Idempotency-Key header of their own. M-Pesa’s Daraja API uses a similar concept via the OriginatorConversationID. When you call the provider, send a deterministic key derived from your internal transaction ID. If you retry, you send the same key, and the provider de-duplicates on their side.
The chain looks like:
Client idempotency key  →  Your gateway  →  Provider idempotency key
(per user action)          (stored)         (per internal txn)
Each link in the chain needs to be idempotent independently. A gap anywhere leaks duplicates.
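One way to get a deterministic provider key is to derive it from the internal transaction ID with a name-based UUID. The sketch below assumes a hypothetical namespace (`payments.example.com`) and uses a `FakeProvider` class in place of a real provider SDK, purely to show that a retry carrying the same derived key does not create a second charge.

```python
import uuid

# Assumed namespace for this gateway; any fixed UUID works as long as it never changes.
GATEWAY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "payments.example.com")


def provider_idempotency_key(internal_txn_id: str, step: str = "charge") -> str:
    """Derive the same provider key every time for the same transaction and step."""
    return str(uuid.uuid5(GATEWAY_NAMESPACE, f"{internal_txn_id}:{step}"))


class FakeProvider:
    """Hypothetical provider that de-duplicates on the key it receives."""

    def __init__(self) -> None:
        self.charges: dict = {}

    def charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key not in self.charges:
            charge_id = f"ch_{len(self.charges) + 1}"
            self.charges[idempotency_key] = {"charge_id": charge_id, "amount": amount}
        return self.charges[idempotency_key]


provider = FakeProvider()
first = provider.charge(provider_idempotency_key("txn_1001"), 10000)
# Our call "timed out", so we retry. The key is re-derived, not regenerated.
second = provider.charge(provider_idempotency_key("txn_1001"), 10000)

print(first == second)        # True
print(len(provider.charges))  # 1
```

Deriving the key, rather than storing a random one, means a process that crashed and restarted can still produce the same key from the transaction ID alone.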
Webhooks: Idempotency in the Other Direction
Outgoing payment flows are only half the problem. Payment providers send you webhooks, and those webhooks are explicitly documented as at-least-once delivery. Stripe will retry a webhook for up to three days on non-2xx responses.
Your webhook handler must be idempotent on the event ID, not on the payment ID. If you key on the payment ID and a provider sends both a payment.authorized and a payment.captured for the same payment, you’ll process one and drop the other.
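def handle_webhook(event):
    # Fast path: this event ID was already recorded, so acknowledge and stop.
    if events_table.exists(event.id):
        return 200
    # Record the event and apply its effects in one transaction, so a crash
    # cannot leave the event marked as seen without its effects applied.
    with transaction():
        events_table.insert(event.id)  # assumes a unique constraint on the ID
        apply_event(event)
    return 200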
The insert and the application must share a transaction. If they don’t, a crash between them means you’ve recorded the event as seen but haven’t applied its effects.
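To see why the shared transaction matters, here is a runnable sketch using SQLite. The table names, the event shape, and the `crash_before_commit` flag are assumptions for the demonstration; the flag simulates a failure after both writes but before commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE wallets (user_id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO wallets VALUES ('alice', 0)")
conn.commit()


def handle_webhook(event: dict, crash_before_commit: bool = False) -> int:
    try:
        with conn:  # one transaction: both writes commit, or neither does
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event["id"],)
            )
            conn.execute(
                "UPDATE wallets SET balance = balance + ? WHERE user_id = ?",
                (event["amount"], event["user_id"]),
            )
            if crash_before_commit:
                raise RuntimeError("simulated crash")
    except sqlite3.IntegrityError:
        return 200  # duplicate delivery: already applied, just acknowledge
    except RuntimeError:
        return 500  # rolled back, so the provider's redelivery is safe
    return 200


def balance_of(user_id: str) -> int:
    row = conn.execute(
        "SELECT balance FROM wallets WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]


event = {"id": "evt_1", "user_id": "alice", "amount": 500}
print(handle_webhook(event, crash_before_commit=True))  # 500
print(balance_of("alice"))  # 0
for _ in range(3):
    handle_webhook(event)
print(balance_of("alice"))  # 500
```

The failed first delivery leaves no trace, so the redelivered event is applied once, and the later duplicates are absorbed by the primary key.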
Idempotency vs. Concurrency: Not the Same Thing
A subtle point that trips up many implementations: idempotency keys prevent duplicate operations, but they don’t prevent concurrent operations on the same resource.
If Alice triggers a withdrawal with key A and Bob (on Alice’s account, perhaps via a shared session) triggers a different withdrawal with key B, both keys are new. Idempotency doesn’t help here. You need a separate mechanism—typically a row-level lock, a Redis distributed lock on the resource ID, or optimistic concurrency via a version column—to serialize operations on the same wallet.
A production payment system needs both:
Idempotency keys to make retries of the same logical operation safe.
Resource locks to make concurrent operations on the same account safe.
Conflating them leads to bugs. Idempotency protects the write; resource locking protects the invariant.
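As one illustration of the resource-level defense, the following sketch uses optimistic concurrency with a version column, again on SQLite. The schema and function names are assumptions; a row-level lock or a distributed lock would serve the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wallets ("
    " id TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL,"
    " version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO wallets VALUES ('alice', 100, 0)")
conn.commit()


def read_wallet(wallet_id: str) -> tuple:
    return conn.execute(
        "SELECT balance, version FROM wallets WHERE id = ?", (wallet_id,)
    ).fetchone()


def withdraw(wallet_id: str, amount: int, snapshot: tuple) -> bool:
    """Apply a withdrawal computed from `snapshot`; fail if the row changed since."""
    balance, version = snapshot
    if balance < amount:
        return False
    with conn:
        cur = conn.execute(
            "UPDATE wallets SET balance = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (balance - amount, wallet_id, version),
        )
    # Zero rows updated means another writer got there first.
    return cur.rowcount == 1


# Two requests with different idempotency keys read the same snapshot.
snap_a = read_wallet("alice")
snap_b = read_wallet("alice")
ok_a = withdraw("alice", 80, snap_a)
ok_b = withdraw("alice", 80, snap_b)

print(ok_a, ok_b)              # True False
print(read_wallet("alice")[0])  # 20
```

The losing request would typically re-read the wallet and re-validate, at which point the balance check fails honestly.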
What the Response Should Look Like
When a duplicate key arrives, what do you return? The answer has nuance.
Return the original response, byte-for-byte if possible, with the same status code. The client sent this request because it didn’t know the original outcome; giving them the original outcome is exactly what they need.
Some teams add a header like Idempotent-Replayed: true so the client can log that a retry was collapsed, which is useful for debugging but not required.
If the request arrives with the same key but different parameters, return 422 Unprocessable Entity with a clear error. The client has a bug and silently succeeding would mask it.
If the key matches an operation still in_progress, you have a design choice: return 409 Conflict and let the client poll, or block briefly and return the result when it completes. The former is cleaner for high-throughput systems; the latter is friendlier for simple clients.
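These rules fit in one small decision function. The record layout (`request_hash`, `status`, `status_code`, `body`) is an assumed shape for the stored idempotency row, and this sketch takes the 409 branch for in-progress requests rather than blocking.

```python
from typing import Dict, Tuple

Response = Tuple[int, Dict[str, str], dict]  # (status code, headers, body)


def respond_to_existing_key(record: dict, request_hash: str) -> Response:
    """Decide the response when the idempotency key is already on file."""
    if record["request_hash"] != request_hash:
        return 422, {}, {"error": "Idempotency-Key reused with different parameters"}
    if record["status"] == "in_progress":
        return 409, {}, {"error": "original request is still being processed"}
    # Completed: replay the stored status code and body unchanged.
    return record["status_code"], {"Idempotent-Replayed": "true"}, record["body"]


done = {
    "request_hash": "abc",
    "status": "completed",
    "status_code": 201,
    "body": {"payment_id": "p_1"},
}
running = {"request_hash": "abc", "status": "in_progress"}

print(respond_to_existing_key(done, "abc")[0])     # 201
print(respond_to_existing_key(done, "xyz")[0])     # 422
print(respond_to_existing_key(running, "abc")[0])  # 409
```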
Schema and TTL Considerations
A few operational notes from running these systems:
Bound key size. UUIDs are fine. Don’t accept arbitrary client-generated strings of unbounded length—you’ll eventually see a 10MB “key” from a misconfigured client.
Hash the request body. Store a hash of the request payload alongside the key so you can detect key reuse with different parameters. SHA-256 is overkill but cheap.
Set a TTL and enforce it. Keys accumulate forever otherwise. A background job that purges keys older than 48 hours, or a Redis TTL, keeps the table manageable.
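Index on (key) and nothing else on the request path. The access pattern there is exact-match on key. No range scans, no secondary lookups. Keep the table lean. The purge job’s filter on created_at is the one place a second index may be worth its cost.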
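Two of these notes, bounding the key and purging by age, are easy to sketch. The 255-character limit, the allowed alphabet, and the use of epoch seconds in `created_at` are assumptions chosen for the example, not requirements.

```python
import re
import sqlite3
import uuid

# Assumed policy: 1 to 255 characters from a conservative alphabet (covers UUIDs).
KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,255}")
TTL_SECONDS = 48 * 3600


def is_valid_key(key: str) -> bool:
    return KEY_PATTERN.fullmatch(key) is not None


def purge_expired(conn: sqlite3.Connection, now: float) -> int:
    """Delete keys older than the TTL and return how many were removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM idempotency_keys WHERE created_at < ?",
            (now - TTL_SECONDS,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, created_at REAL NOT NULL)"
)
conn.execute("INSERT INTO idempotency_keys VALUES ('old-key', 0)")
conn.execute("INSERT INTO idempotency_keys VALUES ('fresh-key', 200000)")
conn.commit()

print(is_valid_key(str(uuid.uuid4())))     # True
print(is_valid_key("x" * 10_000_000))      # False
print(purge_expired(conn, now=200_000.0))  # 1
```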
Testing Idempotency Is Harder Than It Looks
The nasty truth about idempotency bugs is that they hide. Everything works perfectly in development because you never experience the exact timing window that causes a duplicate. Then production traffic finds the window in three hours.
The tests that actually catch these bugs:
Retry under response loss. Send a request, drop the response before it reaches the client, then retry with the same key. Assert exactly one side effect.
Kill mid-transaction. Start processing, crash the process, restart, retry with the same key. Assert recovery without duplication.
Webhook replay. Deliver the same webhook five times. Assert one ledger entry.
Concurrent same-key bombardment. 50 requests with the same key fired simultaneously. Assert one is processed and the other 49 return the cached response (or a 409 while the first is still in progress, if that is your design).
Chaos-style testing—deliberately introducing latency, drops, and restarts in a staging environment—catches what unit tests cannot.
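The same-key bombardment test might look like the sketch below. The `Gateway` class is a hypothetical stand-in for the system under test, with a lock playing the role of the atomic insert; the part worth copying is the test shape, where a barrier releases all threads at once.

```python
import threading


class Gateway:
    """Hypothetical system under test; the lock stands in for an atomic insert."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: dict = {}
        self.side_effects = 0

    def pay(self, key: str, amount: int) -> dict:
        with self._lock:
            if key in self._records:
                return self._records[key]  # cached response, no new charge
            self.side_effects += 1
            response = {"status": "succeeded", "amount": amount}
            self._records[key] = response
            return response


def test_same_key_bombardment(n: int = 50) -> Gateway:
    gateway = Gateway()
    results: list = []
    barrier = threading.Barrier(n)

    def worker() -> None:
        barrier.wait()  # hold every thread until all n are ready
        results.append(gateway.pay("same-key", 10000))

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert gateway.side_effects == 1
    assert len(results) == n
    assert all(r == results[0] for r in results)
    return gateway


test_same_key_bombardment()
```

Against a real service the workers would issue HTTP requests, and the side-effect count would come from the ledger or the provider sandbox rather than an in-process counter.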
Bringing It Together
A payment gateway that handles retries, concurrency, and replays correctly looks something like this at the request level:
1. Client sends request with Idempotency-Key.
2. Server atomically inserts key with status pending (unique constraint).
3. If insert fails, fetch existing record:
- completed → return cached response.
- pending → return 409 or wait.
4. Acquire lock on affected resources (wallet, account).
5. Validate, call downstream provider with derived idempotency key.
6. On success, update record to completed with response, release lock.
7. On failure, update to failed with error, release lock. Client may retry with a new key or the same one depending on error class.
Every step in that flow is a place something can go wrong. The reason idempotency is a discipline rather than a feature is that every one of those steps needs its own defense.
The Mindset
Idempotency isn’t a library you import or a header you add. It’s a property your system either has or doesn’t, and it has to be designed in from the ledger up. Every write path needs to be examined: can this be replayed safely? Every integration needs a contract: what happens if I call this twice? Every webhook handler needs a fingerprint: have I seen this event before?
Get it right, and your system quietly absorbs the chaos of real-world networks. Get it wrong, and you’ll spend weekends reconciling ledgers, refunding customers, and explaining to finance why the numbers don’t add up.
The math of money is unforgiving. Build the system so that even when everything retries, times out, duplicates, and replays, the money moves exactly once.
