I'll hang around here to answer any more technical questions if anyone's interested.
It does sound weird that it was caused by a corrupted date formatter. I'm assuming it's something like the formatter resetting itself back to the language default.
There was this massive Java application processing hundreds of parallel requests per second. For each request it wrote a line with billing information into a log file. Those lines looked fine at a quick glance, but when we tried processing them later we encountered invalid dates. Those log records contained future dates as well as impossible dates like 2019-02-30. Long story short: In the end we figured out that this was caused by the date formatting not being thread-safe (might have been SimpleDateFormat, but I don't remember the details anymore), so the date components from multiple threads got interleaved. Ouch, I guess somebody learned a lesson back then.
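To illustrate how interleaving can produce a date like 2019-02-30, here is a small deterministic Python simulation. It is only a sketch: `ToyFormatter`, `format_steps`, `format_interleaved` and `format_isolated` are hypothetical names, the "threads" are generators stepped by hand, and it does not reproduce the real internals of SimpleDateFormat or of the application described above.

```python
class ToyFormatter:
    """Hypothetical formatter that keeps its working fields on the instance."""

    def __init__(self):
        self.year = self.month = self.day = 0


def format_steps(fmt, year, month, day):
    """Model one thread's format call; each yield is a point where another thread may run."""
    fmt.year = year
    yield None
    fmt.month = month
    yield None
    fmt.day = day
    yield None
    # Render from whatever the (possibly shared) fields hold right now.
    yield f"{fmt.year:04d}-{fmt.month:02d}-{fmt.day:02d}"


def format_interleaved(shared):
    """Thread A formats 2019-01-30 while thread B starts formatting 2019-02-15."""
    a = format_steps(shared, 2019, 1, 30)
    b = format_steps(shared, 2019, 2, 15)
    for _ in range(3):
        next(a)  # A writes year, month and day into the shared fields
    for _ in range(2):
        next(b)  # B overwrites year and month before A has rendered
    return next(a)  # A renders its own day combined with B's month


def format_isolated():
    """Same schedule, but each thread has its own formatter instance."""
    a = format_steps(ToyFormatter(), 2019, 1, 30)
    b = format_steps(ToyFormatter(), 2019, 2, 15)
    for _ in range(3):
        next(a)
    for _ in range(2):
        next(b)
    return next(a)


print(format_interleaved(ToyFormatter()))  # 2019-02-30
print(format_isolated())                   # 2019-01-30
```

The usual fixes are the ones the isolated variant hints at: one formatter per thread (or per call), or a formatter that is immutable by design.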
Cheers for this blog post, by the way. It was really informative about the issue, and about how FPS works.
FPS uses ISO8583 for its messaging format, and I suspect at the edge the Gateway translates it to a BERT blob for passing around internally.
Sorry, but I can't name our partner.
As far as I understand, PayPort was/is the recommended option for all new "direct" connections. Though it seems that it is also possible to connect more directly to the FPS system itself.
The important thing to understand is "clearance and settlement". Banks either maintain accounts with each other ("nostro/vostro accounts") or, within a country, at the central bank. So e.g. Halifax and Monzo will have accounts at the Bank of England.
Settlement will either be immediate or delayed. For immediate, at the same time as Halifax is sending a "please credit £10 to Bob" message to Monzo, they will send a message to the Bank of England to transfer £10 between their account and Monzo's.
For delayed settlement, the banks wait until the end of the day, add up the total money sent in each direction, and transfer only the difference.
A lot of work goes into making sure all the necessary entries line up. So, in the example, if the bank sent a payment message but didn't debit their user's account, either they would have made the central bank transfer (in which case they've lost £10 and effectively given it to their user's account), or they haven't, in which case Monzo will notice and demand payment for the discrepancy.
Banking is eventually consistent, and has been for centuries.
The standard money transfer from Bob to Joe is a deal where the bank says "ok, Bob, we owed you $100 but if you want that then we'll now owe that $100 to Joe instead".
It's also worth noting that the balance is just a record of the debt, not the debt itself. There has to be some legal basis for a transaction to actually change the liability between the bank and the account holder; simply changing the balance in a database doesn't change the amount of debt, only the record (the "bank's opinion") of that debt. If that record is wrong, the balance can and will be disputed, and if the dispute can't be resolved otherwise, it'll be up to the courts to decide whether the debt is valid.
If you record just the credit without the debit, then it's the equivalent of the bank unilaterally agreeing to new debt: the bank asserting that it now owes $100 to Joe just because. It's free to do that, but it would mean that its "books won't balance", i.e. its accounting isn't consistent with itself and doesn't match reality. To properly account for that transaction they'd have to book a debit to their profit & loss statement, since they lost money by acknowledging that balance increase (i.e. debt) without an offsetting balance/debt decrease to someone else.
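A minimal Python sketch of the "books must balance" idea, under a simplified sign convention that is my assumption rather than anything a real core banking system uses: every balance is shown credit-positive (what the bank owes that party), so an asset such as reserves shows as negative. The `Ledger` class and the account names are hypothetical.

```python
from collections import defaultdict


class Ledger:
    """Toy double-entry ledger; amounts are whole currency units."""

    def __init__(self):
        self.balances = defaultdict(int)

    def post(self, entries):
        """Apply a list of (account, amount) entries that must sum to zero."""
        if sum(amount for _, amount in entries) != 0:
            raise ValueError("entries do not balance")
        for account, amount in entries:
            self.balances[account] += amount


ledger = Ledger()

# Bob deposits $100: the bank owes Bob 100, offset by 100 of reserves (an asset).
ledger.post([("bob", 100), ("reserves", -100)])

# Bob pays Joe: the bank now owes that $100 to Joe instead of Bob.
ledger.post([("bob", -100), ("joe", 100)])

# A credit with no matching debit is rejected, because the books would not balance.
try:
    ledger.post([("joe", 100)])
except ValueError as err:
    print("rejected:", err)

# To really acknowledge new debt to Joe, the bank has to book the loss itself.
ledger.post([("joe", 100), ("profit_and_loss", -100)])
```

After these postings Joe is owed 200, Bob is owed 0, and the bank's profit and loss account carries the 100 it gave away.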
The recipient bank will receive the money into their settlement account at that time. If the sending bank doesn't debit their customer then both sending and receiving customers will have the money in their accounts, but the sending bank will be out of pocket.
The settlement process through a central bank is a way of ensuring that banks don't need to literally send truckloads of cash to each other at the end of the day.
Monzo says to the central bank, "today I sent RBS £1,500,000", and RBS says to the central bank, "today I sent Monzo £1,200,000". So the central bank just debits Monzo's account with them by £300,000, and credits RBS's account with them by £300,000. The total amount in the central bank remains the same.
So, sure, a bank could claim they sent less money to another bank than they did, but eventually the numbers wouldn't add up, and it would trigger a bucketload of auditing, likely resulting in revocation of banking licenses, and legal issues for both the bank and people involved.
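The Monzo/RBS netting arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sums, not how any real settlement system is implemented; `net_positions` is a hypothetical helper and the figures are the ones from the comment.

```python
from collections import defaultdict


def net_positions(gross_flows):
    """Turn {(sender, receiver): amount} gross flows into one net position per bank."""
    positions = defaultdict(int)
    for (sender, receiver), amount in gross_flows.items():
        positions[sender] -= amount    # money the sender has paid out
        positions[receiver] += amount  # money the receiver is due
    return dict(positions)


todays_flows = {
    ("Monzo", "RBS"): 1_500_000,
    ("RBS", "Monzo"): 1_200_000,
}

positions = net_positions(todays_flows)
print(positions)  # {'Monzo': -300000, 'RBS': 300000}
```

The net positions always sum to zero, which is why the total held at the central bank doesn't change: it only moves between the banks' settlement accounts.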
https://en.wikipedia.org/wiki/Net_settlement
There is also a technique involving things called nostro/vostro accounts, where banks have money on deposit with each other, and the sending bank's deposit with the receiving bank is used to cover transfers:
https://en.wikipedia.org/wiki/Nostro_and_vostro_accounts
Of course, then they need to keep their accounts topped up, and they can do that by transfers through other banks, which might be central banks or commercial ones. The nostro/vostro system is suitable for use where banks don't trust each other so much, e.g. because they are in different countries. I think it was used more in the past, before reliable central settlement schemes were established, but I'm not sure.
You can think of net settlement as being a bit like nostro/vostro where the accounts have infinite free overdraft facilities, and so the banks never build up a credit balance, and just settle their debts at the end of the day.
For more about what makes a good apology, see https://withoutbullshit.com/?s=apology&submit=Search by Josh Bernoff, a former Forrester editor and a very direct writer.
Just yesterday a major high street bank stopped sending payments for an hour, and was telling customers on Twitter that there were no problems.
Hell, the central system (what I called the Hub in this article) had a 12 hour split brain meltdown last July which had banks emailing each other spreadsheets back and forth for two weeks afterwards.
This reminds me of the saying "Never admit a wrongdoing and you'll never be wrong".
It's great that we get several of those new startup banks (Monzo, N26 etc.) that provide superior experience and slowly show what horrible things traditional banks were getting away with.
An apology would have been nice, but I suppose unwarranted threats are more in character.
A+ job on handling the unfortunate situation, Monzo.
We can only hope more companies follow this great example.
Formal only generally comes across, to me, as cold and distant. Great for a persuasive essay or other mediums where you want to remove the topic from the author, not so great for communicating with your audience and wanting to come across as sincere.
If anything, a strict formal-voice only blog post would come across, to me, as contrived.
To each their own.
I had to look that one up: https://www.dailymotion.com/video/x15ij62 (3min)
1. Was this post-mortem part of an official process or something of an individual initiative? I saw it published on the blog, but it might be helpful to have this information disambiguated from marketing material on a separate site: https://status.cloud.google.com/summary
2. I'm not sure how payment processors work, but would having multiple payment processors from Monzo's interface make sense from a cost/benefit perspective?
3. Any plans to expand to the U.S. anytime soon, or recommend any banks that follow Monzo's best practices? ;-)
As another poster mentioned we already have a status page where we post about incidents as they happen (though obviously not in quite as much detail as here). Personally I think our main blog is a reasonable place to have this.
2. Multiple redundant payment processors would be great, but ultimately infeasible. As a settling FPS participant we have to have a single Bank of England settlement account, tied 1:1 to a "bank code". Multiple sort codes map to a single bank code, and migrating sort codes between bank codes is non-trivial.
It'd be great if we could migrate sort codes easily between redundant connections, but as we build our own Gateway we'll have complete control over how our failover mechanisms work. Here's to much greater uptime in the future!
3. As another commenter mentioned - yes! We're just doing staff testing for now, but we've got a waiting list up. It'll be a prepaid product issued by another bank before we get a US banking license, just like we were in the UK a couple of years ago.
The 1.55% rate is fixed term for 12 months with no withdrawals
There is a waitlist to join though.
The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
So apparently a dangling reference.
I'm wondering what you use to call these external processing APIs. I assume these are blocking calls.
TL;DR- Largely Go microservices running on k8s, with http-based RPC calls for synchronous communication, and kafka for asynchronous communication.
As for sending and receiving of this kind of payment message, they are largely async but it does depend on the payment system we're talking about. When we build our own FPS gateway we're going to have to have something to manage "sessions" (TCP connections) which will block waiting for a response to an individual payment message. Right now our communication with our third party Gateway is via a queue.
[1]: https://monzo.com/blog/2016/09/19/building-a-modern-bank-bac...
We have learnt a lot from you guys as we build out similar systems in India. Thank you for putting this stuff out!
Quick question that I have always wondered about - would you have used something like Uber Cadence (https://github.com/uber/cadence) as the core of your infrastructure if it had been available back then?
They already offer business accounts. I have one open for my Ltd company.
Starling does offer business accounts now, but you can only have one Person of Significant Control, i.e. over 25% owner. There is no monthly fee with their offering though, so it's probably the better offer.
What now? Their datacentre was ... rewriting (presumably) encrypted packets?
What I meant here is they could tell that the corruption was being introduced by some component in their infrastructure, and they were only observing it for messages passing through one of their two active-active sites.
It's a fine line between being understandable to laymen and people being pernickety, sadly.
Or, rather, unsafe access of memory managed by a garbage collector:
> The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
>
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
Being someone who also works in the payments space currently, relying on gateways, I have gone through several similar outages, where we detected a gateway issue causing an outage, notified the gateway who ack’d... and then we waited. More than one time, like Monzo, we built a workaround on our end, before the gateway provider could even mitigate the outage.
Hats off to the Monzo team, who clearly have a solid oncall and incident mitigation strategy in place. They detected the outage within 4 minutes, built a workaround as best they could and deployed it in 2 hours, while it took the gateway provider 9 hours just to mitigate the change that caused the issue in the first place. Granted, the issue seemed complex, but this is still slow.
Unfortunately, in cases like this, the best one can do is make sure there is a clear SLA in place with the third party, with a contract stating financial liability in case the third party fails to meet this SLA. Monzo will not tell us much about this part, but I suspect the gateway will have to pay a hefty fee to Monzo, as their availability dropped to under 99% for this month, which should trigger payments/fee reductions from the third party under a well-written contract. It is good to see they are pushing the third party to do a proper post-mortem and prevention actions, as well as holding them accountable.
Nice work!
I'm seriously impressed they were able to deploy mitigations to production twice in the same few hours, especially given they are a bank (and a small one, at that), and the consequences of fucking up are enormous.
It's been said here many times already, but I'll join those saying "well done" for handling this so well, and for the extraordinary level of transparency!