Monitoring 9600 banks at scale

blog.plaid.com

96 points · jeandenis · 7y ago · 13 comments

Interesting writeup. This is also a major issue for us at TradeIt (we do something similar but for stock brokers and portfolio/trading) as the brokers we integrate are not always...ahem..."robust". We've found that our upstream users really appreciate that often we can tell them about brokers' service outages before the brokers even announce it (when the brokers even bother). Sometimes the brokers don't even realize their system is malfunctioning until we poke them to ask what's going on.

Our throughput numbers are much lower and our integrations are far fewer than Plaid's, so we have been able to get away with keeping a close eye on Graphite/Grafana for spikes in request failures/timeouts. It seems like we will eventually need to implement some kind of statistical monitoring and alerting.
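One simple form that statistical alerting could take is a z-score check: compare the latest failure rate for an integration against its recent history and flag it when it sits several standard deviations above the mean. This is only a minimal sketch, not what TradeIt or Plaid actually run; the function name, the 3-sigma threshold, and the minimum sample count are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0, min_samples=10):
    """Return True if `current` is unusually high compared to `history`.

    `history` is a list of recent failure rates (e.g. one per minute) and
    `current` is the newest reading. All parameter names and defaults are
    hypothetical, picked for this example only.
    """
    # Too little history to estimate a baseline: stay quiet.
    if len(history) < min_samples:
        return False

    mu = mean(history)
    sigma = stdev(history)

    # A perfectly flat baseline has no spread, so any increase stands out.
    if sigma == 0:
        return current > mu

    # One-sided z-score: only spikes above the baseline count.
    return (current - mu) / sigma > threshold
```

With a baseline hovering around a 1-2% failure rate, a reading of 25% would be flagged while a reading of 1.6% would not. A fixed threshold like this ignores daily traffic cycles, so in practice the baseline would probably need to be per integration and per time of day.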

funkymatt · 7y ago

Grafana has that ability built in!

divxflounder · 7y ago

Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting for my org and it's really cool to see how other companies skin this cat.

I saw CloudWatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but why Amazon? With volumes like yours, your scale will eventually hit the point where your cost skyrockets.

Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into a 50th, 95th, and 99th percentile in your Grafana graphs. This will give you a solid idea of not only what your customers experience on average, but edge cases as well.
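To make the percentile split concrete, here is a small sketch that computes the 50th, 95th, and 99th percentiles from raw latency samples using the nearest-rank method. It is an illustration only: Grafana and Prometheus normally derive percentiles from histogram buckets rather than raw samples, and the function names here are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(samples):
    """Return the three percentiles suggested above as a dict."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }
```

For example, with 98 requests at 100 ms and 2 requests at 5000 ms, the p50 and p95 are both 100 ms while the p99 is 5000 ms. The mean of the same data is 198 ms, which describes no request anyone actually experienced.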

Do you have a regular forum for reviewing these metrics and pre-solving problems? We're still trying to solve this across multiple teams where I work, and have noticed that some teams are great at it while other teams are a little more reactive.

Love to see this stuff :)

jeeyoungk · 7y ago

One of the authors here. Thanks for enjoying the article!

Re: AWS. We're not at a point where we are overburdened by AWS spending. Many things are more efficient with AWS, as we have a fairly small engineering team. We use various AWS products (Aurora and Kinesis, to name a few).

Regarding metrics & percentiles - Yes, I agree. The 99th percentile is what we try to look at the most, as most other metrics tend to be deceiving.

Regular forums - This is something that we need to improve on as we move forward. The blog post mostly describes the infrastructure we've built, but it takes time and effort to become a metric-driven organization.

Terretta · 7y ago

Pretty unofficial here, but I prefer engineering channel to biz dev channel... Drop me a note, loop in whoever would be interested? I’ve been meaning to get our companies better acquainted — your fantastic write up reminded me.

syastrov · 7y ago

Nice write up. I love reading these kinds of postmortems.

Unlike a lot of those I read, it sounds like you actually set out with a good set of requirements and really understood the problem.

I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.

joyzheng · 7y ago

One of the blog authors here -- thanks!

> I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.

Yep, we started out with a pretty simple Prometheus setup too (two instances scraping the same metrics, just for redundancy) but have been adding federated instances and doing some pre-aggregation to scale. The nice part is that we've been able to do it pretty gradually by updating the config (e.g. splitting out one bucket of metrics at a time into a separate node for scraping).

tigre100 · 7y ago

We took a similar journey with Prometheus @ Improbable. We found federation to have its limits & wanted a global query view as well as a few other nice features: https://improbable.io/games/blog/thanos-prometheus-at-scale

lordxenu · 7y ago

How do you get the data from banks? Are you scraping the webpage after the user logs in? Not many banks I know of have public APIs.

throwawaymath · 7y ago

Yes, for any bank that doesn't provide them with API access they're scraping the login pages. They even do this for banks which implement anti-scraping measures.

Rainymood · 7y ago

How do you guys handle user log-in credentials? I mean, you're basically logging into their bank, right?

wbh1 · 7y ago

Really enjoyed this write-up. I'm currently in the process of scaling out a Prometheus-based replacement for an old Nagios setup that was scaled to its limit, and posts like this just make me that much more excited for Prometheus as a technology.

beamatronic · 7y ago

With that many integrations, some small set must be broken at any given time. How do you handle this without scaling a support staff accordingly?