Our throughput is much lower and our integrations far fewer than Plaid's, so we have been able to get away with keeping a close eye on Graphite/Grafana for spikes in request failures and timeouts. It seems like we will eventually need to implement some kind of statistical monitoring and alerting.
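For what it's worth, here is a minimal sketch of what "statistical" alerting could look like at its simplest: flag a reading that sits several standard deviations above recent history. The function name `is_spike`, the threshold of 3, and the sample numbers are all made up for illustration, not anything Plaid or we actually run.

```python
from statistics import mean, stdev

def is_spike(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` sample standard
    deviations above the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any increase at all counts as a spike.
        return current > mu
    return (current - mu) / sigma > threshold

# Hypothetical request failures per minute over the last few minutes.
failures_per_minute = [2, 3, 1, 2, 4, 3, 2, 3]

normal_minute = is_spike(failures_per_minute, 3)   # within usual range
bad_minute = is_spike(failures_per_minute, 20)     # far above usual range
```

A real setup would need to deal with seasonality and low-traffic periods, which a plain z-score ignores.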
I saw CloudWatch in the pipeline, which is an Amazon product. I know this is going to be a very controversial statement, but why Amazon? With volumes like yours, you will eventually hit a scale where your costs skyrocket.
Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into 50th, 95th, and 99th percentiles in your Grafana graphs. This will give you a solid idea of not only what your typical customer experiences, but the edge cases as well.
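To illustrate why the split matters, here is a rough sketch with made-up latency numbers. The `percentile` helper uses the nearest-rank method, which is only one of several definitions and not necessarily what Grafana or your metrics backend computes.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [20] * 90 + [40] * 5 + [300] * 4 + [5000]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(summary)  # {'p50': 20, 'p95': 40, 'p99': 300}
print(mean_ms)  # 82.0
```

With these numbers the mean (82 ms) describes almost nobody: the typical request takes 20 ms, while the slowest few percent take 300 ms or more.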
Do you have a regular forum for reviewing these metrics and pre-solving problems? We're still trying to solve this across multiple teams where I work, and have noticed that some teams are great at it while others are a little more reactive.
Love to see this stuff :)
Re: AWS. We're not at a point where we are overburdened by AWS spending. Many things are more efficient with AWS, as we have a fairly small engineering team. We use a variety of AWS products (Aurora and Kinesis, to name a few).
Regarding metrics & percentiles - Yes, I agree. The 99th percentile is what we try to look at the most, as most other metrics tend to be deceiving.
Regular forums - This is something that we need to improve on as we move forward. The blog post mostly describes the infrastructure we've built, but it takes time and effort to become a metric-driven organization.
Unlike a lot of those I read, it sounds like you actually set out with a good set of requirements and really understood the problem.
I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.
> I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.
Yep, we started out with a pretty simple Prometheus setup too (two instances scraping the same metrics, just for redundancy), but have been adding federated instances and doing some pre-aggregation to scale. The nice part is that we've been able to do it pretty gradually by updating the config (e.g. splitting out one bucket of metrics at a time into a separate node for scraping).