This was all for a weather radar app, and you are correct, there really weren't any SLAs, but we had to handle very high loads. We did make use of cloud services for some pieces of the system (there was a database and a small API for some minor bookkeeping, mostly around users), and I included those costs in my estimate of monthly expenses. We had lots of caches, for all our JSON and for things like user authentication, which spared us from having to do much serious work on the database side. The caches were typically push-based, so user requests never hit the disk if we could help it.
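To give a flavor of the push-based part, here's a rough sketch in Go (names and payloads are made up): the ingest pipeline pushes fresh results into the cache as soon as they're produced, so a user request is only ever a map lookup, never a trip to the database.

    package main

    import (
        "fmt"
        "sync"
    )

    // Cache sketches the push model: writers fill it proactively,
    // readers never fall through to a database or disk on a miss.
    type Cache struct {
        mu    sync.RWMutex
        items map[string][]byte
    }

    func NewCache() *Cache {
        return &Cache{items: make(map[string][]byte)}
    }

    // Push is called by the data pipeline, never by user requests.
    func (c *Cache) Push(key string, body []byte) {
        c.mu.Lock()
        c.items[key] = body
        c.mu.Unlock()
    }

    // Get is the only path a user request touches: a map lookup.
    func (c *Cache) Get(key string) ([]byte, bool) {
        c.mu.RLock()
        body, ok := c.items[key]
        c.mu.RUnlock()
        return body, ok
    }

    func main() {
        c := NewCache()
        c.Push("auth:user123", []byte(`{"ok":true}`)) // pipeline side
        if body, ok := c.Get("auth:user123"); ok {    // request side
            fmt.Println(string(body))
        }
    }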
The vast majority of requests were for those images though, and those required moving lots of unwieldy geographic data into the GPUs to render map tiles (at high resolution and deep zoom levels, too), so requests were still somewhat costly to serve even when they didn't touch a database. We were able to get away with a small footprint in the datacenter by making heavy use of CDN caching. Cache lifetimes for the latest weather images were often measured in seconds, and getting those timings right was crucial: a botched cache lifetime would rapidly swamp the system with requests, though the software was good at keeping latency low under heavy load and degrading gracefully.

In fact, the vast majority of bandwidth usage in the datacenters was actually not requests, but streaming geographic data from various government sources. We regularly had 50-100 MB/s coming in, and we stored all of it in memory. The GPU machines had 100-200 GB of memory, and we used all of it. We had to cycle through that memory pretty rapidly as well, so keeping allocations low and freeing memory on time was important.
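On the cache-lifetime point: the real values lived in CDN/nginx config and were tuned carefully, but the effect is easy to sketch in a few lines of Go (the numbers here are illustrative, not what we shipped). The freshest frame gets a lifetime of seconds so the CDN re-fetches it quickly; anything historical never changes and can be cached for a long time.

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func tileHandler(w http.ResponseWriter, r *http.Request) {
        if r.URL.Query().Get("frame") == "latest" {
            // The newest frame goes stale in seconds. Too long and the
            // CDN serves outdated weather; too short and the origin gets
            // swamped with requests.
            w.Header().Set("Cache-Control", "public, s-maxage=5, stale-while-revalidate=5")
        } else {
            // Historical frames never change, so let the CDN keep them.
            w.Header().Set("Cache-Control", "public, max-age=86400, immutable")
        }
        fmt.Fprint(w, "...tile bytes...")
    }

    func main() {
        http.HandleFunc("/tiles/", tileHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }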
It may not sound like we had much redundancy, but with all the caches, and each machine being quite powerful, it was better than it sounds in that regard. We often took machines in and out of nginx. The way the graceful degradation worked, we would prioritize imagery at the wider zoom levels (more zoomed out), so the worst that would happen on a typical day is that some very zoomed-in images, in places few people were looking, might be slow or time out.
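Our actual mechanism was prioritization rather than outright rejection, but one simple way to get that shape of behavior looks like this (a Go sketch with made-up numbers, using the usual web-map convention where a bigger zoom number means more zoomed in):

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "strconv"
        "sync/atomic"
    )

    var inFlight int64

    const (
        maxInFlight   = 512 // illustrative capacity ceiling
        protectedZoom = 8   // tiles at or below this zoom are always served
    )

    func tile(w http.ResponseWriter, r *http.Request) {
        zoom, _ := strconv.Atoi(r.URL.Query().Get("z"))
        n := atomic.AddInt64(&inFlight, 1)
        defer atomic.AddInt64(&inFlight, -1)

        // Over capacity: shed the deep-zoom tiles first and keep the
        // wide views, the ones most people are looking at, fast.
        if n > maxInFlight && zoom > protectedZoom {
            w.Header().Set("Retry-After", "2")
            http.Error(w, "busy", http.StatusServiceUnavailable)
            return
        }
        fmt.Fprintf(w, "tile at zoom %d\n", zoom)
    }

    func main() {
        http.HandleFunc("/tile", tile)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }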
So, in the end, you are correct, the situations are different. The bank had to store things for a lot longer, and had to uphold more stringent SLAs and the like. That said, I still think they were flushing a lot of cash down the toilet and overcomplicating things :).