It takes _effort_ to make it work this smoothly now, _and in the future_.
SRE is about _preventing_ issues. Not mopping up after them.
To me, the article read like every successful sysadmin story: there are no fires, so the sysadmin must be bloat.
There are probably many such components; I'd imagine SRE alone would be 200+ people.
How many of the remaining staff have the knowledge required to keep all of those components running smoothly?
You might have 200 different apps (hell, we have close to that, with only 3 people in ops), but a competent team will make sure they all deploy the same way and are monitored the same way.
And once you go from "a server" to multiple servers, whether the end number is 20 or 200 isn't that important until you start hitting, say, switching capacity, and if you're in the cloud that's usually not your concern anyway.
Our biggest site (about a dozen million users, a bunch of services and caching underneath, a few Gbit/s of traffic) took zero actual maintenance for 2022; "it just works", and any work was implementing new stuff. It took some time to get to that state, but once you do, aside from hardware failures it "runs itself".
The SRE manager is in charge of keeping it all running. He isn't running around the world swapping out servers. He also isn't sitting back with his feet up thinking "All done - now how are my Pokemon doing?"
It's a dynamic process with quality monitoring, budgeting and reports, post-mortems, continual experiments to see if uptime can be improved, and redesigns as hardware and software change.
It's part of the backend, but is only loosely coupled to the content management and delivery system, the ad machine, moderation, marketing, and so on, all of which are going to have similarly complex structures.
Your car is working perfectly fine so why should you pay for maintenance?
In an organization of any appreciable size, things change all the time, and I'm not just talking about code (for which you could declare a code freeze in an emergency situation like this): the external systems you're connected to can change for reasons completely out of your control. Content changes can break stuff because of bugs in your code. Legacy systems can require all sorts of ongoing tweaking and maintenance. And, yes, heat can break your software if the server it's running on overheats.
Twitter is not a PalmOS app.
Your PalmOS app doesn't run on any modern hardware except under emulation. (Which is sad, I loved my Centro and held onto it for as long as I could.) The last release of PalmOS was in 2007, 15 years ago. Most hardware from that long ago is dead, and thus your software is dead too, broken down by the entropy of its hardware.
How many SSL certificates (internal or external) need re-issuing per month? Some of that can be automated, but in an organization as large and complex as Twitter, some will be bespoke and manual, and a code freeze won't stop the clock.
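The automatable part of that is essentially a recurring "is this cert due?" check. A minimal sketch of that logic, where the 30-day renewal window and the openssl-style `notAfter` date format are assumptions for illustration, not anyone's actual policy:

```python
from datetime import datetime, timezone

# Assumed policy: flag certs for renewal within 30 days of expiry.
RENEW_WINDOW_DAYS = 30

def needs_renewal(not_after, now=None):
    """Return True if a cert is due for renewal.

    not_after: expiry in the format `openssl x509 -enddate` prints,
    e.g. 'Nov 18 12:00:00 2025 GMT'.
    """
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days < RENEW_WINDOW_DAYS
```

In a fleet the size of Twitter's you'd run something like this from inventory on a schedule; the long tail of bespoke internal certs is exactly the part that doesn't fit the loop and still needs a human.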
How many new CVEs per month apply to Twitter's services and tooling? How many race conditions or other bugs are lurking, just waiting for the right time or traffic pattern to emerge? Twitter can't freeze inbound traffic without dire consequences.
Twitter is like your car, except that it's always running.