I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me. Which is much more pleasant than an adversarial environment.
Also: I'm not the only one that knows how it works, it's been peer reviewed in no small part to reduce my bus factor. All documentation requested is perfectly reasonable, and should be part of the organization's standard operating procedure.
If it's not part of the SOP, then no, you won't have those things. You need to work at a cultural level to change that, and for that you're much better off making allies than anything else. Make it clear how those things help you, and what you'll do to make the developers' lives easier when you don't need to worry about the basics. If altruism fails you, you can usually count on people to act in their own best interests.
SRE here.
My takeaway from this is: If you want SRE support running this service, then you need to provide SREs with knowledge of how the system works. As long as only the devs have this knowledge, it's a bit unfair to put the SREs on the hook for supporting it.
Maybe I'm reading between the lines too much--the wording in the article is sloppy at best, and at worst, it doesn't actually say what I'm saying.
It's nice that your code has been through peer review and other people on your team know how it works too. That's less helpful for the SREs running it. SREs bear the burden of the pager--sometimes getting woken up at odd hours of the night to fix problems that were, in a sense, created by developers.
The SOP for getting SRE support for new services should include things like runbooks and design reviews. SREs should be in the loop when you figure out what metrics to expose from your service, because SREs will be the ones using those metrics to figure out the alerting systems. Very few companies have decent "SOP" for SRE support--there are a few companies which are really good at it, like Google, and then a long tail of companies which dump services on SREs without including SREs in the process.
IMO--the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service, barring exceptional circumstances. There's a deeper discussion to be had about why this should be the case--basically, devs and SREs have different incentives, and neither team should be put in a subordinate position to the other, because both teams have goals that support the business.
That was a tenet in the original Google SRE material. SREs help operate well-behaving services using engineering best-practices. Services that fail to behave well and bust their error and support budgets repeatedly go back to the teams that wrote them.
Specifically, if things are not "working", I expect the developers to understand how their code works and what it needs to function properly. I'm constantly surprised by how much developers don't know about the app they write code for.
I'm not asking for your intellectual capital because I want to sue you. I want to understand how the app comes up.
The main problem I'm usually trying to solve is this: apps are just packages of stuff. If I can deliver 1000 packages with no problem but one doesn't deliver, is it the package, the address it got sent to, or is this some brand-new package requiring a different way of handling? I need the package sender to explain to me what it contains.
On another note, @klodolph, if you are getting paged a lot, then your SRE needs to improve. Perhaps you were slightly exaggerating, but I consider any escalation to SRE personnel a failure on the SRE side. It's kind of a brutal metric to follow, escalations as close to 0 as possible. An interesting thing that happens if you try hunting for it is you will realize 80% of your calls come from 1 or 2 things. Addressing them will make developers happier and SREs happier.
Build telemetry and convergence into well known platforms to make response easier.
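A rough sketch of that hunt, assuming you can export your pages as a flat list of cause labels (the labels and the `top_page_causes` name are made up for illustration): count them and keep the most frequent causes until they cover 80% of the pages.

```python
from collections import Counter


def top_page_causes(pages, coverage=0.8):
    """Return the most frequent causes that together cover `coverage` of all pages."""
    counts = Counter(pages)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, n in counts.most_common():
        if covered >= coverage * total:
            break
        selected.append((cause, n))
        covered += n
    return selected
```

If the returned list is only one or two entries long, those are the things worth fixing first.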
The author is using ownership as a tool to avoid responsibility, and is thus creating an `us vs them` mindset rather than an `us vs the problem` mindset.
Having a strong definition of ownership (like committing your organizational structure to your monorepo as a config file) is invaluable for building tooling.
If you have a strong definition of code ownership it allows for things like people less familiar with a particular piece of code being able to make changes with the approval of the owners, while simultaneously notifying them of the change.
Likewise, if you are working on a platform that multiple teams use, you can write tooling that automatically assigns bugs or tickets.
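As a minimal sketch of that kind of tooling, assume ownership is committed as a plain mapping from path prefix to team. The `OWNERS` table, the team names and both functions are hypothetical, not any particular company's format.

```python
# Hypothetical ownership config: path prefix -> owning team.
OWNERS = {
    "services/billing/": "team-billing",
    "services/billing/invoices/": "team-invoicing",
    "platform/": "team-platform",
}


def owner_for(path, owners=OWNERS, default="team-unowned"):
    """Return the team owning `path`, using the longest matching prefix."""
    matches = [prefix for prefix in owners if path.startswith(prefix)]
    if not matches:
        return default
    return owners[max(matches, key=len)]


def route_ticket(title, path):
    """Build a ticket record assigned to the owners of the affected path."""
    return {"title": title, "path": path, "assignee": owner_for(path)}
```

The same lookup could drive review requests: a change touching `services/billing/` by someone outside the team would need approval from whoever `owner_for` returns.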
Ownership problems and "us vs them" is a clear sign of poor leadership. Most devs that experience it become cynical or hostile without being able to understand that it is leadership that failed them.
Having a strong ownership can become toxic very quickly.
It is really important to understand that toxicity and hostility are a function of leadership (or lack of it).
I highly recommend the book Extreme Ownership. That book explains the mechanics of toxic environments.
If you are a dev on the team that owns that service, then it's your and your team's responsibility to answer all of these questions... Even the org's SOP would end up reaching back to the team that owns the service if problems arise...
Far too frequently you end up in a situation where someone makes an environment change and blows everything up because they have no understanding of the services they're stewarding.
If you want me to take responsibility, my team should be managing the service end to end.
I feel really strongly against this division of responsibility in software teams. It too often leads to holding up progress and hostile interactions due to each team pursuing their own priorities.
> How can I check the health of the service?
In the definition of service, you define a field for
health check script.
> How can I safely and gracefully restart the service?
This will exist within the script used to push new code.
> Does it has any external dependencies?
This could be defined in the service configuration and
used for setting up integration tests and automatically
generating a dependency dashboard.
> Do you have a playbook, or sequence of steps, to bring
the service back up?
You could generate a field in the service definition to
automatically generate a dashboard and include the
playbook link at the top of the page.
> Do you use appropriate logging levels depending on the
environments?
Production could be extremely opinionated about what
acceptable logging looks like, forced via code review. Log
level could be defined in service config.
> Are you logging to stdout?
Why would any production service get to choose?
Service owners shouldn't be able to log into machines.
> Are you measuring the RED signals?
Required fields in service config that could be used to
generate a service dashboard.
> Is there any documentation/design specification for the
service?
Required config field.
> Are you using gRPC or REST?
Trivial grep.
> How does the data flow through the service?
This is complicated, but can probably be easily replaced
by asking what state your service keeps and how it's
stored. This is the only question I think the author
should/needs to ask.
> Do you have any PII/Sensitive data flowing through the
service?
While this question is important, this is one of the
problems that has to be a particular person's
responsibility. Any dev that answers anything but
"probably not, but I don't know" shouldn't be trusted.
> What is the testing coverage for this service?
Some form of this would exist in a service config.
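To make the "required config field" idea above concrete, here is a small sketch under stated assumptions: the `ServiceDefinition` class and every field name in it are invented for illustration, not taken from any real platform. It rejects blank required fields and renders a dashboard header with the playbook link on top.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ServiceDefinition:
    """Hypothetical service config; every field name here is an assumption."""
    name: str
    health_check_script: str
    playbook_url: str
    design_doc_url: str
    log_level: str = "INFO"
    dependencies: list = field(default_factory=list)


def missing_fields(definition):
    """Names of text fields that were left blank."""
    return [
        f.name
        for f in fields(definition)
        if isinstance(getattr(definition, f.name), str)
        and not getattr(definition, f.name).strip()
    ]


def dashboard_header(definition):
    """Render the top of a generated dashboard, playbook link first."""
    deps = ", ".join(definition.dependencies) or "none"
    return (
        f"Playbook: {definition.playbook_url}\n"
        f"Service: {definition.name}\n"
        f"Depends on: {deps}"
    )
```

A deploy pipeline could refuse to ship anything for which `missing_fields` returns a non-empty list, which moves most of the quoted questions from a conversation into a check.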
I don't think the question of responsibility is as simple as "it's the team's problem."
I’m not sure I follow what it is.
A technical construct, like a code template or an API that services implement?
Or a process construct, like an SOP to follow with checkboxes?
Thanks
GDPR makes it the responsibility of the organisation to know. You can't safely say "I don't know" about PII.
Code that is not running correctly in production is worthless. If you write code and haven’t thought about all of the implications that it takes to make it run correctly you haven’t produced business value.
Yes, I have been a developer for 25+ years professionally. But for the last 10, I’ve also thought through all of the topics that the author has delineated in his article.
Yes I consider myself to be a competent “DevOps” engineer as long as it is on AWS and can go from empty AWS account to a fully functional infrastructure using IAC, a CI/CD pipeline, monitoring, alerting, centralized logging etc.
Knowing that I will either be the person doing the “DevOps” or working with the person who is informs the design of my development.
I wish this were true, but the number of billion dollar companies running code that doesn't run correctly shows it isn't.
Likewise error budgets that say "your service has not been reliable enough this period so the next sprint will be dedicated to improving that and no new features will ship" are another way to make sure quality is not an afterthought.
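For anyone who hasn't seen the arithmetic, here is a minimal sketch of an error budget, assuming a request-based SLO (the function names and the freeze rule are illustrative, not a specific company's policy). A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures; once more than that have failed, the budget is spent.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates in the period."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left; negative means it is overspent."""
    budget = error_budget(slo, total_requests)
    return (budget - failed_requests) / budget


def feature_freeze(slo, total_requests, failed_requests):
    """True when the budget is exhausted and reliability work takes over."""
    return budget_remaining(slo, total_requests, failed_requests) <= 0
```

With these definitions, 1,500 failures against that SLO leaves the budget at about -50%, so the next sprint goes to reliability rather than features.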
I understand this, this is what I was trying to direct my distaste at. Unfortunately, it seems to have been misconstrued by responders.
> you own and are responsible for the part of that service delivery
No, I'm just responsible for that part of the service delivery. I own none of it.
I object to it in the same way that I object to hearing "we're a family" from leadership.
It's either a DevOps culture or a classical Dev + SysAdmin setup with clear boundaries and where a developed application is being thrown over the fence to SysAdmins. The latter will always result in mutual animosities between the two parties.
No, it was supposed to be dev and ops collaborating with mutual respect and empathy.
Although it's sane that developers keep contact with the reality of maintaining the live applications they wrote, it just doesn't scale to ask them to fully support them.
There is an infinite amount of maintenance for any live system. No service of any magnitude just "works" in production indefinitely, at the very least because this service interacts with others that will fail.
If developers are responsible for every live system they publish, they will get locked up in maintenance after publishing a finite number of services, and leave out of maintenance boredom.
There needs to be a reasonable amount of documentation written, explanations given, level 2 support taken, but that's it, the maintenance is for ops.
And look, infra people are harder to come by and more expensive than devs so no company wants to waste our time bumping dependencies and fixing non-infra related bugs. If you’re big enough to have a team or teams of maintainers then they would go there. If not then it’s ship it, maintain it.
As such any team should ensure integrity of its code regardless of people coming in and out. Hiring, levels, code reviews etc. help this. A bunch of grads with no source control who write code directly in production without review would be the opposite.
SRE is a hat not a title or a team at small companies so this is how it has to be done. Also AWS probably works this way from what I have read. The team is like a mini business.
Eventually the developers just gave up on all development work and became operations. The actual operations team kept the k8s cluster alive. Developers had to do everything else.
Eventually people would get the hang of the operations side but by that point they were burnt out and quit.
Isn't this "adversarial" as well? Why would you withhold that information just because the SOP doesn't make you provide it? What will happen then is that eventually the service will break, nobody will know how to fix it, and they will come and ask you.
If you're no longer there, the service will be decommissioned and all your work will have been in vain. I don't see a net benefit for any of this.
I was trying to convey that they wouldn't exist because it hadn't been asked for. People aren't mind readers, ops has to vocalize their needs. Which is what the post was doing. And I don't think those needs are unreasonable, either. My point was more that the work to make sure those things are available needs to happen higher up than you'll be able to reach by talking to any individual developer.
(I was mostly in agreement with the OP and was surprised by the level of opposition to it in the comments.)
Yes, it's better if the information asked for in the original post is in the SOP, but if it's not, then asking for it each time somebody sends something to production, while suboptimal, is better than nothing IMHO.
These kinds of responsibilities create this weird scenario now where the team's SRE is the team's babysitter. Which just leads to the ops vs dev bullshit we've seen before. Toxic right off the bat.
Someone has to enforce those good practices. Weak engineers hire more weak engineers and they suck at their job.
I love meeting people from other cultures and backgrounds and experiences. It’s great to get to know each other. I perceive you to be one of those people for me.
See, in my travels, weak engineers usually have no/little say in anything of import, except the yucky boring maths stuff others don’t want to/can’t do. Hiring a personality is something that our management types love to get involved in. They hire for all the wrong reasons. They retain people for all of the wrong reasons. They promote for all the wrong reasons. We have a “hiring as a service” institution here, often known as HR, that manages to meddle with things.
It’s not that any of these bodies hire weak or strong engineers for any malicious purpose (Occam’s razor, ya know), it’s just completely arbitrary for us and you kind of just learn to put up with and cope with the chaos.
Where you’re from, is there a sort of settling function where after a while, all the weak engineers have co-hired all of the weak engineers? Do the strong engineers hire the other strong engineers?
I'm interested in your experience as well. Even at the largest US orgs there's still the concept of a 'hiring manager', who leads a small team, has a big role in hiring, and can do meaningful technical work on the system. So the interviewing ability of these people, and the people that they trust, is the main determinant of the engineering hiring decisions that get made. Like a common interview loop would be hiring manager, (2x)senior/staff dev on the team, senior dev on another team, director. Lots of exceptions and opportunities for random consultants/reorganizations from above, but the basic idea of line employees and managers being able to identify their successors is pretty baked in around here. Can't speak for the east coast though.
In the "not so well" scenario, the company ends up with weak engineers that don't have a lot of experience. They might have 20 years of making marketing websites at a website mill with Drupal or maybe even Django, but zero experience in software development. Those people either don't know what a good engineer is or they are afraid to hire someone noticeably better than them - so the team ends up stuffed with not so good engineers.
Maybe by luck they hire a good engineer eventually, but a good engineer will look at all that mess and will bounce really quick unless the pay is worth it. As an anecdote, I worked with a very good engineer recently, but her team was weak - on her 1:1 she was told not to be so strict during her PR reviews. The reviews she was making, unlike mine, were very polite and people still complained about her being too strict (she wasn't; it was the bare minimum).
When a strong engineer is interviewing a candidate, they know what to look for and are willing to look the other way when it comes to personality (that sometimes ends up in a toxic engineering team culture, but that's another issue). I also worked with strong engineers that would refuse to hire engineers that are better than them (it went unnoticed for some time).
A lot of SRE complaints are coming because they have to babysit weak engineers and hold their hands. Management thinks that the solution to this is to hire junior SREs to deal with it, but the real solution is to reduce need for babysitting. For context, I'm a sysadmin that ventured into software development and now doing SRE work and honestly, there are a lot of days when I wish to go back to software development.
if you can hire a good sre that knows all this stuff, then you should be able to identify the skill your teams are lacking (it'll be obvious because things are shipping slow and breaking often). use the same skill you used to hire an sre to hire a good swe
it sounds like you're implying there has to be some weird relationship where developers can't be trusted to debug and need to be babysat by the grand sre that knows and watches over everything or something. this is the toxic nonsense that exists.
I've met CTOs that would agree with you. I no longer work with any of them.
Truthfully, oftentimes I don't understand how things behave in a production environment.
I suggest asking about these things in the hiring process.
The questions are actually to prevent that. They are making sure somebody can possibly diagnose and fix a problem that involves[1] this service without having to call you up.
[1] Note "involves" doesn't mean your service is the cause, it might be the previous or next one in the chain.
The questions are great for sure.
And thus you'll have an incentive to improve the quality and lower the pain. Of course that's easier said than done, and there may be other factors at play (e.g. priorities set by someone else), but at least if the misery is shared those who can do the most to fix it are fully aware of it.
Yes? What, you want them to keep the pain even though they don't cause it?
I haven't read the SRE book, but my understanding was that at Google the answer to all this would be that the SRE would act as a software developer and submit pull requests to the codebase in order to implement/fix all of this?
> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.
And my own take on this statement which is getting so much traction in the comments is that this seems largely indistinguishable from the wall between Dev and Ops that we had back in the late 90s.
I don't necessarily disagree with everything the author writes by the way. There are a lot of good points in there about building things to be operational, but at the same time, what good is an operations department if it can't actually operate its systems? I know the answer to this often becomes buying third-party software or "standard" systems, but as more and more businesses are realizing, that's often worse than simply using a lot of interconnected Excel sheets (which doesn't really scale).
It'll be interesting to see what happens once the newer waves of project managers and IT-business partners learn from the successes of companies that build in-house software instead of going to "standard" systems where they'd need the in-house developers anyway to make the ungodly number of APIs and data transfers work. At least here in Denmark, companies like Lego and Vestas are doing some really groundbreaking money making at a much lower cost, by not going to "standard" systems for everything. Not that you should never use standard systems, there are some things that are shared across businesses after all, but there are typically also a lot of things that just won't fit into some internationally shared box well enough for it to work out as a net bonus.
Agreed, integration can be a challenge, but the key here is good enterprise IT architecture and avoiding SaaS sprawl.
Do you have a link to an article about this somewhere?
To me, the worst SREs are the folks who come from the DevOps side whose experience is limited to pipelines and infrastructure as code type stuff. They invent solutions that just don’t work.
In my experience DevOps coming from a sysadmin background end up poisoning the well. They’re afraid of developing the right kind of abstractions, block any proposals to do these and before you blink you end up with a mediocre VendorOps team that can do nothing but integrate off the shelf solutions with an unmaintainable mess of yaml / hcl “code”.
They will replicate, the more capable DevOps engineers will leave in frustration and your platform will be taken hostage. Don’t let a single one in.
Random stereotypes might be funny but they are not useful in getting stuff done.
Although in my experience most devs would rather see their code all the way to production; the problem is their line manager wants them to tick off the ticket and move on to the next one as early as possible.
Friend, I have a whole laundry list of issues I have with devs, but this post isn't "things devs want from devops".
> Random stereotypes might be funny but they are not useful in getting stuff done.
You say "stereotype"... I say cynicism, which is ultimately what qualified me for the SRE role.
- What specs of a VM do you require?
I'll assume that 16mb of RAM and 512mb of drive space running Slackware is suitable, operating from a 1.44mb floppy.
- What do I do if it doesn't compile?
It works in DevLand so I assume it'll work anywhere. No, you can't growl at me, you asked for Linux and I gave you Linux. Documentation please.
What are my options? How much do those cost the company?
I can run N requests on my laptop with specs of X. CPU usage hits 100% and memory usage hits 50%. This is 10% of our projected load. Therefore I project that I will need 10X CPU and 5X memory resources for this system for our entire userbase. My laptop is pretty powerful, but there are systems with 320GB of RAM and 80 cores. Do we get 2 of them?
Oh, what you have is a bunch of 3 generation old 28 core Xeons with 128gb of RAM. Can I get access to one of them to test throughput? One Xeon core is not as good as one MacBook core, especially not that old, so I really can't make any promises of perf without testing on representative hardware. No? Fine, let's add 30% buffer for poor IPC on old hardware.
Oh, you've got a custom wrapper for Java that launches it with a bunch of custom JVM options that the SRE org mandates everyone uses? Got any documentation on what those are?
What's our lead time for getting more? We need to know that for an idea of how much buffer we need.
So apparently lead time is six months? Ok, what's our load going to be in 6 months? I'll ask bizdev how many customers they think we'll have in six months. Oh, they answered with a vague "We want to have lots of users". Fine, we'll add an extra 100% buffer.
What, the operational efficiency team are pissed our VMs are at 25% usage?
While there are certainly dev teams that throw capacity management over the wall to SRE, the inverse is also nonsense as it's the SRE team who usually get informed of hardware options, deployment standards (especially in bigco) and company operational standards.
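The estimate in that back-and-forth can be written down as a few lines of arithmetic. This is only a sketch: the function names are made up, and applying the 30% IPC buffer to CPU only while applying the 100% growth buffer to both CPU and memory is my assumption about what the commenter meant.

```python
def scale_to_full_load(usage_fraction, load_fraction):
    """Resources needed for 100% of load, in units of the test machine."""
    return usage_fraction / load_fraction


def with_buffers(base, *buffers):
    """Apply each fractional buffer multiplicatively (0.30 means +30%)."""
    total = base
    for extra in buffers:
        total *= 1 + extra
    return total


# The laptop served 10% of projected load at 100% CPU and 50% memory.
cpu_base = scale_to_full_load(1.00, 0.10)
mem_base = scale_to_full_load(0.50, 0.10)

# Assumed: IPC buffer (30%) on CPU only, growth buffer (100%) on both.
cpu_needed = with_buffers(cpu_base, 0.30, 1.00)
mem_needed = with_buffers(mem_base, 1.00)

# Utilization you would see if load lands exactly on today's projection.
cpu_utilization_at_projection = cpu_base / cpu_needed
```

Because the buffers multiply (1.3 x 2 = 2.6), CPU utilization tops out at roughly 38% even when the projection is hit exactly, so low utilization numbers follow directly from stacking buffers on top of uncertain inputs.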
What options do you want? You come to me with the specifications, and I'll get back to you if you can. I'll work with you to get what you want.
If I am the one with the power to create, the gatekeeper to the virtualization cluster, then having a ballpark figure makes my life so much easier. It allows me to make my justification easier too. "Hi Manager, X needs this. I think it needs this. I'm going to set this up and evaluate. No, manager, I don't think you're correct."
"Here's my tech specification for what I'm setting up. Here's the documentation for configuration and setup, and these are the results." Let's ride.
Let me give you 8 cores and 64GB, see how the performance spikes and go from there. I'm not stingy and always happy to give more to test performance and then decrease if it's overkill. But don't dispute if I start to take away because of.
> what you have is a bunch of 3 generation old 28 core Xeons with 128gb of RAM. Can I get access to one of them to test throughput?
Sure. Yes you can. Why do you think you can't?
> Lead time
As fast as you require me to set up the VM. If I have all the docs, I can fly by and have this thing set up in under a day. Heck, I'll even work unpaid overtime to get this for you. I can escalate this on the fly. I'm in the good books with the NetOps, I even know the backup-ops. Tough crowd to please every time, but I manage.
Production lead time? Sure, probably six months to get stake-holder approval and the rest, but I'll try to get it sooner.
Prototype lead: a week.
Just give me figures and I'll do the rest. Enterprise or not. I'll push for what you want but you have to work with me. However I need the figures and documentation before I can. I can't be seen creating the documentation on my own time based on made-up figures when I have an estate of 100 VMs needing security patches. That makes me look bad and if it fails the tests, I'll get the blame.
If you make the right technology choices, most of those should also be pretty doable (not necessarily easy, but not overwhelmingly hard):
- enable some basic healthchecks through a configuration value in Spring or whatever you're using
- then do curl requests against that (or even the HEALTHCHECK command with containers)
- have something standardized take care of application lifecycle (like systemd services or once again containers)
- add some instrumentation along the way with something like Sentry or Apache Skywalking
- integrate with OpenAPI through whatever framework you use (so you don't have to write everything manually, but can use codegen)
- don't be lazy and write tests for your code, at least the parts that are easy to test (e.g. business logic, not the low level JSON serialization)
- hopefully have a few Markdown files that describe your service
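To illustrate the healthcheck items in the list above, here is a minimal sketch of the "curl against a health endpoint" idea using only Python's standard library. The `is_healthy` name is made up, and whatever URL you pass in depends entirely on how your framework exposes its health endpoint.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if `url` answers with an HTTP 2xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers 4xx/5xx responses, refused connections and timeouts.
        return False
```

Something like `is_healthy("http://localhost:8080/health")` (a placeholder address) can then be called from a deploy script or a cron job, and anything other than `True` is treated as "do not send traffic here".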
Of course, those aren't strictly necessary for things to run (hence a lot might be called day 2 concerns), which many will use as an excuse not to care about it, as long as they can get something shipped and it seems to work now. I've seen that more often than I'd like, given that I'm sometimes the person who gets called in to fix the eventual issues.
An excellent group of suggestions for developing applications in ways that can limit headaches is the 12 Factor App site, which can help you avoid some issues ahead of time (it covers configuration, ports, logs and other concerns): https://12factor.net/
Ideally, everyone who has the applicable skills or knowledge would collaborate and work together towards having their software both work now and also keep working, with insights into potential problems ahead of time. And if your technology stacks are sufficiently boring, there's no reason why a lot of that knowledge couldn't be encapsulated into a few concise Wiki pages, Markdown files, code snippets or project templates - so anyone who needs to ship a new service in your org can grab one of those and get up and running quickly and properly.
However we don't live in such an ideal world. I'm given tasks such as "set this up for devs for release of X" with nothing specified. How am I supposed to complete the request? I now have to waste my own time chasing managers, devs and everybody else to set up what's required. My resources are finite, I don't have unlimited resources nor a budget to waste on virtual machines. We use a multitude of OS versions ranging from CentOS to Debian, including FreeBSD and Solaris, so what do you want?
This is SRE playing lax on the case of not defining and just expecting it. I don't have any access to the DevKit side of things, I don't have the ability to find out what is actually required.
You want it to run in production, fine. But it makes my job insane when no documentation is provided and a JIRA ticket consisting of some attachment with "this please" is handed to me. Something breaks and I am the first to get the brunt because it's "my fault". I have to waste time debugging when it turns out the software isn't designed for what I thought was suitable for the project, because someone decided to use some old version of a Ruby gem that's been cross-compiled and that I had no clue about.
No one thinks about SysAdmins when they're running the actual show. If it wasn't for them, you wouldn't have production, uptime and all the stuff we do. I can recite so many stories from many different jobs where this has been the case. But all the same, HN is biased towards developers and doesn't take into account the flow required for their product to reach production.
/end vent. But you're not allowed to do that on HN, because you get thrown downvotes for expressing issues with how the teams integrate.
16 years of SysAdmin, 33 and I'm more than burnt out because of it. The first thing I do every morning is decide the colour of the ethernet cable I wish to hang myself with. I take pride in my work, yet I'm done dealing with the shit that's given, yet you still need to lick the plate because life. I've just had three months off, just exhausted all my savings and starting my new job on Monday. But hey-ho, maybe this company has stuff right.
The cut and dry answer is: you aren't, because you cannot. What should happen is a polite response to the request along the lines of:
"Insufficient information has been given for the release. Please see the attached Markdown template with the information that needs to be filled out. This request is on hold until this is done. You can submit any suggestions to improve the template, or reach out with further questions to ..."
> This is SRE playing lax on the case of not defining and just expecting it. I don't have any access to the DevKit side of things, I don't have the ability to find out what is actually required. You want it to run in production, fine. But it makes my job insane when no documentation is provided and a JIRA ticket consisting of some attachment with "this please" is handed to me. Something breaks and I am the first to get the brunt because it's "my fault".
So essentially you have all of the responsibility, without any of the power to actually do your job because of the circumstances that you're pigeonholed into? If you need the money, then I guess that's what you have to tolerate, but otherwise some lines should be drawn somewhere. In most cases, adding documentation/instrumentation and requiring it going forwards would be a good idea that any sane organization would get behind and support: especially if you can reference all of the past incidents that this would have helped guard against.
Quite frankly, that sounds like a dysfunctional environment and absolutely nobody would fault you for quitting a year into it (or even not waiting that long), in search of something better.
I suspect that a part of the problem is that in our market, we have mindsets that don't go beyond any of the following goals:
- we got paid, regardless of what works or doesn't (can be seen in consulting)
- we shipped something to meet a deadline and not get contractual penalties, regardless of quality
- we shipped something that seems to work, though we don't care about much else (day 2 concerns)
Sometimes it's because of ignorance and not knowing better, other times it's because there are cultural issues in the country as a whole, maybe a lot of people viewing development only as a step in the path to becoming a manager, instead of a craft that demands attention and care.
So what you get globally can be companies that range anywhere from "Hey, we want you to be comfortable and not overworked: here are some learning materials or a budget for that, here's our knowledgebase and an overview of our procedures and architecture decision records (ADRs), here's a user group for this particular technology, feel free to reach out if you need anything." to "Hey, ship the software by Monday. Why isn't it done yet? I don't care about the details, get it done."
Maybe I should write a blog post about that some day, just not sure what to title it: "OKRs/KPIs of caring about software" or something like that, probably. I suspect bad mindsets and a bad culture are among the reasons for sites like this existing: https://devrant.com/ or rather why articles like this ring true: https://www.stilldrinking.org/programming-sucks
At the end of the day, do what you can to take care of yourself!
I don’t see how this is even controversial. Consider the case where an SRE is responsible for 5 or 10 such systems. They could never be expected to know as much about those systems as the people who wrote them.
Now if there is a one to one relationship between SREs and systems then it might make sense to expect that level of understanding from the SRE.
In my experience it would be a great privilege to have a dedicated SRE to your application.
The right attitude is to figure out processes that let people draw a line between when to go to DevOps and when to escalate to developers. Developers need to understand the costs they impose on DevOps, and organizations need to make sure developers are empowered to fix their own issues, rather than being constantly chased around by business requirements.
Developers ultimately answer to business priorities, and they don't necessarily own the business processes that demand their support. If developers are given ample resources to keep bugs out of systems, document operational expectations and respond to incidents, then the developers can "own" the processes better. If not, it's a management problem of the same nature as the usual SRE complaint that developers don't want to own anything at all.
They might know how to build/test/deliver/monitor some solution, they might know to some degree how to configure the solution (but developers should support them with it and describe it well), how to script some operations, however they definitely won't write the bugfix themselves.
As an SWE, I want to and need to know how to provide metrics on my system to be able to understand its health, and I should have good safeguards in place, or at least have communicated with the SREs what I need to provide to them to help them have good safeguards in place, to make sure the application keeps running. If the application goes down, it's my responsibility to make sure it's not my fault (bug in application code) that caused the system to fail.
What I, an SWE, want out of an SRE, though, is infrastructure management. I want to be able to ask them for some queues, and for a redis instance with high availability. I want them to set up the Kafka cluster, the database. I want us to have a conversation about where the secrets are to be stored. I want to be able to ask them what I need to do in code to get a secret and use it. I want them to be able to give me a good template for k8s deployments - or maybe to pair with them, given the docker containers and sidecars I need for a deployment and the projected scaling I'll need and come out with a best-practices set of k8s deployments.
I would be grateful if they monitor the database for some horrible queries; and, use their knowledge of which deployments made that bad query, to file a ticket to the right team so they fix their code or add an index or whatever is necessary.
Infrastructure, be it k8s or nomad, configuring redis, making rabbitmq highly available, configuring and organizing (especially organizing) k8s deployments into something sane and logical, and so many other things related to infrastructure are as specialized of skills as writing high-performance or unusually architected, large systems. I've seen the systems that come up when SWE-on-assignment create infrastructure; and, I've seen the literal years of work SREs have in their backlog to fix it with best practices.
It's similar to front-end developers: it's an entirely different skill set; and, while each person in each tier can stumble around in the other tiers, it's way better if we are all there, working together toward a common goal, and especially focusing on the areas where we have each specialized our craft.
addendum: of course there are exceptions; but I think those exceptions are 1 in 100 or 1 in 1000.
What you're asking for is:
a) SRE wrote an alert for slow queries, because slow queries affect all shared infrastructure users
b) SRE gets woken up at 2 AM by his alert
c) SRE sees Dev wrote a bad query
d) SRE files a ticket for Dev
e) SRE goes back to sleep
f) SRE gets woken up at 2:30 AM by his alert, because the query fired again
g) SRE has a restless night
h) SRE goes into work, asks Dev to prioritize the ticket to fix the query because sleep
i) Dev tells SRE that maybe it'll get into the next sprint, in the meantime the alerts are SRE's problem
No thanks.
I'm sorry you've had bad experiences with alert-all-the-things ticketing paradigms.
addendum: in fact, you can automate all of this so it just shows up in a team's "known issues" query that they may manage in standup or sprint planning.
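To make the "known issues" idea a bit more concrete, here is a rough sketch of the aggregation step, assuming you already collect slow-query events somewhere. Everything in it is hypothetical: the `SlowQuery` record, the `owners` mapping from deployment to team, and the 1000 ms threshold are illustration only, not any particular tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SlowQuery:
    """One observed query (hypothetical record shape)."""
    fingerprint: str    # normalized query text, e.g. with literals stripped
    deployment: str     # deployment that issued the query
    duration_ms: float


def known_issues(
    events: Iterable[SlowQuery],
    owners: Dict[str, str],          # assumed mapping: deployment -> owning team
    threshold_ms: float = 1000.0,    # assumed cutoff for "slow"
) -> Dict[str, List[dict]]:
    """Group slow queries by owning team, one entry per distinct query."""
    merged: Dict[Tuple[str, str], dict] = {}
    for event in events:
        if event.duration_ms < threshold_ms:
            continue  # fast enough, not an issue
        team = owners.get(event.deployment, "unowned")
        entry = merged.setdefault(
            (team, event.fingerprint),
            {
                "fingerprint": event.fingerprint,
                "deployment": event.deployment,
                "count": 0,
                "worst_ms": 0.0,
            },
        )
        entry["count"] += 1
        entry["worst_ms"] = max(entry["worst_ms"], event.duration_ms)

    issues: Dict[str, List[dict]] = defaultdict(list)
    for (team, _fingerprint), entry in merged.items():
        issues[team].append(entry)
    return dict(issues)
```

The point of deduplicating by fingerprint is that a query firing every 30 minutes becomes one line item with a count for the owning team to triage, rather than a fresh page each time.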
Like this is the single biggest truth in the article, and I'm glad to see it stated so clearly. Shout it from the rooftops, please. It's a direct logical consequence, too — and yet, so many people seem to make decisions that violate this truth.
I field so many questions about "why is service X doing Y?" Have you asked the service owners?
Unfortunately, I've found one more or less has to become proficient in rapidly understanding services you don't own, because getting other people to act logically is a fool's errand.
> Are you logging to stdout ?
Nooooo, to stderr, that's literally what it is there for. (As C says, "for writing diagnostic output". Logs are that.) Also, stdout is sometimes buffered and you don't (IMO) really want that.
Any output producing program requires stdout for the output, and you can't co-mingle logs with that and have piping still work. While it is unlikely that your production service is producing output, there's no reason to do anything different with the logs. (I'd say a part of being a good production service is "don't be needlessly special".)
(But our tooling will just capture and mux the two streams together, too, so it doesn't matter, unless buffering means the error logs don't make it right before your service is killed.)
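A minimal sketch of that split, in Python since the thread doesn't settle on a language: diagnostics go to stderr, program output goes to stdout. The logger name and the helper names `make_logger` and `emit_result` are made up for illustration.

```python
import logging
import sys


def make_logger(name: str = "myservice", stream=None) -> logging.Logger:
    """Return a logger that writes diagnostics to stderr (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        # StreamHandler falls back to sys.stderr when stream is None.
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing it twice
    return logger


def emit_result(result: str) -> None:
    """Program output goes to stdout, so `service | other-tool` keeps working."""
    sys.stdout.write(result + "\n")
    sys.stdout.flush()  # stdout is block-buffered when piped; flush explicitly
```

With this, `python service.py > out.txt` leaves the log lines on the terminal and only the results in the file.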
Also, your infra team provides the metrics service, but you need to capture your own metrics. My metrics provider does not have a crystal ball, it cannot peer into your service's memory and pull out critical stats. You must push them yourself. Talk to your infra team, they can show you the API they use… (We collect common, machine level stats, like "CPU in use" or external things about your service that are easily visible, like per-container memory usage. But not your reqs/sec.)
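Roughly what "push them yourself" looks like from the service side, as a sketch. The `send` callable stands in for whatever client your infra team actually hands you; it and the `RequestMetrics` class are assumptions for illustration, not a real metrics API.

```python
import threading
import time
from typing import Callable, Dict


class RequestMetrics:
    """Counts requests in-process; only the service itself can see this."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0
        self._window_start = time.monotonic()

    def record_request(self) -> None:
        """Call once per handled request."""
        with self._lock:
            self._count += 1

    def snapshot(self) -> Dict[str, float]:
        """Return stats for the window that just ended, then reset it."""
        with self._lock:
            now = time.monotonic()
            elapsed = max(now - self._window_start, 1e-9)  # avoid divide-by-zero
            stats = {
                "requests": float(self._count),
                "reqs_per_sec": self._count / elapsed,
            }
            self._count = 0
            self._window_start = now
        return stats


def push_metrics(
    metrics: RequestMetrics,
    send: Callable[[Dict[str, float]], None],  # hypothetical infra client
) -> None:
    """Hand one snapshot to the metrics service."""
    send(metrics.snapshot())
```

In a real service you would call `record_request()` from the request handler and run `push_metrics` on a timer, with `send` wired to the API the infra team shows you.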
Bah… use syslog() (or whatever uses the same protocol) and then you get priority, name of the daemon… and if you step it up to journald, then you get to log key:value stuff.
Of course most golang developers have never heard of syslog() and think that logging is done with stdout and then a bunch of parsers to extract information that was there to begin with, had they used proper logging.
(We could perhaps arrange that, but JSON-lines is typically good enough, and easier for devs to understand.)
(Note that the KV stuff requires you to speak journald's protocol: syslog in systemd (and really, everything I've ever seen speak syslog) is the old BSD syslog protocol, which doesn't support KV data. Not that journald's protocol is particularly hard to speak.)
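For the JSON-lines route, a sketch of a formatter that keeps the key:value data structured instead of leaving it for a parser to recover later. The field names (`ts`, `level`, `logger`, `msg`) and the convention of passing extra keys under `fields` are my own choices here, not a standard.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured key:value data passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def make_json_logger(name: str = "myservice", stream=None) -> logging.Logger:
    """Logger emitting JSON lines to stderr (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # defaults to sys.stderr
        handler.setFormatter(JsonLineFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Usage would be something like `log.info("request served", extra={"fields": {"status": 200, "path": "/"}})`, which yields a single line that any JSON-aware log pipeline can index without regexes.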
Questions in this form always seem condescending. Like “I‘m smarter than you, I thought about it, you didn’t”.
If this isn't standardized in an organization it should be. Otherwise, it's the same repetitive questions, the same finger pointing, and the same miscommunication. If these are the requirements needed to put a service into production, then make it explicit. As the developer, of course I own the service, but (usually) don't have the access. Standardized as requirements, both teams can work together to produce, monitor, and troubleshoot production services smoothly. Then nobody is surprised when it is release day and they are asked these questions, with an impatient PM who has already publicly set expectations.
Almost all of the questions can be simply answered with: "This is an NFR that was created by SRE".
The important thing is to collaborate with each team and be there when architectural and design decisions are being made in the first place!
All of these questions are post-hoc, coming after the thing has been built. You would never need to ask these questions, if you help drive initial design.
Embed yourself with your teams. Ask to be part of design discussions. Remember: 50% eng 50% ops. You have no excuse!
I agree that this should happen, most successful projects have people with all sorts of knowledge contributing to it, without too many silos in place.
> You have no excuse!
However, the Ops people don't always get that power or a say in the matter. In many dysfunctional environments they'll simply be given an apparently finished service and will be told to put it in prod.
Please don't dismiss that these circumstances exist altogether and don't shift the "blame" exclusively on the people who already have their lives be needlessly hard, this isn't likely to encourage a positive outlook.
Au contraire, OP was blaming engineers with a "holier than thou" attitude. That exact attitude is the kind of thing that leads to the dysfunctional environments that you speak of.
Should SWE consider SRE at design time? Absolutely. Should SRE consider SWE at design time? Absolutely.
These are the questions I find useful:
"How is capacity for the service allocated right now?"
"How is software updated right now?"
"How was the last outage handled in as much detail as possible?"
From there, just about everything answers itself with a couple days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to).
Half of these questions could be answered with grep and once you get proficient at grep, you can answer questions faster, and more importantly, more accurately than the people who work on the services themselves.
> that YOU wrote, only YOU know how it works, thus YOU own.
I find this attitude pretty toxic. If you are in an SRE vs Product Dev mindset, then you have bigger battles to fight than service manipulation.
> Half of these questions could be answered with grep and once you get proficient at grep, you can answer questions faster, and more importantly, more accurately than the people who work on the services themselves.
SREs can own the whole development process you mean?
edit: in HN you probably want to use IntelliJ for everything, don't even mention grep please, they don't know what that is
And I am not objecting to it in the least; these are all good and vital questions.
I am objecting to anyone claiming that DevOps is anything other than "using the kinds of tools that help software development projects to help operations", and I present this as absolute evidence.
Before DevOps was en vogue (i.e. was a descriptive term more so than a buzz word), the whole premise was to collapse the bulwark between engineers and sys admins. All SWE's should care about how their application is deployed, monitored, and scaled in production. This leads to far better application engineering outcomes in most efforts in which I've been involved.
The end result of those efforts was often, but not always, engineers writing some amount of operations tooling themselves.
But now we've come full circle. There is a ton of operations tooling you can pull off the shelf, and those tools are generic/complex enough to require administration. So many DevOps roles now as a result, particularly in larger orgs, are mostly administration-focused and less so about building the tooling itself.
It feels like we've reinvented the bulwark we tried to escape previously. There's an open question as to whether, from a practical perspective, we still have gained a net win there irrespective of the logical separation between eng and ops. I'm not sure where I've landed yet on that question.
The way this is phrased, it sounds like the author is managing reliability for things where they don’t already know the answers to these questions nor do they have the context or bandwidth (or even access?) to answer it themselves. Seems like a recipe for disaster, or at the very least, a lot of frantic learn-as-you-go.
That said, as a dev, I do think we could do a lot better adding playbooks. Though on the other side of the fence, they’re often ignored with a “I don’t know what’s going on and you wrote this, can you help?”
I actually liked the DevOps-as-in-devs-also-ops as a forcing function to keep deployment relatively simple because it’s very low on the core competency/value proposition spectrums. It also has the benefit of rewarding companies for making that feasible at the expense of a tiny fraction of the cost of dedicated ops roles.
If you work in the same company, you all own the application. The customers don't care that you're "only" the SRE, or "only" the sales guy. This type of attitude is toxic and should be challenged categorically.
If you, the SRE, do not have the information needed (i.e. the "list of questions") then it's as much your responsibility to ask for it as it is the developers' job to help you answer it.
If you feel that the company culture makes it impossible for you to create these necessary processes so that everyone has the information they need, you need to either work towards changing that culture or get a new job.
You know why you "rarely get an answer for straight away "? I assume because they are working on the next ticket/delivery. A lot of this stuff is not estimated properly. A way to get it estimated properly is to work with the devs, cooperatively.
This said, for some reason, this blog post seems adversarial and gives me a bad vibe. Instead of "List of questions I’d like to get an answer from devs", it should be "we should work together to get these things done".
* SRE/DevOps folks stating the person that wrote the application has the knowledge to debug it.
* Devs saying that it's SRE/DevOps job to debug it
* Lots of comments on culture and you should do X
I know most people like the whole grassroots thing, but the only shops I've seen that are actually killing it are the ones who dictate these boundaries and responsibilities from the top down. And I've seen a lot of shops.
All services should have common health endpoints and shutdown operations.
Logging should be standardized across all the services of a company.
Having bespoke answers to these questions for each service will rapidly devolve into chaos, when you have multiple services deployed.
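As a sketch of what "common" could mean in practice: every service exposes the same health path and reacts to SIGTERM the same way. The `/healthz` path and the helper names are assumptions on my part; the real convention is whatever your organization standardizes on.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200; everything else is 404."""

    def do_GET(self) -> None:
        if self.path == "/healthz":  # assumed org-wide health path
            status, body = 200, b"ok\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # silence the default per-request logging for this sketch


def build_server(port: int = 0) -> ThreadingHTTPServer:
    """Create the server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)


def install_sigterm_shutdown(server: ThreadingHTTPServer) -> None:
    """On SIGTERM, stop serve_forever() cleanly. Call from the main thread."""

    def handle(signum, frame) -> None:
        # shutdown() blocks until serve_forever() exits, so it must not run
        # on the thread that is serving; hand it to a helper thread.
        threading.Thread(target=server.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, handle)


# Typical wiring in a service's entry point:
#   server = build_server(8080)
#   install_sigterm_shutdown(server)
#   server.serve_forever()
#   server.server_close()
```

If every service ships this same shape, the orchestrator config and the runbook entry for "is it up, and how do I stop it" become copy-paste rather than per-service archaeology.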
I thought that DevOps, by definition, is developer and operations in one. You wrote the service, you support the service, there is no boundary, and so by definition there is no such problem as the one described in this text.
DevOps complains about a problem, the proposed solution for which is to be DevOps...
This is unfortunately the death knell for DevOps organizational teams on large projects. Primarily, the design specification usually ends up being hammered into the inherent dysfunction the project was intended to solve in the first place.
Best of luck =)
The first sin they embark in is framing their argument, in part, as one of titles/labels. This is usually an institutional smell. And it’s not a pretty odor.
The second is that the person believes their role is to question others. It’s a move that insecure people play. The idea is that you keep your opponents defending themselves against questions you define, and that means there’s no time to address some of the hard questions that might circle your own “role.”
It sounds like the guy feels he knows the answers. If so, why doesn’t he jump in and do them? If he knows better how to do this SRE thing as defined by him, clearly his company has pulled a Peter principle, promoting him from something he did well, to a position where he now harps on others using their nostalgia. Value may have been lost. If he’s really that good, we can use him in the trenches. If not, he’ll learn how to try to explain why some of these PHB questions are actually hard to answer and execute.
That was supposed to be the definition of “DevOps” in the first place. Any company that has a DevOps role is really going to have an operations role by another name.
If only I had a dollar for every time some program dereferences a null.
SE owns the code, but SRE owns the running code
Other than that, I agree with everything in the post