General guidance when working as a cloud engineer

(lockedinspace.com)

227 points · lockedinspace · 3y ago · 147 comments

83 comments · 21 top-level

pondidum3y ago· 21 in thread

> Do not make production changes on Fridays

I ~hate~ dislike this advice. If you can't deploy on a Friday, you need to fix your deployment strategy. By removing Friday from when you can deploy, you're wasting 1/5 of your available days.

Note: deploy != Release[1]. Use flags, canaries etc.

[1]: https://andydote.co.uk/2022/11/02/deploy-doesnt-mean-release...

Edit: "hate" is far too strong a word for this
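
A minimal sketch of the deploy-versus-release split mentioned above, assuming a percentage-based flag kept in a plain dict. The names `bucket`, `is_enabled`, `checkout` and the flag `new_checkout` are hypothetical and not taken from any particular flag library. The new code path ships with the deploy, but nobody sees it until the rollout percentage is raised.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: dict[str, int]) -> bool:
    """A flag missing from the config counts as 0%: deployed but not released."""
    return bucket(user_id, flag) < rollout_percent.get(flag, 0)


def checkout(user_id: str, rollout_percent: dict[str, int]) -> str:
    # Both code paths are deployed; the flag decides which one is released.
    if is_enabled("new_checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"
```

Raising `new_checkout` from 0 to, say, 5 and then to 100 is the release; setting it back to 0 is the rollback, and neither needs a new deployment.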

Sevii3y ago

The point of not deploying on friday is to reduce the risk of getting paged over the weekend. It's a quality of life move for the oncall team. No deployment strategy will change the fact that deployments are the leading cause of outages.

If you can't afford to give up 1/5 of your available deployment days you have a problem somewhere in your CI/CD system.

nijave3y ago

Sure but ideally you have high enough confidence in your software that those types of issues are highly unlikely.

kevan3y ago

I'm a huge advocate for CI/CD pipelines and my team owns a lot of them. We're confident enough to deploy anytime but we choose to limit deploys to our team's business hours and not on Fridays. Why? Because we think the return going from deploying 4 days/week to 5 days/week is outweighed by the stress and morale hit of ruined weekend plans if something weird happens. There's probably situations where that extra speed makes a difference but for us deploying to all regions safely can take a full day anyways so it's pretty normal to have multiple changes flowing at the same time.

pondidum3y ago

I understand that, but would counter with that it sounds like deploy == release, and that if that weren't the case, you could deploy more often.

However, I will admit it is a trade-off; some engineering time does have to be spent to get there, and perhaps that engineering time is better used elsewhere right now.

grogenaut3y ago

CI/CD, flags, canaries don't catch everything, and can still cause outages to others. We try and do pretty heavy CI/CD where I work, but not everyone does (we, like everyone, have old systems). It's actually quite easy for us to have the well behaved systems honor release hours or not depending on how their release history has gone, or coverage, etc... but they're well behaved, so they usually have great tests, and they're not usually panicked about rolling out after hours, they have their sh*t together.

The reason we have core hours release only without director approval (aka director approval required outside core hours) is so you don't piss off another team by paging them after hours, and so you aren't trying to shove out a thing on a system that doesn't have good coverage or by turning off the safeties. In a large company I've noticed many engineers assume urgency even where there isn't. As an approver myself, most of the time someone wants to rush is because they've not even had the convo with their manager on if it's worth the risk, they are assuming urgency because that's when the sprint ends or what some TPM added to a jira ticket 4 months ago.

I admit that sounds risky itself (the engineers not having the right risk training) but this is why we have a policy and tooling... most of the times I've dug in they're just very new and worried about perception as a new employee, so my job is to shepherd them through having that convo with their managers, which inevitably has the managers saying "yes it can totally wait till Monday", and the change is inevitably a bit more hot than it should be due to accidental deadline pressure.

rexarex3y ago

I get that people really want to flex that they can deploy on Friday afternoon and NOTHING CAN GO WRONG, but it’s still foolish and flouts Murphy’s Law. It can wait.

nikau3y ago

Plus they are likely running simple Mickey mouse systems that aren't intertwined with a bunch of other systems maintained by other groups.

lockedinspaceOP3y ago

Yep, let's not forget that Murphy's Law, whenever you least expect it, boom.

dopylitty3y ago

This one made me laugh. I've been places that only allow deployments on Fridays because it gives the whole weekend to fix things if they break.

It's a good interview question as a candidate. If you ask the interviewer when they deploy and they say only Friday (or worse only once a month) then perhaps look elsewhere for your own sanity because it's a sign of serious malfunction either organizationally, technically, or both.

spc4763y ago

> or worse only once a month

Don't discount a job because of one deployment per month---it really depends upon the service. I joke that a busy year for me involved four deployments to production, but "production" for me wasn't a website, or even a web-based service, but a service involved in the call path for phone calls. Our customer is [1] the Oligarchic Cell Phone company and the SLAs are pretty severe.

I do have to ask---where do you people work where you have multiple deployments per week (or even per day)? To me, that sounds insane!

[1] Still is, even though I left a few months ago, and not because of the lack of deployments, but the shoving of Enterprise Agile [2] on the company by new management.

[2] Which is anything but Agile.

bityard3y ago

I am on a tools team, so the "customers" for our team are the company's developers. For changes that might cause an extended outage if things go sideways, we generally prefer to do those after-hours or on weekends so that we don't have all the dev teams sitting idle during work hours if something goes wrong on our end.

The upshot is that this is fairly rare and we do not have an on-call rotation. If most anything breaks over the weekend, nobody is going to notice or care until Monday morning rolls around.

fragmede3y ago

Depending on your role, that is. If your desired position is straight dev with as little ops work as possible, then yeah, red flag. However, if you're an SRE/DevOps-type person, setting up a continuous deployment system so they can deploy more often than that is a perfect landing task to sink your teeth into. Different strokes for different folks.

tbrownaw3y ago

If there's a very strong "only during standard work hours" usage pattern with SLA penalties for downtime, adapting deployment patterns to that reality can maybe be sensible.

rexarex3y ago

I love this idea.

doctor_eval3y ago

You should have both the confidence that you could deploy on Fridays, and the wisdom to know that you shouldn’t.

lopatin3y ago

Interestingly my company only deploys on Friday because it has to wait for (most) markets to close for the weekend.

pondidum3y ago

Oh now that is an interesting take!

Would that still be required if a deployment and a release are decoupled entirely, or is it unavoidable? Genuinely interested!

elric3y ago

Hating it seems a little strong. I'm sure that any team far along enough on the quality spectrum can just read this and say "we've moved beyond this worry". The post is titled "general guidance", not "absolute truths". Adjust expectations accordingly.

pondidum3y ago

You're right, "hate" is far too strong a term for this, I've updated the wording. Thanks.

charcircuit3y ago

There isn't a bottleneck in the amount of commits you can ship per day. You just get more changes to roll out on Monday.

I do disagree with the absolute statement of not doing it, but I definitely do a risk analysis whenever I ship a change on Friday and avoid anything risky and just push it off until Monday morning.

Not all changes can be put behind flags and canaries don't really fix the issue (unless you are okay in blocking important fixes from being rolled out due to your bad change killing the canary)

lockedinspaceOP3y ago

Seems reasonable, but if you work for a large company, you can't guarantee that a major release (which is a production change) won't cause any unexpected harm. I have worked with quite big organizations, deployed on a Friday, and wasted my entire Friday night and Saturday morning rolling back the 130+ components that an app had.

If you are a small company, or you do not do extra weekend shifts, I understand your point. Otherwise, you just want to live an adventure every Friday.

kator3y ago· 10 in thread

Don't forget "pets vs cattle", thinking of servers as ephemeral and working towards quickly being able to scale up/down based on demand. So often I see people "lift and shift" from a dedicated server model into the cloud and never convert their pets into cattle. This reduces flexibility later, not to mention makes it harder to respond to patching needs, scaling, and moving to optimize latency or costs.

r3trohack3r3y ago

As an ex-FAANG engineer, this is FAANG advice. Pets are just fine. Most companies aren't FAANG and don't need that class of solution.

An R620 plugged into a switch in a colo, a bash script via cron, or a cloudflare worker are just fine for a lot of use cases. The only time it stops being fine is when you can't afford to do your pet -> cattle migration as you scale up. But I don't think this is a common death for companies.

If you call "cattle" a cloudflare worker or lambda function - fine. But when we are talking about multiple redundant servers with load balancing across them, you really need to justify the cost of that vs the value you squeeze out. Sometimes you're squeezing the juice out of the rind.

mr_toad3y ago

Treating servers as disposable is about more than just scale. It helps avoid creating snowflake servers, makes DR more predictable, and makes creating dev environments much easier.

throwawaaarrgh3y ago

Pets are fine in the sense of "there's no way our servers would just disappear, and Larry the DevOps Guy who knows everything will never leave us..."

Cattle is the best approach. Practice it and make it your default.

oofnik3y ago

No, pets are not fine.

Just the other day I had to perform some maintenance on a long-running VM hosting some monitoring software. A backup VM is supposedly always running and ready to handle the workload in case of downtime. The switchover seemed to go fine, at first.

Turns out, someone long ago had manually added a cron job to the primary server without adding it to the backup server, without documenting what it does, what permissions it needs, how it works, or why it's needed. This was only discovered after some manager in a different department complained that he stopped receiving the daily report to his e-mail inbox.

If whoever deployed the report generation script took an extra hour to document what the script did, or even better, added it to VCS as part of the provisioning process for the server and re-deployed the server to ensure that the process works as expected, a day's worth of headache could have been averted.
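
A rough sketch of how a hand-added cron job like that could be caught, assuming the intended crontab lives in version control and the host's actual crontab has been dumped to text (for example with `crontab -l`). The function names and the sample entries are made up for illustration.

```python
def normalize(crontab_text: str) -> set[str]:
    """Return the set of cron entries, ignoring comments, blanks and spacing."""
    entries = set()
    for line in crontab_text.splitlines():
        line = " ".join(line.split())  # collapse runs of whitespace
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


def cron_drift(declared: str, actual: str) -> dict[str, set[str]]:
    """Compare the crontab kept in VCS with what is really on the host."""
    want, have = normalize(declared), normalize(actual)
    return {
        "undeclared": have - want,  # added by hand, unknown to the repo
        "missing": want - have,     # in the repo, but not on this host
    }


declared = "0 2 * * * /usr/local/bin/backup.sh\n"
actual = (
    "# added by hand\n"
    "0 2 * * *   /usr/local/bin/backup.sh\n"
    "0 6 * * * /home/ops/daily_report.sh\n"
)
drift = cron_drift(declared, actual)
```

Running the same comparison against the backup server would have flagged the report job as "undeclared" on the primary long before the switchover.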

dijksterhuis3y ago

> But when we are talking about multiple redundant servers with load balancing across them, you really need to justify the cost of that vs the value you squeeze out.

Sometimes you can justify using a thing for the wrong reasons.

I recently attached 1x NLB to each of our Swarm clusters to migrate to automatically managed certificates directly attached to the NLB (Digital Ocean).

$COMPANY has maybe ~3 users accessing each production application at a time. So the NLB itself is utterly pointless.

But Engineering no longer have to fix the certificates each quarter after users see an insecure browser warning and email us about it.

100% worth it for $12 pcm (per swarm cluster).

nuker3y ago

> Pets are just fine.

They are not, if they are being configured by hand, using mouse or cli

voiper13y ago

Some replies are saying this is only for "at-scale/FAANG".

It may only be absolutely necessary there, but it's helpful even for smaller folks.

Over the years, even Debian LTS goes out of support and new features and software should be installed. There's moving systems, doing restores, things breaking and wanting to "reset" to a known working state. Any time you can do something simple with docker or even just (short) step-by-step build scripts, that's a huge win.

I have playbooks for deploying a system, but with npm installs, bower installs, secrets to be hand copied from multiple places, etc, it feels more like pets and it's NOT simple to deploy.

candiddevmike3y ago

Citation needed? There are tradeoffs to both, one is not always better than the other.

hiAndrewQuinn3y ago

There might be an earlier source, but I first ran across the pets versus cattle nomenclature in Tom Limoncelli's _Handbook of System and Network Administration_ - which is a really, really good read for anyone going deep into ops space (like a cloud engineer should be).

paulryanrogers3y ago

What's the advantage of pets? Simplicity?

throwawaaarrgh3y ago· 7 in thread

Truth is an interesting concept. It's often subjective and has many forms. Within the context of the cloud, almost all cloud services are only mutable, so "truth" is whatever the current state of the cloud actually is. Whatever is in Git is merely idealism.

Whatever you are maintaining, read the docs completely first. And I mean cover to cover. Not just the one chapter you need to get a PoC up and running. You will wish you had later, and it will come in handy many times over your career. Consider it an investment in your future.

Read books on microservices before you implement them. Whatever two-line quip you read on a blog will not be as good as reading several whole books from experts.

Docker multi-stage builds won't work in some circumstances. Build optimization eventually gets complex, the more you rely on builds to be "advanced".

crdrost3y ago

Thanks for the alternative microservices quip, it was better than the original. Indeed, I find that “microservices should only perform a single task” is a really dangerous way to phrase it because we have no idea what the article means by “task.” The classic microservices separation is to separate an ordering service from a shipping service, is each of those one “task”? Or at the most extreme, is saving an order distinct from returning the list of your outstanding orders? Even when people graduate to a language of DDD and refine, often they settle on “one microservice per bounded context” where “bounded context” means “separated however I want it separated at the time,” and has no consistent principle behind... This despite the fact that I think Eric was quite explicit in his explanation of the idea, he meant a mapping of the software idea to the fussy complex world of businesspeople and business language: perhaps a better way to phrase this is that it's one microservice per archetype of user, “we have people from the warehouses who all speak the same shipping jargon, we should have a microservice specifically for them which speaks their language,” and I think most developers target their microservices smaller than that, in which case it is definitely not “one microservice per bounded context”.

Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like” etc. ... Sometimes I feel like I'm on an episode of “whose line is it anyway?”, where everything is made up and the points don't matter.

LawTalkingGuy3y ago

All guidance assumes it's for a thinking person. You should look at what could be learned, not insist that there be one unambiguous message and that it be earth-shaking.

> I find that “microservices should only perform a single task” is a really dangerous way to phrase it because we have no idea what the article means by “task.”

First off, it's not dangerous. It's just loosely defined.

Second, it's not important to know their task size to still know that two different (to them!) things shouldn't be shoved together.

If the two tasks fit together so well that it's a mere 5% extra code to do them both, then maybe they are two sides of the same coin. But if "one simple thing" is painful to implement maybe it's really two separate things under one blanket.

> often they settle on “one microservice per bounded context” where “bounded context” means “separated however I want it separated at the time,”

Right, because sometimes 'bounded context' is what Legal tells you about data residency and sometimes it's about optimal latency.

> Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like”

Strong coupling is generally easy to define, relative to loose coupling, in any given language/platform. The value of engineering around it depends on the value of doing the thing which requires it multiplied by how often you have to do it.

vsareto3y ago

Most of this comes down to what the team/org wants and who has authority to tell you it’s not defined right or tightly coupled.

It definitely is loosely defined and the rules do get stretched to fit opinions though.

rswail3y ago

"Microservices" are just "services". I use the business object/entity as the service boundary. So I don't have an "ordering" vs "shipping" service. I have an Orders service and a Shipments service.

Orders are related to Fulfillments are related to Shipments. They are "coupled" in that an Order will trigger Fulfillment, and Fulfillment will trigger Shipment (there's a Payments service in there somewhere too).
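
One way to picture that kind of coupling is a chain of events, sketched here with a tiny in-process bus. This is only an illustration under assumptions: the `EventBus` class and the event names are invented, and a real setup would use a message broker between separately deployed services.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Toy publish/subscribe bus standing in for a real message broker."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []  # every event published, in order

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.log.append(event)
        for handler in self._handlers.get(event, []):
            handler(payload)


def wire(bus: EventBus) -> None:
    # Fulfillments reacts to Orders; Shipments reacts to Fulfillments.
    bus.subscribe("order.placed", lambda o: bus.publish("fulfillment.started", o))
    bus.subscribe("fulfillment.started", lambda o: bus.publish("shipment.created", o))


bus = EventBus()
wire(bus)
bus.publish("order.placed", {"order_id": 1})
```

Each service only knows the event it listens for, so the Orders service never calls Shipments directly even though an order ultimately results in a shipment.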

k__3y ago

"read the docs completely first"

I learned from using software like Photoshop and Ableton Live, that you shouldn't underestimate the complexity of any software you use.

Take a few days or weeks, if you can, to read docs or do high quality courses on the topic and it will make your life easier in the long run.

pabs33y ago

The only truth is the memory and disk contents of the devices that make up your cloud. Everything else is an abstraction of that, which discards data and potentially is out of sync with reality.

thewisenerd3y ago

while I wouldn't wish the bootstrap problem on my worst enemy, I think the idealism helps for at least versioning configuration changes, and partial component-level tear-downs and bring ups (you don't need this often, but when you do, you do).

also, with k8s, nothing like deleting the wrong object or making a change and not knowing what it was, N revisions ago.

myfirstproject3y ago· 7 in thread

> Git should be your only source of truth. Discard any local files or changes, what's not pushed into the repository, does not exist.

Completely agree with that.

pondidum3y ago

What about secrets?

I like to do short lived credentials using Vault (e.g. vault can create say db access credentials dynamically), but for things like API keys where I can't do that..? Is the Vault KV store the source of truth?

adra3y ago

At least in cloud providers, they have secret vaults accessible to their customers. The individual secrets are stored in source code but they're encrypted. We've used SOPS as a valuable way to manage these secrets. You can certainly stand up your own secret server or equivalent, but it may not have all the same integration bells and whistles.

rexarex3y ago

We utilize version control for config/secret management as well…encrypted of course.

Edit: now that I think of it, for generated short lived passwords we also use SSM but for anything set by a human it’s in version control…

lr4444lr3y ago

Config that should be pushed into the env: it's not code or assets.

f4c390123y ago

> should

Completely agree

> only

Fine, but can substitute "git" as appropriate

> discard any local files or changes

Ok for when deployment is completely and always automated, but for that _one special case_ maybe keep a copy of the old that you can revert to until you're _really_ sure of no unwanted effects. In the meantime, find out how & why that local change got made and what can be done to automate it next time

intelVISA3y ago

Git feat. Nix = no more worries. Ever!

Probably.

lofatdairy3y ago

i think notion uses nix in production. I can't remember what their ci/cd pipeline and version control system is outside of that, or if it was even mentioned in that one comment i saw about it

nielsole3y ago· 4 in thread

Another random selection:

* When choosing internal names and identifiers (e.g. DNS) do not include org hierarchy of the team. Chances are the next reorg is coming faster than the lifetime of the identifier and renaming is often hard.

* The industry leading tools will contain bugs. From Linux kernel to deploy tooling, there are bugs everywhere. Part of your job is to identify and work around them until upstream patches make it to you if ever.

* Maintaining a patched fork is usually more expensive than setting up a workaround

* Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.

* Bought SaaS will break production in the middle of the night. Your own team will have the best context and motivation to fix/workaround them. When choosing a vendor, include the visibility into their internal monitoring as a factor for disaster recovery (exported metrics and logs of their control plane for example)

vladvasiliu3y ago

> * Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.

If only they'd tell you. We had this exact issue on AWS. Seemingly random packet drops. Metrics on both clients and servers were ok, latency specifically was very low when it worked.

Call up support "yeah, you're running into our connection limit". "Oh. What's that limit?" "yeah, I can't tell you that". His solution was that, since this was somehow related to connection tracking in the security group, I could set this to allow all/all, and set up filtering at the NACL level. Turns out I could do it for this particular issue.

This was before there was a possibility to monitor this [0]. Called up our customer manager. "Let me check". A few days later, "yeah, that's not something we divulge".

---

[0] For those who don't know, it's now possible to keep an eye on refused connections (at least on Linux). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitori... -> conntrack_allowance_exceeded
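
A small sketch of watching that counter, assuming the driver statistics are read as text in the shape printed by `ethtool -S <interface>`. The sample output below is made up; only the counter name `conntrack_allowance_exceeded` comes from the comment above.

```python
def parse_ethtool_stats(output: str) -> dict[str, int]:
    """Parse 'name: value' lines into a dict, skipping non-numeric lines."""
    stats: dict[str, int] = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


# Illustrative sample only, not captured from a real instance.
SAMPLE = """NIC statistics:
     tx_timeout: 0
     bw_in_allowance_exceeded: 0
     conntrack_allowance_exceeded: 17
"""

stats = parse_ethtool_stats(SAMPLE)
dropped = stats.get("conntrack_allowance_exceeded", 0)
```

On a real host the text would come from running the command against the instance's network interface; a value that keeps growing between samples is the signal to look at connection tracking.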

toast03y ago

Ran into that one too, but my service rep didn't mention the possibility of configuring connectionless firewall rules. I'm still bitter many years later.

anilakar3y ago

> When choosing internal names and identifiers (e.g. DNS) do not include org hierarchy of the team.

Naming in general is hard. If you name stuff based on location, use an identifier that won't change, like provider datacenter names, street addresses or customer building codes, not the current tenant or purpose of use.

For products, come up with an internal product/project name and stick to it in everything that is not immediately visible to the customer. At one point you could see the current and three previous names of our product if you popped out an iframe and opened the inspector (logo with name, page title, URL and prefixed log messages).

> Maintaining a patched fork is usually more expensive than setting up a workaround

When your bosses demand additional features a single customer requested, you absolutely have to make them understand that the functionality must be added to the main product.

throwawaaarrgh3y ago

My naming convention is like this:

- anything my company doesn't create or own, is just called whatever it is

- logical components that aren't specific to the org chart can be whatever you want

- anything org chart specific gets a randomly generated code name. an internal website allows you to register and look up any code name across the company. this allows you to find who the hell owns server "ws-prod" in a 6 year old account that nobody seems to maintain. instead you find "peanutcar-ws-prod" and can then look up who registered "peanutcar". (this also prevents the bu-org-group-product-subproduct-env mess that eventually runs over character limits)

- doesn't matter if the rest of the company doesn't do it, I do it for what I manage. later on if it gets adopted, fine, but if not, at least I won't ever have to rename my crap.
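
A toy version of such a code-name registry, as a sketch only: the word lists, the class name and the owner string are invented, and the internal website described above is reduced to an in-memory dict.

```python
import random
from typing import Optional

WORDS_A = ["peanut", "maple", "copper", "violet", "granite"]
WORDS_B = ["car", "river", "lamp", "falcon", "bridge"]


class CodeNameRegistry:
    """Hands out random two-word code names and remembers who owns them."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._owners: dict[str, str] = {}

    def register(self, owner: str) -> str:
        for _ in range(1000):
            name = self._rng.choice(WORDS_A) + self._rng.choice(WORDS_B)
            if name not in self._owners:
                self._owners[name] = owner
                return name
        raise RuntimeError("no free code names left in the word lists")

    def owner_of(self, resource_name: str) -> Optional[str]:
        # The code name is everything before the first hyphen.
        return self._owners.get(resource_name.split("-", 1)[0])


def resource_name(code_name: str, component: str, env: str) -> str:
    return f"{code_name}-{component}-{env}"


registry = CodeNameRegistry(seed=1)
code = registry.register("payments-team@example.com")
server = resource_name(code, "ws", "prod")
```

Because the code name carries no org-chart meaning, a reorg only changes the owner recorded in the registry, never the resource names themselves.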

WolfOliver3y ago· 4 in thread

"Microservices should only perform a single task." -> I guess this advice is the reason they are so widely misunderstood, see: https://linkedrecords.com/challenging-the-single-responsibil...

adamisom3y ago

Wow and I thought functions should only perform a single task. I need to keep up with the times! Apparently you need an entire deployable app and API to do anything these days. I guess it makes sense. How else could we justify so many software engineers!?

elric3y ago

So many? Last I checked there was a huge shortage, and with the exception of a couple of notable bloatware companies, most seem to be understaffed?

_vertigo3y ago

I think this advice really depends on your scaling needs. If you need to scale your services up, it’s a lot easier to do that if each service only does one thing.

It also depends on how much functionality you consider to be “one thing”.

dagss3y ago

I never understood when people talk about microservices and scaling (for traffic).

I thought microservices is a solution to scale development teams, not for traffic.

If you have a horizontally scalable monolith, it can scale pretty much as far as you want. If you split services along functional boundaries (i.e., vertical) then a split from 1 to 2 services will in the extreme best case scenario give you 2x scaleup; further splits give you less. So: If load is the issue, work on horizontal scaling, not microservices.

What am I missing?
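
The arithmetic behind that argument can be written down under a deliberately simplified model: every host has the same capacity, load balancing is ideal, and the overhead of extra network hops is ignored. The function names are illustrative.

```python
def functional_split_speedup(load_shares: list[float]) -> float:
    """Best-case throughput gain from giving each function its own host.

    The busiest service saturates first, so the gain is bounded by
    total load divided by the largest single share.
    """
    return sum(load_shares) / max(load_shares)


def horizontal_speedup(replicas: int) -> float:
    """Best-case gain from running N identical copies behind a balancer."""
    return float(replicas)


even_split = functional_split_speedup([0.5, 0.5])    # two equally busy services
skewed_split = functional_split_speedup([0.8, 0.2])  # one service dominates
four_replicas = horizontal_speedup(4)
```

Splitting one monolith into two services reaches 2x only when the load divides evenly; with an 80/20 split the ceiling is about 1.25x, while adding replicas of the monolith scales with the replica count in this model.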

elric3y ago· 4 in thread

> Certify yourself with official courses.

Can anyone recommend some certifications that are worthwhile? I realize that this is a very broad ask, but the advice is also rather broad.

hiAndrewQuinn3y ago

_one word answer: AWS_

There's a lot of conceptual carryover between cloud platform offerings, so getting _any_ of the big 3 (GCP, AWS, Az) is likely to help you out a ton if you're new to the space. Much like how your first programming language took much longer than your second through fifth ones, learning your first platform well enough to get employed is much more challenging than filing the serial numbers off and learning the new quirks of the other two.

In the absence of further information as to your career goals, I'd lightly recommend AWS. It came into existence years before the others and can offer SLAs that approach "this S3 bucket will outlive you in the event of thermonuclear war".

Azure is where I have lived so far in my career and it seems to be catering more towards enterprise and government needs. I actually imagine finding an Azure shop is harder than an AWS shop if you haven't already worked at one before, but it's a pretty sweet gig otherwise.

GCP goes the other direction from what I've seen - much more startup-oriented, as the newest kid on the block itself. It looked nice from the last time I played around with it.

Kubernetes exists as a useful "stage 2" if you want to go further down the pipeline, as a technology whose business raison d'etre is to commoditize cloud providers. _In theory_ a Kubernetes cluster can be engineered to run unaltered on any of the big 3, since they all offer k8s clusters.

It's also totally cool to say that's okay thanks, I'll stick with simple architectures and a focus on getting MVPs out the door rapidly. For me $DAYJOB is spent between Azure and k8s, but my side projects start the same way every time - SQLite, Django, and _maybe_ Docker Compose to sidecar Litestream if I'm feeling extra infra-inclined that day. Really there's no reason to get dogmatic about anything in a space with so many options.

eikenberry3y ago

Just about any Certificate is worthwhile depending on your reasons. Best case I've seen them used for is to help you break into new technology areas, EG. you want to work as an SRE for AWS services, having a few AWS Certificates under your belt might be just enough to get you that interview (plus you'd kill at AWS trivia nights).

intelVISA3y ago

Love AWS trivia. Most effective way to harm your employer's wallet?

EKS?? EKS? EC2? EBS..? ELB..? Ah no way it was /data egress/ of course.

oneepic3y ago

I'd absolutely go to an AWS trivia night/lunch hour at work. Maybe GCP? Azure?

birdymcbird3y ago· 2 in thread

> A good monitoring system, well-organized repository, fault-tolerance workloads and automation mechanisms are the basis of any architecture.

Monitoring/alarming, and knowing what to monitor. Also, properly instrument your services or whatever it is you have. Take time to reflect on what are the signals that tell you operational health. An error metric alone is useless if you don’t know the denominator. Also be careful to avoid adding noisy metrics that cause panic for no reason.
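
To make the denominator point concrete, here is a small sketch of an alert rule that looks at the error rate instead of the raw error count. The thresholds `max_rate` and `min_requests` are arbitrary example values, not recommendations.

```python
from typing import Optional


def error_rate(errors: int, requests: int) -> Optional[float]:
    """Errors as a fraction of requests, or None when there was no traffic."""
    if requests == 0:
        return None
    return errors / requests


def should_alert(
    errors: int,
    requests: int,
    max_rate: float = 0.01,
    min_requests: int = 100,
) -> bool:
    """Alert on a high error rate, but stay quiet on tiny sample sizes."""
    rate = error_rate(errors, requests)
    if rate is None or requests < min_requests:
        return False
    return rate > max_rate
```

The same 50 errors mean very different things at 200 requests and at a million requests, and the `min_requests` floor is one simple way to avoid the noisy alerts the comment warns about.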

I’m not sure what fault tolerance means in this context. Very handwavy statement. I think if you have dependencies, have a plan and understanding of which ones tipping over will bring down your service or how you can build resiliency. For example, some feature on your page requires talking to a recommendations service. If the service goes down, can you fall back to a generic list of hard coded recommendations or some static asset?
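
The recommendations example could look roughly like this. It is a sketch under assumptions: `fetch` stands in for whatever client calls the recommendations service, and the fallback items are placeholders.

```python
from typing import Callable

# Hard-coded generic list served when the dependency is unavailable.
FALLBACK = ["bestseller-1", "bestseller-2", "bestseller-3"]


def recommendations(user_id: str, fetch: Callable[[str], list[str]]) -> list[str]:
    """Return personalised items, degrading to a static list on failure."""
    try:
        items = fetch(user_id)
    except Exception:
        # The page still renders; only personalisation is lost.
        return list(FALLBACK)
    return items or list(FALLBACK)


def broken_service(user_id: str) -> list[str]:
    raise ConnectionError("recommendations service is down")


def healthy_service(user_id: str) -> list[str]:
    return [f"picked-for-{user_id}"]
```

In a real service the call would also need a timeout, since a dependency that hangs is as harmful as one that fails outright.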

As for automation: yeah, have test workflows built into your CI/CD harness. And avoid manual steps there requiring human intervention. Use canaries to test certain functions are up and running as expected, etc

lockedinspaceOP3y ago

Maybe I was a bit vague in the fault tolerance statement. What I mean is to have a high availability in your services, e.g: using AWS ASG for your servers, having more than one replica for your Kubernetes pods.

If one of the servers/pods fails, the process behind it detects it as "unhealthy" (with good monitoring/alarming, as you mentioned) and replaces it with a new server with the same software characteristics. So for the end user, your client, nothing has changed: the load just moved to a single instance for about 5-10 mins until a new server was deployed.
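
The replace-on-unhealthy behaviour boils down to a reconciliation loop, sketched below. This is not how any specific autoscaling group or Kubernetes controller is implemented; `Instance`, `reconcile` and the id generator are invented for the example.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable


@dataclass
class Instance:
    id: str
    healthy: bool = True


def reconcile(
    instances: list[Instance],
    desired: int,
    new_id: Callable[[], str],
) -> list[Instance]:
    """Drop unhealthy instances and launch replacements up to `desired`."""
    kept = [i for i in instances if i.healthy]
    while len(kept) < desired:
        kept.append(Instance(id=new_id()))
    return kept


ids = count(1)
fleet = [Instance("a"), Instance("b", healthy=False)]
fleet = reconcile(fleet, desired=2, new_id=lambda: f"new-{next(ids)}")
```

Between the health check failing and the replacement becoming ready, the surviving instance carries all the traffic, which is the 5-10 minute window described above.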

birdymcbird3y ago

Cool makes sense

raydiatian3y ago· 2 in thread

> If you need to build an architecture which involves microservices, I am sure that your cloud provider has a solution that fits better than Kubernetes. E.g: ECS for AWS. Kubernetes is a fantastic toolkit, but only shines when all that it has to offer, gets used.

As far as FAAS goes, I think more people need to go check out Cloud Run as a Knative implementation. Having used it for sometime now it feels like a near-perfect FAAS solution. The only gripe I have is that versioning is a bit dopey. But hey, if I can have autoscaling services with absolute impunity over how my HTTP interface is shaped (looking at you AWS lambda) and without needing to worry about Kubernetes headaches, I’m perfectly happy to embed version names in service domains.

gizzlon3y ago

Agreed, but would call Cloud Run a Papas not a FaaS

raydiatian3y ago

Papas=PaaS, or? If Papas I am unfamiliar with the term.

If PaaS, isn’t Gcloud itself the PaaS? For instance cloud run, the product inside Gcloud, is ephemeral and stateless, which wouldn’t be at all good for trying to make a DB.

abledon3y ago· 1 in thread

> If you need to build an architecture which involves microservices, I am sure that your cloud provider has a solution that fits better than Kubernetes. E.g: ECS for AWS.

Thank you! So many people running unnecessary things on Kubernetes

rswail3y ago

On the other hand, K8S provides you with orchestration abstraction across AWS, GCP, Azure, VMWare, bare metal.

There are distinct advantages to that in terms of both development (running a local K8S cluster is relatively easy) and deployment.

ECS has no distinct advantages over K8S (or EKS in AWS land). Particularly now that there are CRDs for K8S that allow you to deploy AWS functionality (eg ALBs, TGs) from K8S.

TrackerFF3y ago

"Learn to say: I do not know about this/that. You cannot know everything that gets presented to you. The bad habit comes when the same technological asset appears for a second time and you still do not know how it works or what it does."

Absolutely. I've seen so many junior engineers / devs go on about it like this:

Someone higher up: Could you please look at this problem? I need it fixed ASAP.

Jr. Engineer, presented with a problem he's never seen before: No problem, I will look into it!

Someone higher up (the next day): Did you fix the problem?

Jr. Engineer: Sorry, I still haven't gotten around to looking at it / I'm still working on it / etc.

Someone higher up: We really need it fixed today, please prioritize it and give me a call when it is fixed.

Jr. Engineer works on the problem all night, feeling stressed out, not wanting to let down his seniors.

zikduruqe3y ago

EVERYTHING costs money. Tag every resource. Come up with ways to show cost avoidance and cost savings. This will be appreciated more by management than any code you can bang out.
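
A rough sketch of what "tag every resource" buys you, assuming you can export resources as plain dicts. The `team` tag key, the resource ids and the `monthly_cost` field are made up for illustration; real billing exports look different per provider.

```python
from collections import defaultdict


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per tag value and list resources missing the tag."""
    totals = defaultdict(float)
    untagged = []
    for r in resources:
        value = r.get("tags", {}).get(tag_key)
        if value is None:
            untagged.append(r["id"])    # nobody owns this spend yet
        else:
            totals[value] += r["monthly_cost"]
    return dict(totals), untagged


# Hypothetical inventory, not real billing data.
resources = [
    {"id": "vm-1", "monthly_cost": 70.0, "tags": {"team": "payments"}},
    {"id": "db-1", "monthly_cost": 120.0, "tags": {"team": "payments"}},
    {"id": "vm-2", "monthly_cost": 35.0, "tags": {}},
]
totals, untagged = cost_by_tag(resources, "team")
```

The untagged list is usually the first thing worth showing management: it is spend nobody can account for.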

rr8083y ago

I love monitoring but after a few decades working I still haven't found a good way to monitor everything. Still a mix of email, pagerduty, prometheus, cloudwatch, websites, kibana consoles. Surely there is a good way to do this? I figure some of the new BI dashboards would be good but haven't seen much usage.

nijave3y ago

>Before jumping straight into a new technology, read and understand their docs

The number of issues I've seen that turn out to be documented features... (or, more accurately, things just being configured incorrectly)

virgilp3y ago

> Microservices should only perform a single task. If you are not able to achieve that isolation, maybe you should switch back to a monolithic architecture. Do not get fooled by the current trends, microservices are not meant for everything.

I feel like this is spectacularly bad advice. "Do not get fooled by shades of grey, things are meant to be either black or white!"

mustafabisic13y ago

Some solid career advice in there as well.

I feel like this could be used as one of those "How to 10x career" articles - and be better than all of them.

bobismyuncle3y ago

Some of these are lessons you only really learn once you make the mistake yourself

lockedinspaceOP3y ago

A helpful list of things to have in mind when working with anything tech related.

raxits3y ago

One more

Have a good logging & rollback strategy well communicated across stakeholders
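
As a toy illustration of pairing the two (logging every step, and falling back to the known-good version when a check fails), assuming a hypothetical `health_check` callable and version strings; a real pipeline would call your deploy tooling here.

```python
import logging

log = logging.getLogger("deploy")


def deploy_with_rollback(current, candidate, health_check):
    """Return the version left running: candidate if healthy, else current."""
    log.info("deploying %s (currently on %s)", candidate, current)
    if health_check(candidate):
        log.info("%s passed health check", candidate)
        return candidate
    # Log the rollback loudly so stakeholders can see what happened and why.
    log.warning("%s failed health check, rolling back to %s", candidate, current)
    return current


# Hypothetical versions; the lambda stands in for a real probe.
running = deploy_with_rollback("v1.4.0", "v1.5.0", lambda version: version != "v1.5.0")
```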

martynvandijke3y ago

Nice guide. Just curious, are there more of these guides?

qaq3y ago

Don't just read docs, try things -- make a POC. The number of times we hit something that "should work" according to the docs but doesn't is very high.


