I ~hate~ dislike this advice. If you can't deploy on a Friday, you need to fix your deployment strategy. By removing Friday from when you can deploy, you're wasting 1/5 of your available days.
Note: deploy != Release[1]. Use flags, canaries etc.
[1]: https://andydote.co.uk/2022/11/02/deploy-doesnt-mean-release...
Edit: hate is far too strong a word for this
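To make the deploy != release point concrete, here is a minimal sketch of a percentage-based feature flag. The flag name `new-checkout`, the function names, and the hash-bucket scheme are all assumptions made up for illustration, not any particular flag library's API. The idea is that both code paths are already deployed, and the rollout percentage alone decides who actually sees the new one.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_released(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """True if the already-deployed feature should be visible to this user."""
    return rollout_bucket(flag_name, user_id) < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # Both branches ship in the same deployment; the flag controls the release.
    if is_released("new-checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"
```

With `rollout_percent=0` the code can go out on a Friday without anyone seeing it. Raising the number on Monday is the release, and setting it back to 0 is the rollback.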
If you can't afford to give up 1/5 of your available deployment days you have a problem somewhere in your CI/CD system.
However, I will admit it is a trade-off; some engineering time does have to be spent to get there, and perhaps that engineering time is better used elsewhere right now.
The reason we have core hours release only without director approval (aka director approval required outside core hours) is so you don't piss off another team by paging them after hours, and so you aren't trying to shove out a thing on a system that doesn't have good coverage or by turning off the safeties. In a large company I've noticed many engineers assume urgency even where there isn't any. As an approver myself, most of the time when someone wants to rush, it's because they haven't even had the convo with their manager about whether it's worth the risk; they are assuming urgency because that's when the sprint ends or because of what some TPM added to a Jira ticket 4 months ago.
I admit that sounds risky itself (the engineers not having the right risk training) but this is why we have a policy and tooling... most of the time when I've dug in, they're just very new and worried about perception as a new employee, so my job is to shepherd them through having that convo with their managers, which inevitably has the managers saying "yes it can totally wait till monday", and the change is inevitably a bit more hot than it should be due to accidental deadline pressure.
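A rough sketch of what such a policy gate could look like in tooling. The core-hours window (Monday to Friday, 09:00 to 16:00) and the function names are assumptions for illustration only; the actual window and approval flow will differ per company.

```python
from datetime import datetime, time

# Assumed core-hours window; adjust to whatever your policy actually says.
CORE_START = time(9, 0)
CORE_END = time(16, 0)


def in_core_hours(now: datetime) -> bool:
    """True on weekdays (Mon=0 .. Fri=4) between CORE_START and CORE_END."""
    return now.weekday() < 5 and CORE_START <= now.time() < CORE_END


def may_deploy(now: datetime, director_approved: bool = False) -> bool:
    """Inside core hours anyone may deploy; outside them approval is required."""
    return in_core_hours(now) or director_approved
```

The point of putting this in tooling is that the default answer outside core hours is "no", which forces the is-it-really-urgent conversation to happen before the change ships.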
It's a good interview question as a candidate. If you ask the interviewer when they deploy and they say only Friday (or worse only once a month) then perhaps look elsewhere for your own sanity because it's a sign of serious malfunction either organizationally, technically, or both.
Don't discount a job because of one deployment per month---it really depends upon the service. I joke that a busy year for me involved four deployments to production, but "production" for me wasn't a website, or even a web-based service, but a service involved in the call path for phone calls. Our customer is [1] the Oligarchic Cell Phone company and the SLAs are pretty severe.
I do have to ask---where do you people work where you have multiple deployments per week (or even per day)? To me, that sounds insane!
[1] Still is, even though I left a few months ago, and not because of the lack of deployments, but the shoving of Enterprise Agile [2] on the company by new management.
[2] Which is anything but Agile.
The upshot is that this is fairly rare and we do not have an on-call rotation. If most anything breaks over the weekend, nobody is going to notice or care until Monday morning rolls around.
Would that still be required if a deployment and a release are decoupled entirely, or is it unavoidable? Genuinely interested!
I do disagree with the absolute statement of not doing it, but I definitely do a risk analysis whenever I ship a change on Friday and avoid anything risky and just push it off until Monday morning.
Not all changes can be put behind flags, and canaries don't really fix the issue (unless you are okay with blocking important fixes from being rolled out because your bad change killed the canary).
If you are a small company, or you do not do extra weekend shifts, I understand your point. Elsewhere, you just want to live an adventure every Friday.
An R620 plugged into a switch in a colo, a bash script via cron, or a cloudflare worker are just fine for a lot of use cases. The only time it stops being fine is when you can't afford to do your pet -> cattle migration as you scale up. But I don't think this is a common death for companies.
If you call "cattle" a cloudflare worker or lambda function - fine. But when we are talking about multiple redundant servers with load balancing across them, you really need to justify the cost of that vs the value you squeeze out. Sometimes you're squeezing the juice out of the rind.
Cattle is the best approach. Practice it and make it your default.
Just the other day I had to perform some maintenance on a long-running VM hosting some monitoring software. A backup VM is supposedly always running and ready to handle the workload in case of downtime. The switchover seemed to go fine, at first.
Turns out, someone long ago had manually added a cron job to the primary server without adding it to the backup server, without documenting what it does, what permissions it needs, how it works, or why it's needed. This was only discovered after some manager in a different department complained that he stopped receiving the daily report to his e-mail inbox.
If whoever deployed the report generation script took an extra hour to document what the script did, or even better, added it to VCS as part of the provisioning process for the server and re-deployed the server to ensure that the process works as expected, a day's worth of headache could have been averted.
Sometimes you can justify using a thing for the wrong reasons.
I recently attached 1x NLB to each of our Swarm clusters to migrate to automatically managed certificates directly attached to the NLB (Digital Ocean).
$COMPANY has maybe ~3 users accessing each production application at a time. So the NLB itself is utterly pointless.
But Engineering no longer have to fix the certificates each quarter after users see an insecure browser warning and email us about it.
100% worth it for $12 pcm (per swarm cluster).
They are not, if they are being configured by hand, using mouse or cli
It may only be absolutely necessary there, but it's helpful even for smaller folks.
Over the years, even Debian LTS goes out of support and new features and software should be installed. There's moving systems, doing restores, things breaking and wanting to "reset" to a known working state. Any time you can do something simple with docker or even just (short) step-by-step build scripts, that's a huge win.
I have playbooks for deploying a system, but with npm installs, bower installs, secrets to be hand copied from multiple places, etc, it feels more like pets and it's NOT simple to deploy.
Whatever you are maintaining, read the docs completely first. And I mean cover to cover. Not just the one chapter you need to get a PoC up and running. You will wish you had later, and it will come in handy many times over your career. Consider it an investment in your future.
Read books on microservices before you implement them. Whatever two-line quip you read on a blog will not be as good as reading several whole books from experts.
Docker multi-stage builds won't work in some circumstances. Build optimization eventually gets complex, the more you rely on builds to be "advanced".
Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like” etc. ... Sometimes I feel like I'm on an episode of “whose line is it anyway?”, where everything is made up and the points don't matter.
> I find that “microservices should only perform a single task” is a really dangerous way to phrase it because we have no idea what the article means by “task.”
First off, it's not dangerous. It's just loosely defined.
Second, it's not important to know their task size to still know that two different (to them!) things shouldn't be shoved together.
If the two tasks fit together so well that it's a mere 5% extra code to do them both, then maybe they are two sides of the same coin. But if "one simple thing" is painful to implement maybe it's really two separate things under one blanket.
> often they settle on “one microservice per bounded context” where “bounded context” means “separated however I want it separated at the time,”
Right, because sometimes 'bounded context' is what Legal tells you about data residency and sometimes it's about optimal latency.
> Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like”
Strong coupling is generally easy to define, relative to loose coupling, in any given language/platform. The value of engineering around it depends on the value of doing the thing which requires it multiplied by how often you have to do it.
It definitely is loosely defined and the rules do get stretched to fit opinions though.
Orders are related to Fulfillments are related to Shipments. They are "coupled" in that an Order will trigger Fulfillment, and Fulfillment will trigger Shipment (there's a Payments service in there somewhere too).
I learned from using software like Photoshop and Ableton Live, that you shouldn't underestimate the complexity of any software you use.
Take a few days or weeks, if you can, to read docs or do high quality courses on the topic and it will make your life easier in the long run.
also, with k8s, nothing like deleting the wrong object or making a change and not knowing what it was, N revisions ago.
Completely agree with that.
I like to do short lived credentials using Vault (e.g. vault can create say db access credentials dynamically), but for things like API keys where I can't do that..? Is the Vault KV store the source of truth?
Edit: now that I think of it, for generated short lived passwords we also use SSM but for anything set by a human it’s in version control…
Completely agree
> only
Fine, but can substitute "git" as appropriate
> discard any local files or changes
Ok for when deployment is completely and always automated, but for that _one special case_ maybe keep a copy of the old that you can revert to until you're _really_ sure of no unwanted effects. In the meantime, find out how & why that local change got made and what can be done to automate it next time
Probably.
* When choosing internal names and identifiers (e.g. DNS) do not include org hierarchy of the team. Chances are the next reorg is coming faster than the lifetime of the identifier and renaming is often hard.
* The industry leading tools will contain bugs. From Linux kernel to deploy tooling, there are bugs everywhere. Part of your job is to identify and work around them until upstream patches make it to you if ever.
* Maintaining a patched fork is usually more expensive than setting up a workaround
* Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.
* Bought SaaS will break production in the middle of the night. Your own team will have the best context and motivation to fix/workaround them. When choosing a vendor, include the visibility into their internal monitoring as a factor for disaster recovery (exported metrics and logs of their control plane for example)
If only they'd tell you. We had this exact issue on AWS. Seemingly random packet drops. Metrics on both clients and servers were ok, latency specifically was very low when it worked.
Call up support "yeah, you're running into our connection limit". "Oh. What's that limit?" "yeah, I can't tell you that". His solution was that, since this was somehow related to connection tracking in the security group, I could set this to allow all/all, and set up filtering at the NACL level. Turns out I could do it for this particular issue.
This was before there was a possibility to monitor this [0]. Called up our customer manager. "Let me check". A few days later, "yeah, that's not something we divulge".
---
[0] For those who don't know, it's now possible to keep an eye on refused connections (at least on Linux). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitori... -> conntrack_allowance_exceeded
Naming in general is hard. If you name stuff based on location, use an identifier that won't change, like provider datacenter names, street addresses or customer building codes, not the current tenant or purpose of use.
For products, come up with an internal product/project name and stick to it in everything that is not immediately visible to the customer. At one point you could see the current and three previous names of our product if you popped out an iframe and opened the inspector (logo with name, page title, URL and prefixed log messages).
> Maintaining a patched fork is usually more expensive than setting up a workaround
When your bosses demand additional features a single customer requested, you absolutely have to make them understand that the functionality must be added to the main product.
- anything my company doesn't create or own, is just called whatever it is
- logical components that aren't specific to the org chart can be whatever you want
- anything org chart specific gets a randomly generated code name. an internal website allows you to register and look up any code name across the company. this allows you to find who the hell owns server "ws-prod" in a 6 year old account that nobody seems to maintain. instead you find "peanutcar-ws-prod" and can then look up who registered "peanutcar". (this also prevents the bu-org-group-product-subproduct-env mess that eventually runs over character limits)
- doesn't matter if the rest of the company doesn't do it, I do it for what I manage. later on if it gets adopted, fine, but if not, at least I won't ever have to rename my crap.
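The code-name scheme in the list above could be sketched like this. The word lists, the class, and its method names are hypothetical; a real version would sit behind the internal website and a database rather than an in-memory dict.

```python
import secrets
from typing import Dict, Optional

# Hypothetical word lists; a real registry would use much longer ones.
WORDS_A = ["peanut", "copper", "maple", "velvet", "cobalt"]
WORDS_B = ["car", "lamp", "river", "kite", "anvil"]


class CodeNameRegistry:
    """In-memory stand-in for the company-wide code name lookup."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def register(self, owner: str) -> str:
        """Pick an unused random code name and record who owns it."""
        for _ in range(1000):
            name = secrets.choice(WORDS_A) + secrets.choice(WORDS_B)
            if name not in self._owners:
                self._owners[name] = owner
                return name
        raise RuntimeError("no free code name found; add more words")

    def owner_of(self, resource_name: str) -> Optional[str]:
        """Look up the owner from a resource name such as 'peanutcar-ws-prod'."""
        code_name = resource_name.split("-", 1)[0]
        return self._owners.get(code_name)
```

Because the prefix carries no org-chart meaning, a reorg only changes the owner recorded in the registry; the resource names themselves never need renaming.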
It also depends on how much functionality you consider to be “one thing”.
I thought microservices are a solution for scaling development teams, not for traffic.
If you have a horizontally scalable monolith, it can scale pretty much as far as you want. If you split services along functional boundaries (i.e., vertical) then a split from 1 to 2 services will in the extreme best case scenario give you 2x scaleup; further splits give you less. So: If load is the issue, work on horizontal scaling, not microservices.
What am I missing?
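The arithmetic behind that argument, as a back-of-the-envelope sketch. It assumes each service lands on one node of equal capacity, that the per-request cost does not change after the split, and that there is no network overhead between services, all of which favor the split. The function names are made up for illustration.

```python
from typing import Sequence


def vertical_split_speedup(load_shares: Sequence[float]) -> float:
    """Best-case throughput gain from giving each function its own node.

    load_shares holds the relative load of each function. The busiest
    service saturates first, so it bounds the gain for the whole system.
    """
    return sum(load_shares) / max(load_shares)


def horizontal_speedup(replicas: int) -> float:
    """Ideal throughput gain from running N identical replicas."""
    return float(replicas)
```

A perfectly even two-way split gives `vertical_split_speedup([50, 50]) == 2.0`, but a more realistic 80/20 split gives only `1.25`, while two replicas of the monolith still give `2.0` under the same idealized assumptions. Going from two to three perfectly even services raises the bound from 2.0 to 3.0, which is only a further 1.5x.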
Can anyone recommend some certifications that are worthwhile? I realize that this is a very broad ask, but the advice is also rather broad.
There's a lot of conceptual carryover between cloud platform offerings, so getting _any_ of the big 3 (GCP, AWS, Az) is likely to help you out a ton if you're new to the space. Much like how your first programming language took much longer than your second through fifth ones, learning your first platform well enough to get employed is much more challenging than filing the serial numbers off and learning the new quirks of the other two.
In the absence of further information as to your career goals, I'd lightly recommend AWS. It came into existence years before the others and can offer SLAs that approach "this S3 bucket will outlive you in the event of thermonuclear war".
Azure is where I have lived so far in my career and it seems to be catering more towards enterprise and government needs. I actually imagine finding an Azure shop is harder than an AWS shop if you haven't already worked at one before, but it's a pretty sweet gig otherwise.
GCP goes the other direction from what I've seen - much more startup-oriented, as the newest kid on the block itself. It looked nice from the last time I played around with it.
Kubernetes exists as a useful "stage 2" if you want to go further down the pipeline, as a technology whose business raison d'etre is to commoditize cloud providers. _In theory_ a Kubernetes cluster can be engineered to run unaltered on any of the big 3, since they all offer k8s clusters.
It's also totally cool to say that's okay thanks, I'll stick with simple architectures and a focus on getting MVPs out the door rapidly. For me $DAYJOB is spent between Azure and k8s, but my side projects start the same way every time - SQLite, Django, and _maybe_ Docker Compose to sidecar Litestream if I'm feeling extra infra-inclined that day. Really there's no reason to get dogmatic about anything in a space with so many options.
Monitoring/alarming, and knowing what to monitor. Also, properly instrument your services or whatever it is you have. Take time to reflect on what are the signals that tell you operational health. An error metric alone is useless if you don’t know the denominator. Also be careful to avoid adding noisy metrics that cause panic for no reason.
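A small sketch of the denominator point. The 5% threshold and the minimum request count are arbitrary assumptions, and the function names are made up; the shape is what matters: alarm on a rate, and only when there is enough traffic for the rate to mean something.

```python
from typing import Optional


def error_rate(errors: int, requests: int) -> Optional[float]:
    """Errors divided by total requests, or None when there was no traffic."""
    if requests == 0:
        return None
    return errors / requests


def should_alarm(errors: int, requests: int,
                 threshold: float = 0.05, min_requests: int = 100) -> bool:
    """Alarm only on a high rate backed by enough traffic to be meaningful."""
    rate = error_rate(errors, requests)
    if rate is None or requests < min_requests:
        return False  # too little data; paging here would just be noise
    return rate > threshold
```

The same 50 errors are a non-event at 100,000 requests and a real problem at 200, which is exactly what a bare error count hides.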
I’m not sure what fault tolerance means in this context. Very handwavy statement. I think if you have dependencies, have a plan and an understanding of which ones tipping over will bring down your service, and of how you can build resiliency. For example, some feature on your page requires talking to a recommendations service. If the service goes down, can you fall back to a generic list of hard-coded recommendations or some static asset?
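The recommendations example could look roughly like this. The default list and the `fetch` callable are placeholders; a real client would also want a timeout and probably a circuit breaker, which this sketch leaves out.

```python
from typing import Callable, List

# Hypothetical hard-coded list served when the dependency is unavailable.
DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2", "bestseller-3"]


def recommendations_with_fallback(fetch: Callable[[str], List[str]],
                                  user_id: str) -> List[str]:
    """Return personalized items, or the generic list if the service fails."""
    try:
        items = fetch(user_id)
    except Exception:
        # The dependency tipped over; degrade instead of failing the page.
        return list(DEFAULT_RECOMMENDATIONS)
    # An empty answer is treated like a failure for display purposes.
    return items or list(DEFAULT_RECOMMENDATIONS)
```

The page still renders when the recommendations service is down; it is just less personalized.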
As for automation: yeah, have test workflows built into your CI/CD harness. And avoid manual steps there requiring human intervention. Use canaries to test certain functions are up and running as expected, etc
If one of the servers/pods fails, the process behind it detects it as "unhealthy" (given good monitoring/alarming, as you mentioned) and replaces it with a new server with the same software characteristics. So for the end user, i.e. your client, nothing has changed; the load just moved to a single instance for about 5-10 mins until a new server was deployed.
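The replace-on-unhealthy behavior described above, reduced to a toy reconcile step. `Instance`, `reconcile`, and the naming scheme are invented for illustration; real orchestrators and autoscaling groups do considerably more (draining, backoff, readiness checks).

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Instance:
    name: str


_ids = count(1)  # source of unique suffixes for replacement instances


def reconcile(instances: List[Instance],
              is_healthy: Callable[[Instance], bool],
              desired: int) -> List[Instance]:
    """Drop unhealthy instances and add fresh ones up to the desired count."""
    pool = [inst for inst in instances if is_healthy(inst)]
    while len(pool) < desired:
        # Stands in for "deploy a new server from the same image/config".
        pool.append(Instance(name=f"replacement-{next(_ids)}"))
    return pool
```

This only works if the instances are cattle: the replacement has to be buildable from the same image or config without anyone logging in to fix it up by hand.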
As far as FAAS goes, I think more people need to go check out Cloud Run as a Knative implementation. Having used it for some time now, it feels like a near-perfect FAAS solution. The only gripe I have is that versioning is a bit dopey. But hey, if I can have autoscaling services with absolute impunity over how my HTTP interface is shaped (looking at you AWS lambda) and without needing to worry about Kubernetes headaches, I’m perfectly happy to embed version names in service domains.
If PaaS, isn’t Gcloud itself the PaaS? For instance cloud run, the product inside Gcloud, is ephemeral and stateless, which wouldn’t be at all good for trying to make a DB.
Thank you! So many people running unnecessary things on Kubernetes
There are distinct advantages to that in terms of both development (running a local K8S cluster is relatively easy) and deployment.
ECS has no distinct advantages over K8S (or EKS in AWS land). Particularly now that there are CRDs for K8S that allow you to deploy AWS functionality (eg ALBs, TGs) from K8S.
Absolutely. I've seen so many junior engineers / devs go on about it like this:
Someone higher up: Could you please look at this problem? I need it fixed ASAP.
Jr. Engineer, presented with a problem he's never seen before: No problem, I will look into it!
Someone higher up (the next day): Did you fix the problem?
Jr. Engineer: Sorry, I still haven't gotten around to looking at it / I'm still working on it / etc.
Someone higher up: We really need it fixed today, please prioritize it and give me a call when it is fixed.
Jr. Engineer works on the problem all night, feeling stressed out, not wanting to let down his seniors.
The number of issues I've seen that turn out to be documented features... (or, more accurately, things just being configured incorrectly)
I feel like this is spectacularly bad advice. "Do not get fooled by shades of grey, things are meant to be either black or white!"
I feel like this could be used as one of those "How to 10x your career" articles - and be better than all of them.
Have a good logging & rollback strategy well communicated across stakeholders