“In the grand scheme of things, one week isn’t that long. But to us, it felt like forever. We are constantly iterating and release multiple changes every day.”
I assume they mean multiple production releases? Is this because the product lacks maturity or stability, or is it just your culture?
I am asking because I am trying to imagine the impact of this on existing customers. It sounds like an awful lot of churn.
This obviously happens a lot in the “you are the product” space like Facebook, Google, etc. But this looks to be a data analytics product with paid tiers. Curious what tooling and processes you have to support this, and how you keep customers happy with this model.
Actually, if you work with SMBs/enterprises, I agree with you on customer-facing changes. In my past life we would ship very frequently (often more than once a day) but always had to feature-flag changes that large clients might see or be affected by. Even something as simple as tweaking the layout of a core flow could cause support headaches and angry customers: customers worth tens of thousands of dollars per month. Is it worth losing a customer to CD a new button placement?
We keep customers happy because we push changes live incrementally, reduce our chances of major outages and improve our response time when they do occur.
For example, I cancelled my netflix subscription because they are unable to reliably operate microservices, and the UI was always in some semi-broken state. As a software engineer, this stressed me out during my relaxing TV time.
Even if continuous delivery is somehow reliably delivered, if the changes are customer-visible, then they break my muscle memory and increase my cognitive load: I have to re-learn the damned UI every fucking time I log in. If the changes are not customer-visible, then what business value do they deliver?
In that context I assume this means they make multiple production releases per day (which makes me shudder). I am curious how they do this while maintaining high quality and not driving customers insane.
Continuous deployment deploys code to production frequently, as soon as it's ready.
Continuous delivery has some ready-to-deliver branch that's constantly being updated as above, but changes aren't deployed to production until someone (Product Owner?) or something (Yay, end of sprint!) triggers it.
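The distinction can be sketched as a GitLab CI fragment (job name and deploy script are illustrative; `rules` and `when: manual` are real keywords):

```yaml
deploy_production:
  stage: deploy
  script: ./deploy.sh production   # hypothetical deploy script
  rules:
    # Continuous deployment: run automatically on every commit to main.
    - if: '$CI_COMMIT_BRANCH == "main"'
      # Continuous delivery instead: same job, kept ready but gated on a
      # human (or an end-of-sprint ritual) by adding:
      #   when: manual
```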
Different people may use the word release for at least this many things: 1) a deployment, 2) an unveiling via feature flags, 3) a public announcement despite the code already having been live.
Our context is that of a startup that is constantly validating things. Also, in our context a release does not necessarily mean releasing to users; sometimes stuff is behind feature flags or only out for beta testing.
Continuous deployment has been around long enough that even IBM (remember never getting fired for buying IBM?) talks about it.
https://www.ibm.com/topics/continuous-deployment
"Dark deploys" and "feature flags" are often used to keep customers safe from incomplete features while still giving all of the advantages of CD plus allowing testing in production.
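The pattern is simple in code. A minimal sketch in Go, assuming a hypothetical in-memory flag store standing in for a real service (Flagship, LaunchDarkly, etc.); the flag and function names are made up:

```go
package main

import "fmt"

// flags stands in for a real feature-flag backend; in a dark deploy the
// new code path ships to production with its flag switched off.
var flags = map[string]bool{
	"new-checkout-flow": false,
}

func enabled(name string) bool { return flags[name] }

func checkout() string {
	if enabled("new-checkout-flow") {
		return "new checkout" // dark-deployed path, live but dormant
	}
	return "old checkout"
}

func main() {
	fmt.Println(checkout()) // stays on the old path until the flag flips
}
```

Flipping the flag for internal accounts only is what enables "testing in production" without exposing customers to the unfinished feature.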
I'd never heard of Flagship, but this is a nice writeup on that (kudos, Flagship.io):
One of the more fundamental things actually pushing towards faster releases is what I call the relativistic deployment speed. We have products that need at least 2 months to get a remotely deployable version ready. The average fast hotfix usually takes more like 4 months until an installation on a prod system can actually start. Our fastest products can go from code to prod in about 15 minutes, with the automated tests being the bottleneck.
This in turn shapes choices for the product managers, but also for security. If something like Log4shell hit these slow products, I'd have to plan to be vulnerable for two months at least, and usually more like 4 - 8 months depending on the customers. I have no choice, because that's their light speed of deployment. No code goes to prod faster than two months latency. That, quite frankly, fucking sucks.
Other products were much better in that situation. We were lucky to have the right devs around, and we went from the decision to emergency-patch Log4shell at a decidedly risky speed to the first of many Log4shell patches in prod within 30 minutes.
However, that's not the normal speed, and that's when you get into the second decision area. Given a lightspeed of deployment, how fast do you want to go?
Some of our possibly faster-moving products are B2B products, with a lot of internal training for support and consulting going into a release, and training also happening at larger customers. This means product chooses to release bigger and heavily customer-visible changes only every 6 weeks. They could go a lot faster, but they choose to slow down because it fits their customers well. And, for example, December is usually frozen entirely because customers want it that way.
But then there is the third decision area. What happens if there is an entirely customer-invisible change, such as an optimization in database handling, some internal metric generation, or an internal change to prepare a new feature for the next scheduled rollout? And we have the tested, vetted and working option to just push that into prod without downtime, which also gives us the opportunity to build experience with, and confidence in, our no-downtime deployment system. I don't see a reason why I wouldn't exercise this at least once daily.
I'm using gitlab-ci with its docker executor, and overall I'm very happy with it.
I use it on some rather beefy machines, but most of the CI time is not spent compiling, it is spent instead on setting up the environment.
Are there any tips/tricks to speed up this startup time? I know stuff like ensuring that artifacts are not passed in if not needed can help a lot, but it seems that most of the execution time is simply spent waiting for docker to spin up a container.
Unfortunately, doing this in most CI services is actually quite difficult. It usually means a complex graph of execution, complex cache usage, and being careful to not re-generate artifacts you don't need to.
In my experience, building this, at a level of reliability necessary for a team of more than a few devs, is hard. Jenkins can do it reliably, but doing it fast is hard because the caching primitives are poor. Circle and Gitlab can do it quickly, but the execution dependency primitives aren't great and the caches can be unreliable. Circle also has _terrible_ network speeds, so doing too much caching slows down builds. GitHub Actions is pretty good for all of this, but it's still a ton of work.
The best answer is to use a build system or CI system that is modelled in a better way. Things like Bazel essentially manage this graph for you in a very smart way, but they only really work when you have a CI system designed to run Bazel jobs, and there aren't many of these that I've seen. It's a huge paradigm shift, and requires quite a lot of dev work to make happen.
At the very least, see if you can keep heavy dependencies on the local network rather than depending on the internet.
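One concrete way to do that for Docker itself is a pull-through registry mirror on the runner hosts, so base images come off the LAN instead of Docker Hub. A sketch of the daemon config (`registry-mirrors` is a real `daemon.json` key; the mirror URL is illustrative):

```json
{
  "registry-mirrors": ["http://registry-mirror.internal:5000"]
}
```

This goes in `/etc/docker/daemon.json` on each runner host and requires a daemon restart to take effect.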
## 1. Do more in the job's script
If you have multiple jobs that use (or could use) the same image, perhaps those jobs can be combined. It's definitely a tradeoff, and it depends on what you want from your pipeline. For example, normally you may have separate `build` and `test` jobs, but if they take, say (30s init + 5s work) + (30s init + 10s work), then combining them into a single job taking (30s init + 15 s work) _might_ be an acceptable trade-off. (These numbers are small enough that it probably isn't, but you get the idea.)
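Sketched as a GitLab CI job (names, image, and commands are illustrative), the point is paying the container start-up cost once:

```yaml
build_and_test:
  image: node:20
  script:
    - npm ci          # environment setup, paid once instead of twice
    - npm run build   # formerly the separate `build` job
    - npm test        # formerly the separate `test` job
```

The tradeoff is coarser pipeline feedback: a test failure and a build failure now show up as the same red job.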
## 2. Pre-build the job's image
If your job's script uses an off-the-shelf image, and has a lot of setup, consider building an image that already has that done, and using that as your job's image instead. For example, you might be using a `node` image, but your build requires pulling translations from a file in an S3 bucket, and so you need to install the AWS CLI to grab the translation file. Rather than including the installation of the AWS CLI in the script, build it into the image ahead of time.
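For that example, the pre-baked image could be a sketch like this (base image and package choice are assumptions; adjust to however you normally install the AWS CLI):

```dockerfile
# Hypothetical CI image: node base with the AWS CLI baked in, so jobs
# skip the per-run install step entirely.
FROM node:20
RUN apt-get update \
 && apt-get install -y --no-install-recommends awscli \
 && rm -rf /var/lib/apt/lists/*
```

Push it to your registry and point the job's `image:` at it instead of plain `node`.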
This is a good idea and something I will seriously consider
I'm already doing #2, but I'm glad to see others come to the same conclusion as me. :D
Crucially: Make sure that the large layers say they are "cached" when you rebuild the container. Docker goes out of its way to make this difficult in CI environments. The fact that it works on your laptop doesn't mean that it will be able to cache the big layers in CI.
Once you've done that, make sure that the CI machines are actually pulling the big layers from their local docker cache.
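A common way to get cache hits on fresh CI runners, which start with an empty local layer cache, is to seed the build from the last pushed image. A sketch as a GitLab CI job (`$CI_REGISTRY_IMAGE` is a real predefined variable; the rest is illustrative):

```yaml
build_image:
  script:
    # Seed the layer cache from the previous build; tolerate a missing image.
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    # BUILDKIT_INLINE_CACHE=1 embeds cache metadata so future --cache-from works.
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" --build-arg BUILDKIT_INLINE_CACHE=1 -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
```

Watch the build output: the big layers should report `CACHED`, not rebuild.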
30-90 seconds to pull docker images for each run of a golang project's CI environment is too high. You might look into using "go mod vendor" to download the dependencies early in the docker build, then using a symlink and "--mod=vendor" to tell the tests to use an out-of-tree vendor directory. (I haven't tested this; presumably go will follow symlinks...)
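A related, widely used variant (not the commenter's untested symlink approach) is to download modules in an early Docker layer, keyed only on `go.mod`/`go.sum`, so dependency fetching is cached across builds. A sketch:

```dockerfile
FROM golang:1.22
WORKDIR /src
# Copy only the module files first: this layer is reused until they change.
COPY go.mod go.sum ./
RUN go mod download
# Source changes invalidate only the layers below.
COPY . .
RUN go vet ./... && go test ./...
```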
- Much faster network speeds.
- We no longer run on the docker executor. Instead we run on Ubuntu; these boot in a second or two pretty consistently.
- The bulk of our test suite was able to be pulled out of docker entirely (a lot of Jest and PHPUnit tests).
- We have a bigger suite of E2E PHPUnit tests that we spin up a whole docker compose stack for. These are slower but still manageable.
Parallelism is key in all of this too. Our backend test suite has a full execution time of something like 250 minutes, but we just split it over a bunch of small workers and the whole thing completes in about 8 minutes.
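In GitLab CI that fan-out can be expressed with the `parallel` keyword; the runner gets `CI_NODE_INDEX`/`CI_NODE_TOTAL` to pick its slice (both are real predefined variables; the test-runner script is hypothetical):

```yaml
backend_tests:
  parallel: 32
  script:
    # Each of the 32 copies runs roughly 1/32 of the suite; the slicing
    # logic lives in your own runner script.
    - ./run-tests.sh --slice "$CI_NODE_INDEX" --of "$CI_NODE_TOTAL"
```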
For me the controlling factor with build time and to a lesser extent production performance is to divorce visibility from vigilance. You can’t watch things 24/7 waiting to pounce on any little size or time regressions. You need to be able to audit periodically and narrow the problem to a commit or at least an hour in a day when the problem happened. Otherwise nobody will be bothered to look and it’s just a tragedy of the commons.
Graphs work well. Build time, test count, slow test count, artifact sizes, and so on.
I just had some success running Android builds on a self-hosted GitHub runner. One of the big setup stages was having sdkmanager pull down large dependencies (SDK, emulator images, etc.) on startup.
Forcing sdkmanager into http_only mode and pointing it at a properly-configured squid took a large percentage off the build time.
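For reference, the invocation looks roughly like this (`--no_https`, `--proxy`, `--proxy_host`, and `--proxy_port` are real sdkmanager flags; the squid host/port and package names are examples):

```shell
# HTTP-only so squid can actually cache the large downloads.
sdkmanager --no_https --proxy=http \
  --proxy_host=squid.internal --proxy_port=3128 \
  "platform-tools" "emulator"
```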
Similar story for the gradle build, where running a remote gradle cache node locally to the job means gradle steps get automatically cached without any magic CI pipeline steps.
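Wiring that up is a few lines of Gradle config; a sketch of a `settings.gradle` fragment, assuming an HTTP cache node reachable on the CI network (the URL is illustrative, `HttpBuildCache` is Gradle's real remote-cache type):

```groovy
buildCache {
    remote(HttpBuildCache) {
        url = 'http://gradle-cache.internal:5071/cache/'
        push = true  // CI populates the cache; dev machines can set false
    }
}
```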
I’ll be baking some images with dependencies included, so the only stuff in the updated Dockerfile will be pulling the pre-baked images from our registry and commands to build and run our app code.
The setup time is fairly constant even for very quick jobs.
For longer jobs where it takes less of a percentage of the total time it's not a bother, like when we run integration tests for a few minutes.
No optimisation to the base OS other than mounting /var/lib/docker on a RAID0 array with noatime on the volume and CPU mitigations disabled on the host.
Compilation is mostly go binaries (with the normal stuff like go vet/go test).
Rarely it will do other things like commit-lint (javascript) or KICS/SNYK scanning.
the machines themselves are Dual EPYC 7313 w/ 256G DDR4.
> Our CI process was pretty standard: Every commit in an MR triggered a GitLab Pipeline, which consisted of several jobs.
me: nodding silently
> Those jobs would run in an auto-scaling Kubernetes cluster with up to 21 nodes
me: what the actual deuce?
Is this really "pretty standard"?
Gitlab makes it pretty easy to just toss a ci runner process on a vm or a physical box. You can get real far with a couple rack servers and some xeons for < $1000. You do have to over provision if your work load is not very consistent ( and of course pay for the power and rack space, and someone to mind them from time to time).
IMO it’s the optimal use case for K8s
Also, the duty cycle on the 21 nodes needs to be low enough to justify the complexity over just buying 21 computers (or getting annual pricing on 21 VMs). You could use spot instances for the EKS nodes, but then PRs will randomly fail because their instances disappear. That wastes developer salary money and productivity.
Assuming you have a ventilated room you don't care about, you could run 21 desktop towers off of ~ two-four 120V circuits. (Or buy a rack and pay ~ 2x as much for the hardware.) 21 build hosts would cost ~$21-42K. Power is probably averaging 50W per machine (they are probably mostly idle even when running tests, since they have to download stuff.) That's about 720KWh per month. US average electrical pricing is $0.20 / kWh; punitive California rates are about $0.40. So, in the punitive case, that's $288 / month.
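As a quick sanity check of the power arithmetic (assuming the ~50W average draw above and a ~730-hour month), which lands in the same ballpark as the ~720 kWh figure:

```go
package main

import "fmt"

func main() {
	const (
		machines      = 21.0
		wattsPerBox   = 50.0  // assumed average draw, per the estimate above
		hoursPerMonth = 730.0 // 365 * 24 / 12
	)
	kwh := machines * wattsPerBox * hoursPerMonth / 1000
	fmt.Printf("%.0f kWh/month, about $%.0f at $0.40/kWh\n", kwh, kwh*0.40)
}
```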
Running 21 machines probably requires as much annoying maintenance work as EKS, though the maintenance includes swapping bad hardware, fiddling with ethernet cables, and wearing ear protection (if a rack is involved) instead of debugging piles of yaml and AWS roles, optimizing to stay in budget, etc, etc.
Someone's got a new project for Q2 if they aren't doing this already - it's a pretty easy sell if you calculate out the time savings for developers during busy time of day + savings on spinning down compute resources in the middle of the night/weekends, and being able to put "I saved the company $X in idle compute and saved developers Y hours per day" on your yearly performance review looks pretty good.
I have started tinkering with Fastbuild, and preliminary testing makes it seem either too good to be true or the best thing since sliced bread. I'm sure there are drawbacks somewhere, but it's really fast.
Then again, a big chunk of our pipelines is not actually the compilation, but stuff like downloading NuGet packages, uploading artifacts and so on, all of which are. very. very. slow.
I would now change "pretty standard" to "we didn't reinvent the wheel" xD :pray:; in the end I meant that we use existing tools and "just" put them together.
Even if you are using SaaS GitLab, there are still good reasons to have custom runners, and kube is one option for running them.
This is interesting, and is something I've also suspected on many CI systems that offer free public runners (CircleCI, GitHub Actions, etc.).
For seemingly no reason at all, tests were very flaky and unstable in CI, which couldn't be reproduced on local machines. I tried everything from resource-limited containers, to identically spec'd VMs, and never was able to reproduce certain failures. This made issues very hard to troubleshoot and fix.
Of course, you might say that this unstable environment surfaced race conditions in our tests or product, and that's true, but it's incredibly frustrating to have random failures that are impossible to reproduce locally, and having to wait for the long experiment-push-wait for CI development loop.
I suspect this is caused by over provisioning of the underlying hardware, where many VMs are competing for the same resources. This seems quite frequent on Azure (GH Actions).
In the article's case they patched it by making their environment more stable, which is a solution we can't do on public runners, but I'd caution them that they're only patching the issue, and not really fixing the root cause. The flakiness still exists in their code, and is just not visible when the system is not under stress, but will surface again when you least want it to, possibly in production.
That was one of the reasons we ended up setting up our own runners. Didn't mention in the post but we use spot VM instances.
I guess you mean some kind of trunk-based development? But still, some sort of CI happens, maybe locally.
I've never worked in any way other than using a local/remote CI pipeline; that's why I'm curious.