“In the grand scheme of things, one week isn’t that long. But to us, it felt like forever. We are constantly iterating and release multiple changes every day.”
I assume they mean multiple production releases? Is this because the product lacks maturity or stability, or is it just your culture?
I am asking because I am trying to imagine the impact of this on existing customers. It sounds like an awful lot of churn.
This obviously happens a lot in the “you are the product” space like Facebook, Google, etc. But this looks to be a data analytics product with paid tiers. Curious what tooling and processes you have to support this, and how you keep customers happy with this model.
Actually, if you work with SMBs/enterprises, I agree with you on customer-facing changes. In my past life we would ship very frequently (often more than once a day) but always had to feature-flag changes that large clients might see or be affected by. Even something as simple as tweaking the layout of a core flow could cause support headaches and angry customers: customers worth tens of thousands of dollars per month. Is it worth losing a customer to CD a new button placement?
We keep customers happy because we push changes live incrementally, reduce our chances of major outages and improve our response time when they do occur.
For example, I cancelled my netflix subscription because they are unable to reliably operate microservices, and the UI was always in some semi-broken state. As a software engineer, this stressed me out during my relaxing TV time.
Even if continuous delivery is somehow reliably delivered, if the changes are customer-visible, then they break my muscle memory and increase my cognitive load: I have to re-learn the damned UI every fucking time I log in. If the changes are not customer-visible, then what business value do they deliver?
In that context I assume this means they make multiple production releases per day (which makes me shudder). I am curious how they do this while maintaining high quality and not driving customers insane.
Continuous deployment deploys code to production frequently, as soon as it's ready.
Continuous delivery has some ready-to-deliver branch that's constantly being updated as above, but changes aren't deployed to production until someone (Product Owner?) or something (Yay, end of sprint!) triggers it.
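The distinction can be sketched as a GitLab CI fragment (job name and deploy script are illustrative; `rules` and `when: manual` are real keywords):

```yaml
deploy_production:
  stage: deploy
  script: ./deploy.sh production   # hypothetical deploy script
  rules:
    # Continuous deployment: run automatically on every commit to main.
    - if: '$CI_COMMIT_BRANCH == "main"'
      # Continuous delivery instead: same job, kept ready but gated on a
      # human (or an end-of-sprint ritual) by adding:
      #   when: manual
```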
Different people may use the word release for at least this many things: 1) a deployment, 2) an unveiling via feature flags, 3) a public announcement despite the code already having been live.
Our context is that of a startup that is constantly validating things. Also, in our context a release does not necessarily mean releasing to users; sometimes stuff is behind feature flags or only out for beta testing.
Continuous deployment has been around long enough that even IBM (remember never getting fired for buying IBM?) talks about it.
https://www.ibm.com/topics/continuous-deployment
"Dark deploys" and "feature flags" are often used to keep customers safe from incomplete features while still giving all of the advantages of CD plus allowing testing in production.
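The pattern is simple in code. A minimal sketch in Go, assuming a hypothetical in-memory flag store standing in for a real service (Flagship, LaunchDarkly, etc.); the flag and function names are made up:

```go
package main

import "fmt"

// flags stands in for a real feature-flag backend; in a dark deploy the
// new code path ships to production with its flag switched off.
var flags = map[string]bool{
	"new-checkout-flow": false,
}

func enabled(name string) bool { return flags[name] }

func checkout() string {
	if enabled("new-checkout-flow") {
		return "new checkout" // dark-deployed path, live but dormant
	}
	return "old checkout"
}

func main() {
	fmt.Println(checkout()) // stays on the old path until the flag flips
}
```

Flipping the flag for internal accounts only is what enables "testing in production" without exposing customers to the unfinished feature.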
I'd never heard of Flagship, but this is a nice writeup on that (kudos, Flagship.io):
One of the more fundamental things actually pushing towards faster releases is what I call the relativistic deployment speed. We have products that need at least 2 months to get a remotely deployable version ready. The average fast hotfix usually takes more like 4 months until an installation on a prod system can actually start. Our fastest products can go from code to prod in about 15 minutes, with the automated tests being the bottleneck.
This in turn shapes choices for the product managers, but also for security. If something like Log4shell hit these slow products, I'd have to plan to be vulnerable for two months at least, and usually more like 4 - 8 months depending on the customers. I have no choice, because that's their light speed of deployment. No code goes to prod faster than two months latency. That, quite frankly, fucking sucks.
Other products were much better in that situation. We were lucky to have the right devs around, and we went from the decision to emergency-patch Log4shell at a decidedly risky speed to the first of many Log4shell patches in prod within 30 minutes.
However, that's not the normal speed, and that's when you get into the second decision area. Given a lightspeed of deployment, how fast do you want to go?
Some of our possibly faster-moving products are B2B products, with a lot of internal training for support and consulting going into a release, and training also happening at larger customers. This means product chooses to release bigger and heavily customer-visible changes only every 6 weeks. They could go a lot faster, but they choose to slow down because it fits their customers well. And, for example, December is usually frozen entirely because customers want it that way.
But then there is the third decision area. What happens if there is an entirely customer-invisible change, such as an optimization in database handling, some internal metric generation, or an internal change to prepare a new feature for the next scheduled rollout? And we have the tested, vetted and working option to just push that into prod without downtime, which also gives us the opportunity to build experience with, and confidence in, our no-downtime deployment system. I don't see a reason why I wouldn't exercise this at least once daily.
I'm using gitlab-ci with its docker executor, and overall I'm very happy with it.
I use it on some rather beefy machines, but most of the CI time is not spent compiling, it is spent instead on setting up the environment.
Are there any tips/tricks to speed up this startup time? I know stuff like ensuring that artifacts are not passed in if not needed can help a lot, but it seems that most of the execution time is simply spent waiting for docker to spin up a container.
Unfortunately, doing this in most CI services is actually quite difficult. It usually means a complex graph of execution, complex cache usage, and being careful to not re-generate artifacts you don't need to.
In my experience, building this, at a level of reliability necessary for a team of more than a few devs, is hard. Jenkins can do it reliably, but doing it fast is hard because the caching primitives are poor. Circle and Gitlab can do it quickly, but the execution dependency primitives aren't great and the caches can be unreliable. Circle also has _terrible_ network speeds, so doing too much caching slows down builds. GitHub Actions is pretty good for all of this, but it's still a ton of work.
The best answer is to use a build system or CI system that is modelled in a better way. Things like Bazel essentially manage this graph for you in a very smart way, but they only really work when you have a CI system designed to run Bazel jobs, and there aren't many of these that I've seen. It's a huge paradigm shift, and requires quite a lot of dev work to make happen.
At the very least, see if you can keep heavy dependencies on the local network rather than depending on the internet.
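One concrete way to do that for Docker itself is a pull-through registry mirror on the runner hosts, so base images come off the LAN instead of Docker Hub. A sketch of the daemon config (`registry-mirrors` is a real `daemon.json` key; the mirror URL is illustrative):

```json
{
  "registry-mirrors": ["http://registry-mirror.internal:5000"]
}
```

This goes in `/etc/docker/daemon.json` on each runner host and requires a daemon restart to take effect.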
## 1. Do more in the job's script
If you have multiple jobs that use (or could use) the same image, perhaps those jobs can be combined. It's definitely a tradeoff, and it depends on what you want from your pipeline. For example, normally you may have separate `build` and `test` jobs, but if they take, say (30s init + 5s work) + (30s init + 10s work), then combining them into a single job taking (30s init + 15 s work) _might_ be an acceptable trade-off. (These numbers are small enough that it probably isn't, but you get the idea.)
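Sketched as a GitLab CI job (names, image, and commands are illustrative), the point is paying the container start-up cost once:

```yaml
build_and_test:
  image: node:20
  script:
    - npm ci          # environment setup, paid once instead of twice
    - npm run build   # formerly the separate `build` job
    - npm test        # formerly the separate `test` job
```

The tradeoff is coarser pipeline feedback: a test failure and a build failure now show up as the same red job.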
## 2. Pre-build the job's image
If your job's script uses an off-the-shelf image, and has a lot of setup, consider building an image that already has that done, and using that as your job's image instead. For example, you might be using a `node` image, but your build requires pulling translations from a file in an S3 bucket, and so you need to install the AWS CLI to grab the translation file. Rather than including the installation of the AWS CLI in the script, build it into the image ahead of time.
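For that example, the pre-baked image could be a sketch like this (base image and package choice are assumptions; adjust to however you normally install the AWS CLI):

```dockerfile
# Hypothetical CI image: node base with the AWS CLI baked in, so jobs
# skip the per-run install step entirely.
FROM node:20
RUN apt-get update \
 && apt-get install -y --no-install-recommends awscli \
 && rm -rf /var/lib/apt/lists/*
```

Push it to your registry and point the job's `image:` at it instead of plain `node`.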
This is a good idea and something I will seriously consider
I'm already doing #2, but I'm glad to see others come to the same conclusion as me. :D
Crucially: Make sure that the large layers say they are "cached" when you rebuild the container. Docker goes out of its way to make this difficult in CI environments. The fact that it works on your laptop doesn't mean that it will be able to cache the big layers in CI.
Once you've done that, make sure that the CI machines are actually pulling the big layers from their local docker cache.
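A common way to get cache hits on fresh CI runners, which start with an empty local layer cache, is to seed the build from the last pushed image. A sketch as a GitLab CI job (`$CI_REGISTRY_IMAGE` is a real predefined variable; the rest is illustrative):

```yaml
build_image:
  script:
    # Seed the layer cache from the previous build; tolerate a missing image.
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    # BUILDKIT_INLINE_CACHE=1 embeds cache metadata so future --cache-from works.
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" --build-arg BUILDKIT_INLINE_CACHE=1 -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
```

Watch the build output: the big layers should report `CACHED`, not rebuild.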
30-90 seconds to pull docker images for each run of a golang project's CI environment is too high. You might look into using "go mod vendor" to download the dependencies early in the docker build, then using a symlink and "--mod=vendor" to tell the tests to use an out-of-tree vendor directory. (I haven't tested this; presumably go will follow symlinks...)
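A related, widely used variant (not the commenter's untested symlink approach) is to download modules in an early Docker layer, keyed only on `go.mod`/`go.sum`, so dependency fetching is cached across builds. A sketch:

```dockerfile
FROM golang:1.22
WORKDIR /src
# Copy only the module files first: this layer is reused until they change.
COPY go.mod go.sum ./
RUN go mod download
# Source changes invalidate only the layers below.
COPY . .
RUN go vet ./... && go test ./...
```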
- Much faster network speeds.
- We no longer run on the docker executor. Instead we run on Ubuntu; these boot in a second or two pretty consistently.
- The bulk of our test suite was able to be pulled out of docker entirely (a lot of Jest and PHPUnit tests).
- We have a bigger suite of E2E PHPUnit tests that we spin up a whole docker compose stack for. These are slower but still manageable.
Parallelism is key in all of this too. Our backend test suite has a full execution time of something like 250 minutes, but we just split it over a bunch of small workers and the whole thing completes in about 8 minutes.
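In GitLab CI that fan-out can be expressed with the `parallel` keyword; the runner gets `CI_NODE_INDEX`/`CI_NODE_TOTAL` to pick its slice (both are real predefined variables; the test-runner script is hypothetical):

```yaml
backend_tests:
  parallel: 32
  script:
    # Each of the 32 copies runs roughly 1/32 of the suite; the slicing
    # logic lives in your own runner script.
    - ./run-tests.sh --slice "$CI_NODE_INDEX" --of "$CI_NODE_TOTAL"
```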
For me the controlling factor with build time and to a lesser extent production performance is to divorce visibility from vigilance. You can’t watch things 24/7 waiting to pounce on any little size or time regressions. You need to be able to audit periodically and narrow the problem to a commit or at least an hour in a day when the problem happened. Otherwise nobody will be bothered to look and it’s just a tragedy of the commons.
Graphs work well. Build time, test count, slow test count, artifact sizes, and so on.
I just had some success running Android builds on a self-hosted GitHub runner. One of the big setup stages was having sdkmanager pull down large dependencies (SDK, emulator images, etc.) on startup.
Forcing sdkmanager into http_only mode and pointing it at a properly-configured squid took a large percentage off the build time.
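For reference, the invocation looks roughly like this (`--no_https`, `--proxy`, `--proxy_host`, and `--proxy_port` are real sdkmanager flags; the squid host/port and package names are examples):

```shell
# HTTP-only so squid can actually cache the large downloads.
sdkmanager --no_https --proxy=http \
  --proxy_host=squid.internal --proxy_port=3128 \
  "platform-tools" "emulator"
```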
Similar story for the gradle build, where running a remote gradle cache node locally to the job means gradle steps get automatically cached without any magic CI pipeline steps.
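Wiring that up is a few lines of Gradle config; a sketch of a `settings.gradle` fragment, assuming an HTTP cache node reachable on the CI network (the URL is illustrative, `HttpBuildCache` is Gradle's real remote-cache type):

```groovy
buildCache {
    remote(HttpBuildCache) {
        url = 'http://gradle-cache.internal:5071/cache/'
        push = true  // CI populates the cache; dev machines can set false
    }
}
```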
I’ll be baking some images with dependencies included, so the only stuff in the updated Dockerfile will be pulling the pre-baked images from our registry and commands to build and run our app code.
The setup time is fairly constant even for very quick jobs.
For longer jobs where it takes less of a percentage of the total time it's not a bother, like when we run integration tests for a few minutes.
No optimisation to the base OS other than mounting /var/lib/docker on a RAID0 array with noatime on the volume and CPU mitigations disabled on the host.
Compilation is mostly go binaries (with the normal stuff like go vet/go test).
Rarely it will do other things like commit-lint (javascript) or KICS/SNYK scanning.
the machines themselves are Dual EPYC 7313 w/ 256G DDR4.
> Our CI process was pretty standard: Every commit in an MR triggered a GitLab Pipeline, which consisted of several jobs.
me: nodding silently
> Those jobs would run in an auto-scaling Kubernetes cluster with up to 21 nodes
me: what the actual deuce?
Is this really "pretty standard"?
Gitlab makes it pretty easy to just toss a ci runner process on a vm or a physical box. You can get real far with a couple rack servers and some xeons for < $1000. You do have to over provision if your work load is not very consistent ( and of course pay for the power and rack space, and someone to mind them from time to time).
IMO it’s the optimal use case for K8s
Also, the duty cycle on the 21 nodes needs to be low enough to justify the complexity over just buying 21 computers (or getting annual pricing on 21 VMs). You could use spot instances for the EKS nodes, but then PRs will randomly fail because their instances disappear. That wastes developer salary money and productivity.
Assuming you have a ventilated room you don't care about, you could run 21 desktop towers off of ~ two-four 120V circuits. (Or buy a rack and pay ~ 2x as much for the hardware.) 21 build hosts would cost ~$21-42K. Power is probably averaging 50W per machine (they are probably mostly idle even when running tests, since they have to download stuff.) That's about 720KWh per month. US average electrical pricing is $0.20 / kWh; punitive California rates are about $0.40. So, in the punitive case, that's $288 / month.
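As a quick sanity check of the power arithmetic (assuming the ~50W average draw above and a ~730-hour month), which lands in the same ballpark as the ~720 kWh figure:

```go
package main

import "fmt"

func main() {
	const (
		machines      = 21.0
		wattsPerBox   = 50.0  // assumed average draw, per the estimate above
		hoursPerMonth = 730.0 // 365 * 24 / 12
	)
	kwh := machines * wattsPerBox * hoursPerMonth / 1000
	fmt.Printf("%.0f kWh/month, about $%.0f at $0.40/kWh\n", kwh, kwh*0.40)
}
```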
Running 21 machines probably requires as much annoying maintenance work as EKS, though the maintenance includes swapping bad hardware, fiddling with ethernet cables, and wearing ear protection (if a rack is involved) instead of debugging piles of yaml and AWS roles, optimizing to stay in budget, etc, etc.
Someone's got a new project for Q2 if they aren't doing this already - it's a pretty easy sell if you calculate out the time savings for developers during busy time of day + savings on spinning down compute resources in the middle of the night/weekends, and being able to put "I saved the company $X in idle compute and saved developers Y hours per day" on your yearly performance review looks pretty good.
I have started tinkering with Fastbuild, and preliminary testing makes it seem either too good to be true or the best thing since sliced bread. I'm sure there are drawbacks somewhere, but it's really fast.
Then again, a big chunk of our pipelines is not actually the compilation, but stuff like downloading NuGet packages, uploading artifacts and so on, all of which are. very. very. slow.
I would now change "pretty standard" to "we didn't reinvent the wheel" xD :pray:; in the end I meant that we use existing tools and "just" put them together.
Even if you are using SaaS GitLab, there are still good reasons to have custom runners, and kube is one option for running them.
This is interesting, and is something I've also suspected on many CI systems that offer free public runners (CircleCI, GitHub Actions, etc.).
For seemingly no reason at all, tests were very flaky and unstable in CI, which couldn't be reproduced on local machines. I tried everything from resource-limited containers, to identically spec'd VMs, and never was able to reproduce certain failures. This made issues very hard to troubleshoot and fix.
Of course, you might say that this unstable environment surfaced race conditions in our tests or product, and that's true, but it's incredibly frustrating to have random failures that are impossible to reproduce locally, and having to wait for the long experiment-push-wait for CI development loop.
I suspect this is caused by over provisioning of the underlying hardware, where many VMs are competing for the same resources. This seems quite frequent on Azure (GH Actions).
In the article's case they patched it by making their environment more stable, which is a solution we can't do on public runners, but I'd caution them that they're only patching the issue, and not really fixing the root cause. The flakiness still exists in their code, and is just not visible when the system is not under stress, but will surface again when you least want it to, possibly in production.
That was one of the reasons we ended up setting up our own runners. Didn't mention in the post but we use spot VM instances.
I guess you mean some kind of trunk-based development? But still, some sort of CI happens, maybe locally.
I've never worked in any way other than using a local/remote CI pipeline; that's why I'm curious.