If you really need consistency for the environment - Let them own the machine, and then give them a stable base VM image, and pay for decent virtualization tooling that they run... on their own machine.
I have seen several attempts to move dev environments to a remote host. They invariably suck.
Yes - that means you need to pay for decent hardware for your devs, it's usually cheaper than remote resources (for a lot of reasons).
Yes - that means you need to support running your stack locally. This is a good constraint (and a place where containers are your friend for consistency).
Yes - that means you need data generation tooling to populate a local env. This can be automated relatively well, and it's something you need with a remote env anyways.
---
The only real downside is data control (i.e. the company has less control over how a developer manages assets like source code). In my experience, the vast majority of companies should worry less about this - your value as a company isn't your source code in 99.5% of cases, it's the team that executes that source code in production.
If you're in the 0.5% of other cases... you know it and you should be in an air-gapped closed room anyways (and I've worked in those too...)
The developers also lack knowledge about the environment; can't evolve the environment; can't test the environment for bugs; and invariably interfere with each other because it's never isolated well. And also, yes, it adds lag.
Anyway, yes, working locally on fake data that bears little resemblance to production still beats remote environments.
Trying to boot the full service on a single machine required every single developer in the company to install ~50ish microservices on their machine for things to work correctly. It became totally intractable.
I guess one can grumble about bad architecture all day, but this had to be solved. We had to move to remote development environments, which restored everyone’s sanity.
Both FAANG companies I’ve worked at had remote dev environments that were built in house.
This is certainly one of the critical mistakes you made.
No developer needs to launch half of the company's services to work on a local deployment. That's crazy, and awfully short-sighted.
The only services a developer ever needs to launch locally are the ones that are being changed. Anything else they can consume straight out of a non-prod development environment. That's what non-prod environments are for. You launch your local service locally, you consume whatever you need to consume straight from a cloud environment, you test the contract with a local test set, and you deploy the service. That's it.
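A minimal sketch of that workflow (the launcher and variable names here are made up; the point is just "local process, remote dependencies"):

    # point the one service you're changing at the shared non-prod environment
    export USERS_API_URL="https://users.dev.example.internal"
    export DATABASE_URL="postgres://app:app@dev-db.example.internal:5432/app"

    # launch only that service locally
    ./run-my-service --port 8080

Everything else stays deployed in the cloud; you just consume it.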
> I guess one can grumble about bad architecture all day but this had to be solved.
Yes, it needs to be solved. You need to launch your service locally while consuming dependencies deployed to any cloud environment. That's not a company problem. That's a problem plaguing that particular service, and one which is trivial to solve.
> Both FAANG companies I’ve worked at had remote dev environments that were built in house.
All FAANG companies I personally know did indeed have remote dev environments. They also had their own custom tool sets to deploy services locally, either in isolation or consuming dependencies deployed to the cloud.
This is not a FAANG cargo cult problem. This is a problem you created for yourself out of short-sightedness and from thinking you're too smart for your own good. Newbies know very well they need to launch one service instance alone, because that's what they are changing. Veterans know it all too well. Why on earth would anyone believe it's reasonable to launch 50 services to do anything at all? Just launch the one service you're working on. That's it. If you believe something prevents you from doing that, that's the problem you need to fix. Simple. Crazy.
Ouch. Were they using macOS at the time, with laptops that didn't have enough RAM?
I've seen that go poorly on macOS with Java-based microservices, largely due to Java VMs wanting RAM pre-assigned for each, which really chews through RAM that mostly sits around unused.
This was a few years ago though, at the tail end of Intel-based Macs, when 32 GB of RAM in a Mac laptop wasn't really an option.
This is certainly not universal among FAANGs though.
Requiring 50 services to be up is absolutely nuts, but it’s actually pretty trivial using something like Nomad locally.
You would list the services you need (or service groups) in a config file, start a command, and all services would start in containers. Sure, you need a lot of RAM with that, but on 32 GB it was working fine.
I have worked on building dev VMs for other developers who rely on a local IDE. The main sticking point is syncing and schlepping source code (something my setup avoids because the source code and editor are on the remote machine). I have tried a number of approaches, and I sympathize with the article author. So, in response to "Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing." <-- syncing and schlepping source code is by no means a solved problem.
I can also say that, my spacemacs config is very vanilla. Like my phone, I don't want to be messing with it when I want to code. Writing tooling for my editor environment is a sideshow for the work I am trying to finish.
It doesn't have to be like that. I've worked on a 10MLOC codebase with 500+ committers - all perfectly runnable locally, on admittedly slightly beefy dev machines. It's true that systems will grow without limit unless some force exists to counter this, but keeping your stack something you can sanely run on a development machine is well worth spending some actual effort on.
So the solution here is to not have that kind of "stack".
I mean, if it's all so big and complex that it can't be run on a laptop then you almost certainly got a lot of problems regardless. What typically happens is tons of interconnected services without clear abstractions or interfaces, and no one really understands this spaghetti mess, and people just keep piling crap on top of it.
This leads to all sorts of problems. Everywhere I've seen this happen they had real problems running stuff in production too, because it was a complex spaghetti mess. The abstracted "easy" dev-env (in whatever form that came) is then also incredibly complex, finicky, and brittle. Never mind running tests, which is typically even worse. It's not uncommon for it all to be broken for every other new person who joins, because changes somewhere broke the setup steps which are only run for new people. Everyone else is afraid to do anything with their machine "because it now works".
There are some exceptions where you really need a big beefy machine for a dev env and tests, maybe, but they're few and far between.
Sounds like you have a different problem.
The CPU required to run your stack should be minimal if it's a single user accessing it for local testing; idle threads don't consume oodles of CPU cycles doing nothing.
Memory use may be significant even in that case (depending on your stack) but let's be realistic. If your stack is so large that it alone requires more memory than a dev machine can spare with an IDE open, the cost of providing developers with capable workstations will pale in comparison to the cost of running the prod environment.
I have a client whose prod environment is 2x load balancer; 2x app server; 3x DB cluster node - all rented virtual machines. We just upgraded to higher spec machines to give headroom over the next couple of years (ie most machines doubled the RAM from the previous generation).
My old workstation bought in 2018 had enough memory that it could virtualise the current prod environment with the same amounts of RAM as prod, and still have 20GB free. My current workstation would have 80+ GB free.
In 95% of cases if you can't run the stack for a single user testing it, on a single physical machine, you're doing something drastically wrong somewhere.
Isn't this problem solved by CICD? When the developer is ready to test, they make a commit, and the pipeline deploys the code to a dev/test environment. That's how my teams have been doing it.
How tightly coupled are these systems?
https://github.com/89luca89/distrobox
It is sorta like Vagrant, but instead of using VirtualBox virtual machines you use podman containers. This way you get to use OCI images for your "dev environment", which integrates directly into your desktop.
There are some challenges related to usermode networking for non-root-managed containers, and desktop integration has some additional complications. But besides that it has almost no overhead, and you can have unfettered access to things like GPUs.
Also it is usually pretty easy to convert your normal docker or kubernetes containers over to something you can run on your desktop.
Also it is possible to use things like Kubernetes pod definitions to deploy sets of containers with podman and manage them with systemd and such. So you can have "clouds of containers" that your dev container needs access to, locally.
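For a flavor of the workflow (image and file names are illustrative):

    # create a dev container that integrates with your home directory and desktop
    distrobox create --name dev --image registry.fedoraproject.org/fedora:40
    distrobox enter dev

    # spin up supporting services from an ordinary Kubernetes pod definition
    podman kube play dev-services.yaml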
If there is a corporate need for window-specific applications then running Windows VMs or doing remote applications over RDP is a possible work around.
If everything you are targeting as a deployment is going to be Linux-everything then it doesn't make a lot of sense to jump through a bunch of hoops and cause a bunch of headaches just to avoid having it as workstation OS.
You'll run into occasional issues (e.g. if everyone is trying to run default node.js on default port) but with some basic guardrails it feels like it should be OK?
I'm remembering back to when my old company ran a lot of PHP projects. Each user just had their own development environment and their own Apache vhost. They wrote their code and tested it in their own vhost. Then we'd merge to a single separate vhost for further testing.
I am trying to remember anything about what was painful about it but it all basically Just Worked. Everyone had remote access via VPN; the worst case scenario for them was they'd have to work from home with a bit of extra latency.
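The per-developer setup really can be that small. Roughly (names are illustrative, Debian-style Apache layout assumed):

    cat > /etc/apache2/sites-available/alice.conf <<'EOF'
    <VirtualHost *:80>
        ServerName alice.dev.example.com
        DocumentRoot /home/alice/www
    </VirtualHost>
    EOF
    a2ensite alice && systemctl reload apache2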
Distrobox and podman are such a charm to use, and so easily integrated into dev environments and production environments.
The intentional daemon-free concept is so much easier to set up in practice, as there's no fiddly group management necessary anymore.
Just a 5 line systemd service file and that's it. Easy as pie.
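Something in this spirit (unit name and image are made up; podman also ships generators for exactly this):

    cat > ~/.config/systemd/user/dev-db.service <<'EOF'
    [Unit]
    Description=Postgres for local dev

    [Service]
    ExecStart=/usr/bin/podman run --rm --name dev-db -p 5432:5432 \
        -e POSTGRES_PASSWORD=dev docker.io/library/postgres:16
    ExecStop=/usr/bin/podman stop dev-db

    [Install]
    WantedBy=default.target
    EOF
    systemctl --user daemon-reload
    systemctl --user enable --now dev-db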
One of the benefits of moving away from Kubernetes to a runner-based architecture is that we can now seamlessly support cloud-based and local environments (https://www.gitpod.io/blog/introducing-gitpod-desktop).
What's really nice about this is that with this kind of integration there's very little difference in setting up a dev env in the cloud or locally. The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).
For example, when you're running on your local machine you've actually got the amount of RAM and CPU advertised :)
Kubernetes is another mess of userspace ops tools. Userspace is for composable UI not backend. Kube and Chef and all those other ops tools are backend functionality being used like UI by leet haxxors
Unfortunately, after a few hires (hand-picked by me), this is what happened:
1) People didn't want to learn Nix, nor did they want to ask me how to make something work with Nix, nor did they tell me they didn't want to learn Nix. In essence, I told them to set the project up with it, which they'd do (and which would be successful, at least initially), but I forgot that I also had to sell them on it. In one case, a developer spent all weekend (of HIS time) uninstalling Nix and making things work using the "usual crap" (as I would call it), all because of an issue I could have fixed in probably 5 minutes if he had just reached out to me (which he did not, to my chagrin). The first time I heard them voice their true feelings on it was when I pushed back on this, because I would have gladly helped... I've mentioned this on various Slacks to get feedback, and people have basically said "you either insist on it and say it's the only supported developer-environment-defining framework, or you will lose control over it" /shrug
2) Developers really like to have control over their own machines (but I failed to assume they'd also want this control over the project dependencies, since, after all, I was the one who decided to control mine with the flake.nix in the first place!)
3) At a startup, execution is everything and time is possibly too short (especially if you have kids) to learn new things that aren't simple, even if better... that unfortunately may include Nix.
4) Nix would also be perfect for deployments... except that there is no (to my knowledge) general-purpose, broadly-accepted way to deploy via Nix, except to convert it to a Docker image and deploy that, which (almost) defeats most of the purpose of Nix (see the sketch at the end of this comment).
I still believe in Nix but actually trying to use it to "perfectly control" a team's project dependencies (which I will insist it does do, pretty much, better than anything else) has been a mixed bag. And I will still insist that for every 5 minutes spent wrestling with Nix trying to get it to do what you need it to do, you are saving at least an order of magnitude more time spent debugging non-deterministic dependency issues that (as it turns out) were only "accidentally" working in the first place.
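For the record, the Docker-image route from point 4 looks roughly like this, assuming a hypothetical `dockerImage` flake output built with `dockerTools`:

    nix build .#dockerImage    # hypothetical packages.<system>.dockerImage output
    docker load < result       # dockerTools builds a loadable image tarball
    # then tag and push as usual

It works, but as said, you lose most of what makes Nix interesting along the way.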
Having a kid has drastically altered my ability to learn new things outside of work, simply due to lack of time. I never could have imagined how big of an impact having a kid would be, it's crazy!
The worst thing is when you actually manage to carve out some time to do some learning or experimentation with a new tool, library, etc only to find out that it sucks or you just don't have the time to pick up or whatever.
It took me a couple of days to get a supervisor-based setup working locally. I was the only person on the team who would run the backend and frontend when trying things out, because nobody was actually using the dev environments fully anyways. There was no buy-in for the dev environment!
I really feel like if you are in a position to determine tooling, it's so much more helpful to lean into whatever people on the ground want to use. Obviously there are times when the people on the ground don't care, but if you're spending your sweat and tears to put the square peg into the square hole suddenly you're the person with superpowers, and not the person pushing their pet project.
And sometimes that's just "wrap my thing with your thing".
I ended up going with Bazel, not because of this particular problem alone (though it was part of it; people we hired spent WEEKS trying to get a happy edit/test/debug cycle going), but because proper dependency-based test caching was sorely needed. Using Bazel and Buildbuddy brought CI down from about 17 minutes per run to 3-4 minutes for a typical change, which meant that even if people didn't want to get a local setup going, they could at least be slightly productive. I also made sure that every dependency / tool useful for developing the product was versioned in the repository, so if something needs `psql` you can `bazel run //tools/postgres/psql` and have it just work. (Hate that Postgres can't be statically linked, though.)
It was a lot of work for me, and people do gripe about some things ("I liked `go test ./...`, I can't adjust to `bazel test ...`"), but all in all, it does work well. I would do it again. Day 1 at the company; git clone our thing, install bazelisk, and your environment setup is done. All the tests pass. You can run the app locally with a simple `bazel run`. I'm pretty happy with the outcome.
Nix is something I looked into for our container images, but they just end up being too big. I never figured out why; I think a lot of things are dynamically linked and they include their own /usr/lib tree with the entire transitive dependency chain for that particular app, even if other things you have installed have some overlap with that dependency chain. I prefer the approach of statically linking everything and only including what you need. I compromised by basing things on Debian and rules_distroless, which at least lets you build a container image with the exact same sha256 on two different machines. (We previously just did "FROM scratch; COPY <statically linked binary> /app; ENTRYPOINT /app", but then started needing things like pg_dump in our image. If you can just have a single statically-linked binary be your entire app, great. Sometimes you can't, and then you need some sort of reasonable solution. Also everything ends up growing a dependency on ca-certificates...)
It's not the learning new things that's a problem, but rather the fact that every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix. And that's because the feedback loop and general UX is just awful - I really started to feel like I needed a sacrificial chicken.
Docker may be a dumpster fire, but at least it's generally easy to see what you did wrong and fix it.
That's true for any architectural decision in an organization with more than 1 person.
It's really not something that should make you reconsider a decision. At the end of the day, an architecture that "people" actually want to use doesn't exist; "people" don't want any singular thing.
`nix copy .#my-crap --to ssh://remote`
What you do with it then on the remote depends on your environment. At the minimum do a `nix-store --add-root` to make a symlink to whatever you just copied.
(The most painless path is if you're deploying an entire NixOS system, but that requires converting the remote host to NixOS first.)
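Concretely, assuming the flake output really is `my-crap` and you want a GC root on the remote (paths are arbitrary):

    nix build .#my-crap
    nix copy .#my-crap --to ssh://remote
    # pin it on the remote so garbage collection won't delete it
    ssh remote "nix-store -r $(readlink result) --add-root /var/lib/gcroots/my-crap"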
IMO there are some workloads where it is beneficial for a developer to have access to a local repository with at least some snippets based on previous projects.
Having a leftover PoC of some concept written for a previous employer but never elevated to team use/production is both handy (at least to confirm that the build environment is still viable after an unspecified period of toolchain updates) and ethical (copying production code is not ethical - even if the old and new products are vastly different e.g. last job was taxi app, new app is banking app).
Making it all 'remote' and 'cloud' will eventually result in a bike-reinvention penalty at each new employment: not everything can be rebuilt from memory alone, especially things that are done once or twice a year. Sure, there are open-source docs and examples, but at some point that just introduces an even heavier penalty: you need to either know a lot of open-source code well enough to have reference points, or work on pet projects to build up the same amount of references.
And the new company would also be liable for using trade secrets that they shouldn’t.
I've worked in a remote, secured development environment and it sucked, but to their credit the company did it for exactly this reason - control over the source. But bear in mind that source control is a two-way street.
Losing proprietary source can be harmful (especially in compiled languages where the source might carry much more information than the distributable). But they were mostly worried about the opposite way...that something malicious gets INTO the source which could pose an existential threat. You'd be correct to say "well that should be the domain of source control, peer review etc", but in this case the company assessed the risk high enough to do both.
I once had to burn a ton of political capital (including some on credit), because someone who didn't understand software thought that cutting-edge tech startup software developers, even including systems programmers working close to metal, could work effectively using only virtual remote desktops... with a terrible VM configuration... from servers literally halfway around the world... through a very dodgy firewall and VPN... of 10Mb/s total bandwidth... for the entire office of dozens of developers.
(And no other Internet access from the VMs. Administrators would copy whatever files from the Internet that are needed for work. And there was a bureaucratic form for a human process, if you wanted to request any code/data to go in or out. And the laptops/workstations used only as thin-clients for the remote VMs would have to be Windows and run this ridiculous obscure 'endpoint security' software that had changed hands from its ancient developer, and hadn't even updated the marketing materials (e.g., a top bulletpoint was keeping your employees from wasting time on a Web site that famously was wiped out over a decade earlier), and presumably was littered with introduced vulnerabilities and instabilities.)
Note that this was not something like DoD, nor HIPAA, nor finance. Just cutting-edge tech on which (ironically) we wanted first-mover advantage.
This escalated to the other top-titled software engineer and I together doing a presentation to C-suite, on why not only would this kill working productivity (especially in a startup that needed to do creative work fast!), but the bad actors someone was paranoid about could easily circumvent it anyway to exfiltrate data (using methods obvious to the skilled software people like they hired, some undetectable by any security product or even human monitoring they imagined), and all the good rule-following people would quit in incredulous frustration.
Unfortunately, it might not have been even the CEO's call, but a crazy investor.
If it doesn't fit on one machine, though, you don't have another option: Meta, for example, will never have a local dev env for Instagram or Blue. Then you need to make some hard choices.
Personally, my ideal cloud dev env is:
1. Local checkout of the code you're working on. You can use whatever IDE or text editor you prefer. For large monorepos, you'll need some special tooling to make sure it's easy to only check out slices of the repo.
2. Sync the code to the remote execution environment automatically, with hot-reloading.
3. Auto-port-forward from your local machine to the remote (see the sketch after this list).
4. Optionally be able to run dependent services on your personal remote to debug/test their interactions with each other, and optionally be able to connect to a well-maintained shared environment for dependencies you aren't working on. If you have a shared environment, it can't be viewed as less-important than production: if it's broken, it's a SEV and the team that broke it needs to drop everything and fix it immediately. (Otherwise the shared env will be broken all the time, and your shipping speed will either drop, or you'll constantly be shipping bugs to prod due to lack of dev care.)
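Points 2 and 3 don't need much machinery at small scale. A rough sketch with stock tools, assuming a host alias `devbox` and made-up paths:

    # push local edits to the remote on every change (entr is one of many file watchers)
    while true; do
      find . -type f | entr -d rsync -az --delete --exclude .git ./ devbox:~/src/myapp/
    done

    # make the remote dev server reachable at localhost:8080
    ssh -N -L 8080:localhost:8080 devbox

Dedicated sync tools (e.g. Mutagen) handle the hot-reload loop more gracefully, but the shape is the same.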
At Meta we didn't have (1): everyone had to use VSCode, with special in-house plugins that synced to the remote environment. It was okay but honestly a little soul-sucking; I think customizing your tooling is part of a lot of people's craft and helps maintain their flow state. Thankfully we had the rest, so it was tolerable if not enjoyable. At Airbnb we didn't have the political will to enforce (4), so the dev env was always broken. I think (4) is actually the most critical part: it doesn't matter how good the rest of it is, if the org doesn't care about it working.
But yeah — if you don't need it, that's a lot of work and politics. Use local environments as long as you possibly can.
It'll work if the company can offer something similar to EC2. Unfortunately, most companies are not capable of doing so if they are not on the cloud.
> I have seen several attempts to move dev environments to a remote host. They invariably suck.
That slides straight into “therefore they will always suck and have no benefits and nobody should ever use them ever”. Apologies for the hyperbole, but I’m making a point: comments like these tend to shut down interesting explorations of the state of the art of remote computing and what the pros/cons are.
Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.
Ex - even on a VERY good connection, RTT on the network is going to exceed your frame latency for a computer sitting in front of you (a 60 Hz frame takes ~16 ms; even a good cross-country round trip is several times that), before we even get into the latency of the actual frame rendering on that remote computer. There's just no solution for "make the light go faster".
Then we get into the issues the author actually laid out quite compellingly - Shared resources are unpredictable. Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?
And that's before we even touch the "My machine is down/unreachable, I don't know why and I have no visibility into resolving the issue, when was my last commit again?" style problems...
> Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.
And this... is just bogus. We're not talking about machines running production data. We're talking about a developer environment. Sure - limit access to prod machines all you like; while you're at it, don't give me any production user data either - I sure as hell don't want it for local dev. What I do want is a fast system that I control, so that I can actually tweak it as needed to develop and debug the system. It is almost impossible to give a developer "the least access needed" to do development, because if you knew ahead of time exactly what access the work required, the development would already be done.
I wonder if Microsoft's approach for Dev Box is the right one.
Overall I agree with you that this is how it should be, but as DevOps working with so many development teams, I can tell you that too many developers know a language or two but beyond that barely know how to use a computer. Most developers (yes, even most of the ones in Silicon Valley or the larger Bay Area) with Macbooks will smile and nod when you tell them that Docker Desktop runs a virtual machine to run a copy of Linux to run OCI images, and then not too much later reveal themselves to have been clueless.
Commenters on this site are generally expected to be in a different category. Just wanted to share that, as a seasoned DevOps pro, I can tell you it's pretty rough out there.
I'm not recommending this as a best practice. I just believe that we, as developers, end up creating some myths to ourselves of what works and what doesn't. It's good to re-evaluate these beliefs now and then.
If you stick to the tried-and-true libs and change your function kwargs or method names when you get warnings, then I’ve had pretty rock-steady reproducibility even with an un-versioned `python -m pip install -r requirements.txt`.
I could also be a slob or just not working at the bleeding edge of python lib deployment tho so take it with a grain of salt.
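For what it's worth, pinning costs almost nothing if you ever want to stop relying on luck; a minimal sketch of the same venv workflow:

    python -m venv .venv
    . .venv/bin/activate
    python -m pip install -r requirements.txt
    # snapshot the exact versions that are known to work
    python -m pip freeze > requirements.lock
    # later, reproduce that environment exactly
    python -m pip install -r requirements.lock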
python -m venv .venv

> This is the story of how (not) to build development environments in the cloud.
I'd like to request that the comment thread not turn into a bunch of generic k8s complaints. This is a legitimately interesting article about complicated engineering trade-offs faced by an organization with a very unique workload. Let's talk about that instead of talking about the title!
Super useful negative example, and the lengths they pursued to make it fit! And no knock on the initial choice or impressive engineering, as many of the k8s problems they hit likely weren't understood gaps at the time they chose k8s.
Which makes sense, given k8s roots in (a) not being a security isolation tool & (b) targeting up-front configurability over runtime flexibility.
Neither of which mesh well with the co-hosted dev environment use case.
Because I don't understand most of the article if it's the former. How are things like performance a concern for internal development environments? And why are so many things stateful? Ideally there should be some kind of configuration/secret management solution so that deployments are consistent.
If it's the latter, then this is incredibly niche and maybe interesting, but unlikely to be applicable to anyone else.
> This is not a story of whether or not to use Kubernetes for production workloads; that's a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.
> This is the story of how (not) to build development environments in the cloud.
Perhaps a followup article will go into detail about their replacement.
Gitpod Flex is runner-based. The runner interface is intentionally generic so that we can support different clouds, on-prem or just Linux in future.
The first implemented runner is built around AWS primitives like EC2, EBS and ECS. But because of the more generic interface Gitpod now supports local / desktop environments on MacOS. And again, future OS support will come.
There’s a bit more information in the docs, but we will do some follow ups!
- https://www.gitpod.io/docs/flex/runners/aws/setup-aws-runner... - https://www.gitpod.io/docs/flex/gitpod-desktop
(I work at Gitpod)
Did you use Consul?
And that they're desperate to tell customers that they've fixed their problems.
Kubernetes is absolutely the wrong tool for this use case, and I argue that this should be obvious to someone in a CTO-level position, or their immediate advisors.
Kubernetes excels as a microservices platform, running reasonably trustworthy workloads. The key features of Kubernetes are rollout (highly available upgrades), elasticity (horizontal scaleout), bin packing (resource limits), CSI (dynamically mounted block storage), and so on. All this relates to a highly dynamic environment.
This is not at all what Gitpod needs. They need high performance disks, ballooning memory, live migrations, and isolated workloads.
Kubernetes does not provide you sufficient security boundaries for untrusted workloads. You need virtualization for that, and ideally physically separate machines.
Another major mistake they made was trying to build this on public cloud infrastructure. Of course the performance will be ridiculous.
However, one major reason for using Kubernetes is sharing the GPU. That is, to my knowledge, not possible with virtualization. But again, do you want to risk sharing your data, on a shared GPU?
To clarify on one of your points: Kubernetes itself has nothing to do with actually setting the security boundaries. It only provides a schema to describe resources and policies; an underlying system (perhaps Cilium for networking, or Kata Containers for micro-VMs) then ensures that the resources created actually follow those schemas and policies.
For example, Neon have built https://github.com/neondatabase/autoscaling which manages Neon Instances with Kubernetes by running them with QEMU instead. This allows them to do live migrations and resource (de)allocation while the service is running, without having to replace Kubernetes. These workloads are, as far as I understand it, stateless.
We've always had issues with stateful kubernetes setups. Can you share what makes it easier today than before? Genuinely interested.
What Neon is doing is quite a feat: Live migration (of a VM) while preserving TCP connections. It also took a lot of customization to achieve that.
But I agree that Kubernetes can indeed be used this way.
If anything, it further cements my original point about the Gitpod leadership.
The problem was never Kubernetes, but the dimwitted notion of using containers.
And then blaming Kubernetes for it: "We're leaving you."
Are you aware of the limits? It must run as root and privileged?
Example: What performance do you get out of your NVMe disks? Because these days you can build storage that delivers 100-200 GB/s.
https://www.graidtech.com/wp-content/uploads/2023/04/Results...
I bet few public cloud customers are seeing that kind of performance.
For anything stateful, monolithic, or that doesn't require autoscaling, I find LXC more appropriate:
- it can be clusterized (LXD/Incus), like K8S but unlike Compose
- it exposes some tooling to the data plane, especially a load balancer, like K8S
- it offers system instances with a complete distribution and an init system, like a VM but unlike a Docker container
- it can orchestrate both VMs (including Windows VMs) and LXC containers at the same time in the same cluster
- LXC containers have the same performance as Docker containers unlike a VM
- it uses a declarative syntax
- it can be used as a foundation layer for anything stateful or stateless, including the Kubernetes cluster
LXD/Incus sits somewhere between Docker Swarm and a vCenter cluster, which makes it one of the most versatile platforms. Nomad is also a nice contender; it cannot orchestrate LXC containers but can autoscale a variety of workloads, including Java apps and qemu VMs.
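A taste of the Incus workflow (instance names and images are illustrative):

    # a system container with a full distribution and init system
    incus launch images:debian/12 db
    incus config set db limits.memory 4GiB
    incus exec db -- systemctl status postgresql

    # the same tooling drives real VMs
    incus launch images:debian/12 legacy-app --vm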
In my opinion, k8s is great for stable and consistent deployment/orchestration of applications. Dev environments by default are in a constant state of flux.
I don’t understand the need for “cloud development environments” though. Isn’t the point of containerized apps to avoid the need for synchronizing dev envs amongst teams?
Or maybe this product is supposed to decrease onboarding friction?
The rest of our eng team just did dev on their laptops though. I do think there was a level of batteries-included-ness that came with the ephemeral dev envs which our less technical data scientists appreciated, but the rest of our developers did not. Just my 2c
It's also much cheaper to hire contractors and give them a CDE that can be terminated on a moment's notice.
>Kubernetes seems like the obvious choice for building out remote, standardized and automated development environments
- Is it really the Obvious Choice™ though, Fred?
- Hmm, let's consult the graphs.
>Kubernetes is a container orchestration system for automating software deployment.
- It's about automating deployment, Carl, not development environments!

> Kubernetes is not the right choice for building development environments, as we’ve found.

Based on this information, it is hard to justify even considering k8s for the problem that Gitpod has.
https://static.googleusercontent.com/media/research.google.c...
I am not sure what differences k8s has compared to Borg. At the concept level they are pretty comparable.
You're running hot pods for crypto miners, and up against people who really want to see the rest of the code that box has ever seen. You should be isolating with something purpose-built like Firecracker, and do your own dispatch & shred for security.
So if you started with kubernetes and fought the whole process of why it's not a great solution to the problem, I have to assume you didn't understand the problem. I :heart: kubernetes, its complexity pays my bills - but it's barely a good CI solution when you trust everyone involved, it's definitely not a good one where you're trying to be general-purpose to everyone with a makefile.
I ended up with a mix of Nix and its VM build system, which is based on QEMU. The issue is that it's too tied to NixOS, and all services run in the same place, which forces you to manage ports and other things.
How I wish it could work: a flake that defines certain services; these services could or could not run in different µVMs sharing an isolated Linux network layer. Your flake could define your versions and your commands to interact with and manage the lifecycle of those µVMs. As the Nix store can be cached/shared, it can provide fast and reproducible builds after the first build.
Can you expand on this? Are you talking about containers you create?
I think this approach works best in small teams where everyone agrees to drink the Nix juice. Otherwise, it's caused nothing but strife in my company.
What we have seen work, especially when you are building a developer-centric product, is exposing these native issues around network, memory, compute, and storage to engineers; they are then more willing to work around them. Abstracting those issues away shifts the responsibility onto the product.
Having said that, I still think k8s is an upgrade when you have a large team.
1. Some operations on a remote host, done in a local-oriented way, are time-consuming and unmanageable.
2. With a vendor-specific approach, our skills would become deprecated and we'd have a dependency on the vendors.
3. Kubernetes is not the best tool, but it is popular.
As always, a custom solution is the most powerful, but it should be replaced with a more unified approach for the stability of development.
From a resource provider's perspective, the only way to squeeze a margin out of that space would be to reverse engineer 100% of human developer behavior so that you can ~perfectly predict "slack" in the system that could be reallocated to other users. Otherwise it's just a worse DX, like TFA gives examples of. Not a business I'm envious to be in... Just give everyone a dedicated VM or desktop, and make sure there's a batch system for big workloads.
A heterogeneous architecture with multi-tenancy poses some unique challenges because, as mentioned in the article, you get highly inconsistent usage patterns across different services. Also, arbitrary code execution (with sandboxing) can present a significant challenge. For security, you ideally need full isolation between services which belong to different users; this isolation wasn't a primary design goal of Kubernetes.
That said, you can probably still use K8s, but in a different way. For smaller customers, you could co-locate on the same cluster, but for larger customers which have high scalability requirements, you could have a separate K8s cluster for each one. Surely for such customers, it's worth the extra effort.
So in conclusion, I don't think the problems which were identified necessarily warrant abandoning K8s entirely, but maybe just a rethinking of how K8s is used. K8s still provides a lot of value in treating a whole cluster of computers as a single machine, especially if all your architecture is already set up for it. In addition to scheduling/orchestration, K8s offers a lot of very nice-to-have features like performance monitoring, dashboards, aggregated logs, ingress, health checks, ...
Also, there is a long tail of issues to be fixed if you do it with Kubernetes.
Kubernetes does not just give you scaling, it gives you many things: run on any architecture, be close to your deployment etc.
All the problems in the article also seem self-imposed. k8s can run stateful workloads just fine; don't start and stop them. Figure out the math on how much it costs to run a container 24/7, add your margin, and pass that cost on to the customer. The customer can decide to stop their containers to save $$, so the startup latency won't hurt; they'll accept it because they know they're saving money.
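The math really is back-of-envelope. Taking, say, a 4 vCPU / 16 GB VM at roughly $0.19/hour (ballpark on-demand pricing, not a quote):

    awk 'BEGIN { printf "24/7 cost: $%.2f/month\n", 0.19 * 24 * 30 }'   # ~$136.80

Add margin, publish the number, and let the customer choose between always-on and stop/start.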
Oddly, I left with a funny alternate takeaway: one by one, their clever in-house tweaks and scheduling preferences were recognized by the community and turned into standard k8s knobs.
So I'm back to the original question... What is fundamentally left? It sounds like one part is maintaining a clean container path to simplify a local deploy, which a lot of k8s teams do (ex: most of our enterprise customers prefer our docker compose & AMIs over k8s). But more importantly, something fundamental architecturally about how envs run that k8s cannot do, but they do not identify?
Still, some of the core challenges remain:
- the flexibility Kubernetes affords makes it hard to build and distribute a product with such specific requirements across the broad swath of differently set up Kubernetes installations. Managed Kubernetes services help, but come with their own restrictions (e.g. kernel versions on GKE).
- state handling and storage remain unsolved. PVCs are not reliable enough, are subject to a lot of variance (see the point above), and behave vastly differently depending on the backing storage. Local disks (which we use to this day) make workspace startup and backup expensive from a resource perspective and hard to predict timing-wise.
- user namespaces have come a long way in Kubernetes, but by themselves are not enough. /proc is still masked; FUSE is still not usable.
- startup times, specifically container pulls and backup restoration, are hard to optimize because they depend on a lot of factors outside of our control (image homogeneity, cluster configuration).
Fundamentally, Kubernetes simply isn't the right choice here. It's possible to make it work, but at some point the ROI of running on Kubernetes simply isn't there.
AFAICT, a lot of that comes down to storage abstractions, which I'll be curious to see the answer on! Pinned localstorage <> cloud native is frustrating.
I sense another big chunk is the fast secure start problems that firecracker (noted in the blogpost) solve but k8s is not currently equipped for. Our team has been puzzling that one for awhile, and part of our guess is incentives. It's been 5+ years since firecracker came out, so likewise been frustrating to see.
Bottom of the post.
To the people saying ultra-modern hardware could handle it: worth remembering the companies in question started on this path X years ago, with Y set of technologies and Z set of experiences.
Just because it made sense for Google in 2012 or whatever doesn't necessarily mean they would choose it again --or not-- given a do-over (but there's basically no way back).
> A simpler version of this setup is to use a single SSD attached to the node. This approach provides lower IOPS and bandwidth, and still binds the data to individual nodes.
Are you sure a single SSD is that slow? NVMe devices are so fast that I find it hard to believe there's any need for RAID 0.
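Easy enough to measure rather than guess; fio gives you the single-device numbers (path and sizes are arbitrary):

    # sequential read throughput of one NVMe device
    fio --name=seqread --filename=/mnt/nvme/testfile --rw=read --bs=1M \
        --size=8G --iodepth=32 --ioengine=libaio --direct=1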
Does anyone have any links for cluster-autoscaler plugins? Searching is drawing a blank, even in the cluster-autoscaler repo itself. Did this concept get ditched/removed?
Kubernetes has never ever struck me as a good idea for a development environment. I'm surprised it took the author this long to figure out.
K8s can be a lifesaver for production, staging, testing, ... depending on your requirements and infrastructure.
Sounds sane. Am I missing anything?
Glad someone said it out loud. So true. Apptainer has been a far better development experience for us.
The infrastructure is now incredibly understandable, simple, and cost-effective.
Kubernetes unnecessarily cost us over $1 million in both DevOps time and actual Google Cloud costs, and even worse, it cost us time to market. Stay off Kubernetes as long as you can in your company, unless you are basically forced onto it. You should view it as an unnecessary evil that comes with massive downsides in terms of complexity and cost.
1.) What would you think of things like hetzner / linode / digitalocean (if stable work exists)
2.) What do you think of https://sst.dev/ or https://encore.dev/ ? (They support rather easier migration)
3.) Could you please indicate the split of that $1 million between DevOps time and unnecessary Google Cloud costs? And were there outliers (like "oops, our intern didn't set this one variable, misconfigured the cloud, and wasted $10k on GCloud"), or was it more that bandwidth simply costs that much more on GCloud? (I don't think the latter is the case though.)
Looking forward to chatting with you!
But this is really a spurious concern. I myself used to care about it years ago. But in practice, people rarely switch between cloud providers because the incremental benefits are minor; the providers are nearly equivalent, and there is not much to be gained by moving from one to the other unless politics are involved (e.g. someone high up wants a specific provider).
https://github.com/bhouston/template-typescript-monorepo
This is my living template of best practices.
Yup. Isn't it Knative Serving or a home grown Google alternative to it? https://knative.dev/docs/serving/
The key is I am not managing Kubernetes and I am not paying for it - it is a fool's errand, and incredibly rarely needed. Who cares what is underneath the simple Cloud Run developer UX? What matters for me is cost, simplicity, speed and understandability. You get that with Cloud Run, and you don't with Kubernetes.
Anyway, as always it depends on what you want to use it for.
I guess the team just wants to rewrite everything; it happens. A manager should prevent that.