However, after reading over this, there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale, or the real trade-offs. It seems like an average choice at best, and if I were paying the light bill I'd definitely object to going this route.
In the commercial proprietary world, clustered mainframes and supercomputers have addressed this niche for decades.
In the open source world, HashiCorp Nomad is the most analogous alternative to Kubernetes on commodity hardware, while SLURM is very successful for supercomputing.
I've also scale tested k8s to 15k nodes (in a limited configuration for a single application). At that point we ran out of underlying hardware budgeted for the test.
whatever Google and Facebook use internally.
I'm curious what they're doing now.
Scaling Kubernetes to 7,500 Nodes - https://news.ycombinator.com/item?id=25907312 - Jan 2021 (53 comments)
building skynet, apparently. All powered by k8s!
(kidding)
$ kubectl get nodes
I'm sorry, Dave, I can't do that
With power to spare.
A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.
Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster using a specialized storage layer.
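To make the idea concrete, here is a minimal sketch of a queue that lives outside any one cluster and hands jobs to whichever cluster has the most free capacity. This is not Armada's actual scheduler: the `Cluster`, `Job`, and `dispatch` names are hypothetical, capacity is reduced to a single CPU count, and the queue is an in-memory `deque` rather than a real storage layer.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical stand-in for one Kubernetes cluster.
    name: str
    free_cpus: int
    running: list = field(default_factory=list)


@dataclass
class Job:
    # Hypothetical job with a single resource dimension.
    job_id: str
    cpus: int


def dispatch(queue, clusters):
    """Place queued jobs on the cluster with the most free CPUs.

    Jobs that fit nowhere stay in the queue, in their original order.
    Returns a dict mapping job_id -> cluster name for the placed jobs.
    """
    placements = {}
    waiting = deque()
    while queue:
        job = queue.popleft()
        # Pick the least-loaded cluster; ties go to the first one listed.
        target = max(clusters, key=lambda c: c.free_cpus)
        if target.free_cpus >= job.cpus:
            target.free_cpus -= job.cpus
            target.running.append(job.job_id)
            placements[job.job_id] = target.name
        else:
            waiting.append(job)
    # Put the jobs that did not fit back on the shared queue.
    queue.extend(waiting)
    return placements
```

The point of the sketch is only that the queue state sits outside every cluster, so no single cluster's etcd has to absorb the enqueue rate.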
Armada is designed primarily for ML, AI, and data analytics workloads, and to:
- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.
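As an illustration of what "divide resources fairly" can mean, the sketch below implements plain max-min fairness over a single resource. This is an assumption chosen for illustration, not a description of how Armada computes fair shares; the function name and its inputs are hypothetical.

```python
def max_min_fair_share(capacity, demands):
    """Divide `capacity` among users by max-min fairness.

    `demands` maps user -> requested amount. Users asking for less than an
    equal share get their full demand; the remainder is split among the rest.
    """
    allocation = {}
    remaining = float(capacity)
    # Process users from smallest demand to largest.
    pending = sorted(demands.items(), key=lambda kv: kv[1])
    while pending:
        # Equal split of what is left among users not yet served.
        equal_share = remaining / len(pending)
        user, demand = pending.pop(0)
        granted = min(demand, equal_share)
        allocation[user] = granted
        remaining -= granted
    return allocation
```

For example, with a capacity of 10 and demands of 2, 4, and 10, the two smaller requests are met in full and the largest user receives the remaining 4.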
Armada is written in Go, using Apache Pulsar for eventing, PostgreSQL, and Redis. A web-based front-end (named "Lookout") provides easy end-user access to see the state of enqueued/running/failed jobs. A Kubernetes Operator to provide quick installation and deployment of Armada is in development.
Source code is available at https://github.com/armadaproject/armada - we welcome contributors and user reports!
It would be nice if someone could solve this problem in a more Kubernetes-native way, i.e. here is a container, run it on N nodes using MPI, optimizing for the right NUMA node / GPU configurations.
Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes, for example?