Scaling Kubernetes to 7,500 nodes (2021)

(openai.com)

95 points · izwasm · 3y ago · 37 comments

23 comments · 8 top-level

mrits · 3y ago · 7 in thread

I'm not a huge fan of Kubernetes. However, I think there are some great use cases and undeniably some super intelligent people pushing it to amazing limits.

However, after reading over this there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale or the real trade-offs. It seems like an average choice at best and if I were paying the light bill I'd definitely object to going this route.

Thaxll · 3y ago

There is no alternative as far as I know. Which open source or private solution scales above 10k nodes and 100k apps (pods)?

dharmab · 3y ago

There are many private, proprietary systems that exceed those scales. They are usually bespoke to the applications that they run. I work with two such systems and know of others in finance, energy and scientific computing. Not to mention Borg at Google.

In the commercial proprietary world, clustered mainframes and supercomputers have addressed this niche for decades.

In the open source world, HashiCorp Nomad is the most analogous alternative to Kubernetes on commodity hardware, while SLURM is very successful for supercomputing.

I've also scale tested k8s to 15k nodes (in a limited configuration for a single application). At that point we ran out of underlying hardware budgeted for the test.

emmp · 3y ago

One of Nomad's major pitches is that it can scale larger than K8s. It's all over any comparison between the two.

fmajid · 3y ago

HPC schedulers, most likely.

fancy_pantser · 3y ago

In HPC we see Slurm pretty often.

rco8786 · 3y ago

Mesos definitely does

IceWreck · 3y ago

> private solution

whatever Google and Facebook use internally.

sciurus · 3y ago · 4 in thread

This is from 2021 and was discussed then at https://news.ycombinator.com/item?id=25907312

I'm curious what they're doing now.

dang · 3y ago

Thanks! Macroexpanded:

Scaling Kubernetes to 7,500 Nodes - https://news.ycombinator.com/item?id=25907312 - Jan 2021 (53 comments)

MichaelMoser123 · 3y ago

> I'm curious what they're doing now.

building skynet, apparently. All powered by k8s!

MuffinFlavored · 3y ago

Scaling it to 7,600 nodes

(kidding)

mdaniel · 3y ago

Given that their first post in this vein was https://openai.com/research/scaling-kubernetes-to-2500-nodes then one would expect it to be "Scaling it to 12,500 nodes" :-D

   $ kubectl get nodes
   I'm sorry, Dave, I can't do that

b112 · 3y ago · 4 in thread

Success! Meanwhile, all 7500 nodes are, computationally, replaced by a 96 core, $10k server, in a dude's basement.

With power to spare.

intelVISA · 3y ago

But I thought Good System Design involved reserializing the same data multiple times across the cloud(tm) and had a dedicated SRE and infra team - it's cheaper than one sys admin!

jmillikin · 3y ago

You'd generally want each of those 7500 machines to be a full-sized server. No point running Kubernetes on tiny VMs, since its purpose is to provide bin-packed scheduling in a datacenter.
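
For readers unfamiliar with the term, "bin-packed scheduling" can be illustrated with the classic first-fit-decreasing heuristic. This is only a toy sketch under stated assumptions: the function name, the single CPU dimension, and the uniform node size are all made up for illustration, and the real Kubernetes scheduler is a filter-and-score pipeline over many resources, not this algorithm.

```python
def first_fit_decreasing(pod_cpus, node_capacity):
    """Pack pod CPU requests onto as few equal-sized nodes as the heuristic finds.

    pod_cpus: list of CPU requests, one per pod (hypothetical input).
    node_capacity: CPUs available on every node (assumed uniform).
    Returns a list of nodes, each a list of the pod requests placed on it.
    """
    nodes = []  # pod requests placed on each node
    free = []   # remaining CPUs on each node, same indexing as `nodes`
    # Place the largest pods first so small ones can fill the leftover gaps.
    for cpu in sorted(pod_cpus, reverse=True):
        if cpu > node_capacity:
            raise ValueError(f"pod needs {cpu} CPUs, node only has {node_capacity}")
        for i, remaining in enumerate(free):
            if remaining >= cpu:
                # First node with enough room wins.
                nodes[i].append(cpu)
                free[i] -= cpu
                break
        else:
            # No existing node fits, so open a new one.
            nodes.append([cpu])
            free.append(node_capacity - cpu)
    return nodes


if __name__ == "__main__":
    print(first_fit_decreasing([4, 8, 1, 4, 2, 1], node_capacity=10))
```

The larger each node is relative to the pods, the more freedom a packer like this has to fill gaps, which is one way to read the point about tiny VMs.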

electroly · 3y ago

This isn't some dipshit enterprise running LOB software. This is OpenAI. These are all giant multi-GPU nodes getting slammed all day with machine learning jobs.

mardifoufs · 3y ago

Yeah, I'm sure openai could've trained gpt4 on a 10k$ machine.

antonchekhov · 3y ago

To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project ( https://armadaproject.io/ ). Armada is a multi-Kubernetes cluster batch job scheduler, and is designed to address the following issues:

A single Kubernetes cluster can not be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.

Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling is performed partly out-of-cluster using a specialized storage layer.

Armada is designed primarily for ML, AI, and data analytics workloads, and to:

- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.

Armada is written in Go, using Apache Pulsar for eventing, Postgresql, and Redis. A web-based front-end (named "Lookout") provides easy end-user access to see the state of enqueued/running/failed jobs. A Kubernetes Operator to provide quick installation and deployment of Armada is in development.

Source code is available at https://github.com/armadaproject/armada - we welcome contributors and user reports!
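
To make the idea of out-of-cluster queueing with fair division across several clusters concrete, here is a toy sketch. It is not Armada's algorithm or API: the `Cluster`, `Job`, and `dispatch` names are hypothetical, only CPUs are tracked, and "fair" here simply means the user with the least CPU allocated so far is served next.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free_cpus: int


@dataclass
class Job:
    user: str
    cpus: int


def dispatch(queues, clusters):
    """Drain per-user FIFO queues onto clusters, least-served user first.

    queues: dict mapping user name -> deque of Job (held outside any cluster).
    clusters: list of Cluster; free_cpus is decremented as jobs are placed.
    Returns a list of (job, cluster_name) placements in dispatch order.
    """
    allocated = {user: 0 for user in queues}  # CPUs granted to each user so far
    placements = []
    while True:
        # Users with pending work, ordered by how little they have received.
        candidates = sorted(
            (u for u in queues if queues[u]),
            key=lambda u: (allocated[u], u),
        )
        placed = False
        for user in candidates:
            job = queues[user][0]
            # Prefer the emptiest cluster that can still fit the job.
            target = max(
                (c for c in clusters if c.free_cpus >= job.cpus),
                key=lambda c: c.free_cpus,
                default=None,
            )
            if target is None:
                continue  # head-of-line job fits nowhere; try the next user
            queues[user].popleft()
            target.free_cpus -= job.cpus
            allocated[user] += job.cpus
            placements.append((job, target.name))
            placed = True
            break
        if not placed:
            return placements  # nothing left to place, or nothing fits
```

With two 4-CPU clusters, three 2-CPU jobs from one user and one from another, the second user's job is dispatched right after the first user's first job instead of waiting behind all three. Jobs that fit nowhere simply stay in their queue, which is the part a real system would persist in an external store.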

vvladymyrov · 3y ago

Also they use Ray.io from Anyscale https://archive.ph/ZlMi5

osigurdson · 3y ago

>> Pods communicate directly with one another on their pod IP addresses with MPI via SSH

It would be nice if someone could solve this problem in a more Kubernetes-native way, i.e. here is a container, run it on N nodes using MPI, optimizing for the right NUMA node / GPU configurations.

Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes for example?

rmorey · 3y ago

good read. should probably get [2021] tag

satvikpendem · 3y ago

Is Kubernetes simply BEAM but not on Erlang?
