So, important piece of advice: if you can, hire an admin with HPC experience. If you can't, find ML people with HPC experience. Things you can ask about are slurm, environment modules (this is a clear sign!), what a flash buffer is, zfs, what they know about pytorch DDP, their linux experience, whether they've built a cluster before, adminning linux, and so on. If you need a test, ask them to write a simple bash script to run some task, and see if everything is in functions and if they know how to do variable defaults. These folks won't know everything, but they'll be able to pick up the slack and probably enjoy it. As long as you have more than one. Adminning is a shitty job, so if you only have one they'll hate their life.
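For concreteness, here's a rough sketch of what I'd hope that test produces (the task and paths are made up; the point is just seeing functions and variable defaults used correctly):

```bash
#!/usr/bin/env bash
# Hypothetical interview task: sync a results directory to shared storage.
# (Paths and the task itself are illustrative, not a real workflow.)
set -euo pipefail

# Variable defaults: use the environment value if set, otherwise fall back.
SRC_DIR="${SRC_DIR:-/scratch/$USER/results}"
DEST_DIR="${DEST_DIR:-/shared/archive/$USER}"

log() {
    echo "[$(date +%H:%M:%S)] $*" >&2
}

sync_results() {
    local src="$1" dest="$2"
    mkdir -p "$dest"
    rsync -a "$src/" "$dest/"
}

log "syncing $SRC_DIR -> $DEST_DIR"
sync_results "$SRC_DIR" "$DEST_DIR"
log "done"
```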
There are plenty of ML people who have this experience[0], and you'll really reap rewards for having a few people with even a bit of this knowledge. Without it, it's easy to buy the wrong things or have your system run far from efficiently and end up with frustrated engineers/researchers. Even with only a handful of people running experiments, schedulers (like slurm) still have huge benefits. You can do more complicated sweeps than wandb, batch-submit jobs (see the rough sbatch sketch below), track usage, allocate usage, easily cut up your nodes or even a single machine into {dev,prod,train,etc} spaces, and much more. Most importantly, a scheduler (slurm) will help prevent your admin from quitting, as it'll keep them from going into a spiral of frustration.
[0] At least in my experience these tend to be higher quality ML people too, but not always. I think we can infer why there would be a correlation (details).
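As a rough illustration of the batch-submission point above (partition name, resource numbers, and the training script are all placeholders, not a recommendation):

```bash
#!/usr/bin/env bash
# Minimal sbatch sketch: submit one training job to a hypothetical "train" partition.
#SBATCH --job-name=sweep-run
#SBATCH --partition=train
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x-%j.out

# srun runs the actual work under slurm's allocation and accounting.
srun python train.py --lr "${LR:-3e-4}"
```

Submit it with something like `sbatch --export=ALL,LR=1e-4 train.sbatch`; `squeue` and `sacct` then give you the queue view and the usage tracking mentioned above.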
No other task is needed, and our Grafana monitors whether the server (and its containers) are up and running.
Curious to know what you use other than grafana in your monitoring stack. We use prometheus for metrics/alerts and Loki/promtail for logs.
Nice! How much does this cost?
Once you have heavy and/or unconventional compute needs, it's likely cheaper to self-host or colo purchased hardware.
They are processing 2.5 billion images and videos in a single day. They decided to self-host their GPUs.
The solution uses off-the-shelf hardware, with a GPU per "server", all added together into a single rack? And that is the GPU compute needed to process all the videos 24/7?
Then they have this rack in the office, but they can't find a place to put it. That might be a decent question to start with, before the build: where do we put it?
But no. Planning for multiple network links, redundant power, cooling, security, monitoring, backup generators, handling backups, fire suppression, and failover to a different region if something fails was not necessary.
Because Google book?
But our (insert ad here) WeWork let us put our servers in a room on the same floor (their data-center-ish capabilities seem limited).
There are so many additional costs that are not factored into the article.
I am sure that once they accrue serious downtime a few times, along with irate customers, paying for hosting in a proper data center might start making sense.
Now, I am basing this comment on the assumption that the company is providing continuous real-time operations for their clients. If it is more batch-operated, where downtime is fine as long as results are delivered within, say, 12 hours, then the trade-off looks different.
I'd personally have these on tailscale, not exposed to the internet, but at some point in self-hosting, clients have to be able to talk to something.
I know tailscale has its endpoints, but I can't expect that to be able to serve a production API at scale.
> AMD 5700x processor
I find it to be an odd choice. I mean, the CPU itself is perfectly fine (typing this myself on a 5600G, which I very much like), but the AM4 socket is pretty much over: there is no upgrade path anymore once it starts getting long in the tooth. (Unlike the other parts, which can be bumped: RAM, GPU, storage...) I'm seeing AM4 boards and CPUs at easily half the price of AM5 gear in the consumer sector; I imagine it's similar in the professional sector.
I was going to say: a business just needs the server to last 3 years. Servers are normally written off after 3 years, and you don't plan upgrades. Currently we're aiming more for 5 years but budgeting for 3; that way anything beyond the 3 years is basically free. No one plans to purchase upgrade parts for their old Dell servers either.
You can also move some of these machines into other roles like QA later on.
Was going to toss an application your way since it sounds like interesting work, but it looks like the Google Form on your Careers page was deleted.
It says that would cost $6.51/hr and $4752/yr: I think you pay both of those things. I think the first number is the hourly cost, and the second number is the annual commitment. So it's $56,246/year if you're running 24x7, plus $4,756 = $61,002/year total.
- so you have to add the price of AMD/Intel bare metal servers.
- the price of "Networking" PER TB
- and the "Additional services pricing"
So even at the reserved price for a year (365 * 24 * $6.51) you're nowhere near $4,750 per year; it's closer to $60k.
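For what it's worth, a quick back-of-the-envelope check (assuming the quoted $6.51/hr applies around the clock and the ~$4.75k figure is a separate annual commitment, as suggested above):

```bash
# Back-of-the-envelope check with numbers from the thread (bc does the decimal math).
echo "365 * 24 * 6.51" | bc          # 57027.60 -> the hourly charge alone, running 24x7
echo "365 * 24 * 6.51 + 4752" | bc   # 61779.60 -> plus the annual commitment
```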