Some general feedback in case it's helpful:
-20K on contractors seems insane if we're talking about rack and stack for 10 racks. Many datacentres can be persuaded to do it for free as part of you agreeing to sign their contract. Your contractors should at least be using a server lift of some kind, again often provided kindly by the facility. If this included paying for server configuration and so on, then ignore that comment (bargain!).
-I would almost never expect to actually pay a setup fee (beyond something nominal like 500 per rack) to the datacentre either, certainly if you're going to be paying that fee it had better include rack and stack.
-A crash cart should not be used for an install of this size. The servers should be plugged into the network and then automatically configured by a script/iPXE. It might sound intimidating or hard but it's not, and it doesn't even require IPMI (though frankly I would strongly, strongly recommend it, if you don't already have it). I would use managed switches for the management network too, for sure.
-Consider two switches, especially if they are second hand. Even here, the cost of the cluster being unusable for a few days while you source and install a replacement is probably still in the thousands.
-Personally not a big fan of the whole JBOD architecture and would have just filled my boots with single socket 4U Supermicro chassis. To each their own, but JBOD's main benefit is a very small financial saving at the cost of quite a lot of drawbacks IMO. YMMV.
-Depending on who you use for GPUs, getting a private link or 'peering' to them might save you some cost and provide higher capacity.
-I'm kind of shocked that FMT2 didn't turn out much cheaper than your current colo, would expect less than those figures possibly with the 100G DIA included (normally about $3000/month no setup).
For iPXE, do you have any reference material you'd recommend? We had 3 people, each with reasonably substantial server experience, try for like 6 hours each, and for whatever reason it turned out to be too difficult.
For 10 racks it might not make sense.
For PXE / iPXE, there are several stages of boot. You have your NIC's option ROM, which might be, but probably is not, iPXE. That will hit DHCP to get its own IP and also request info about where to pull boot files. You'll need to give it a TFTP server IP and a filename (dhcpd config below).
I serve iPXE executables to non-iPXE clients. When iPXE starts up, it again asks DHCP, but now you can give it an http boot script. The simplest thing is to have something like

    #!ipxe
    kernel installer_kernel
    initrd installer_initrd
    boot
You can also boot ISOs, but that's a lot easier if you're in BIOS boot rather than UEFI. Better to practice booting kernels and initrds (unless you need to boot things like firmware update ISOs).
Then you'll have your installer (or whatever) booted, and you might have an unattended install set up for that, or you can just set up a rescue image that does DHCP (again!) and opens sshd so you can shell in and do whatever. Up to you.
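If you end up wanting a slightly different boot script per host or per image, it's easy to generate them. A minimal sketch in Python, not anything I actually run: the function name, the URLs and the `console=ttyS0` argument are made up for illustration, and the only real requirements are the `#!ipxe` first line and the `kernel` / `initrd` / `boot` commands.

```python
def render_ipxe_script(kernel_url, initrd_url, kernel_args=""):
    """Return the text of an iPXE script that boots one kernel + initrd."""
    # iPXE only treats a fetched file as a script if it starts with "#!ipxe".
    kernel_line = f"kernel {kernel_url} {kernel_args}".rstrip()
    lines = ["#!ipxe", kernel_line, f"initrd {initrd_url}", "boot"]
    return "\n".join(lines) + "\n"


# Hypothetical paths on the same HTTP server used in the dhcpd config.
script = render_ipxe_script(
    "http://203.0.113.11/tftpboot/installer_kernel",
    "http://203.0.113.11/tftpboot/installer_initrd",
    "console=ttyS0",
)
print(script)
```

Write the result out as menu.ipxe (or whatever your DHCP filename points at) and serve it over http.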
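The PXE part of my ISC dhcpd config is:

    # Assumes these are declared elsewhere in dhcpd.conf (see [1]):
    #   option space ipxe;
    #   option ipxe-encap-opts code 175 = encapsulate ipxe;
    #   option ipxe.no-pxedhcp code 176 = unsigned integer 8;
    #   option client-arch code 93 = unsigned integer 16;
    next-server 203.0.113.11;
    if exists user-class and option user-class = "iPXE" {
      # Second pass: iPXE is already running, hand it the http boot script.
      option ipxe.no-pxedhcp 1;
      filename "http://203.0.113.11/tftpboot/menu.ipxe";
    } else {
      # First pass: the NIC's own PXE ROM, so chainload iPXE over tftp.
      if option client-arch = 00:06 {
        filename "ipxe.efi-i386";
      } else if option client-arch = 00:07 {
        filename "ipxe.efi-x86_64";
      } else {
        filename "undionly.kpxe";
      }
    }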
(This is mostly consolidating bits and pieces from here [1].)
And I have those three files in the root of my tftp server. There's all sorts of other stuff you could do, but this should get you started. You don't really need iPXE either, but it's a lot more flexible if you need anything more, and it can load from http, which is gobs faster if you have large payloads.
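If the two-pass DHCP dance is the confusing part, here is the same branching written out as a plain Python function. This is only an illustration of which file each kind of client ends up being told to fetch; dhcpd does the real work, and the function name is made up.

```python
def boot_filename(user_class, client_arch):
    """Mirror the dhcpd branching: which boot file does this client get?

    user_class:  DHCP user-class string sent by the client, or None
    client_arch: DHCP client architecture (option 93) as an integer
    """
    if user_class == "iPXE":
        # iPXE is already running, so give it the http boot script.
        return "http://203.0.113.11/tftpboot/menu.ipxe"
    if client_arch == 6:
        return "ipxe.efi-i386"      # 32-bit UEFI firmware
    if client_arch == 7:
        return "ipxe.efi-x86_64"    # 64-bit UEFI firmware
    return "undionly.kpxe"          # everything else, i.e. legacy BIOS PXE


# A UEFI box boots twice through DHCP: first the NIC ROM, then iPXE itself.
print(boot_filename(None, 7))
print(boot_filename("iPXE", 7))
```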
If you really wanted to be highly automated, your image could be fully automated, pull in config from some system and reconfigure the BMC while it was there. But there's no need for that unless you've got tons of servers. Might be something to consider if you mass replace your disk shelves with 4U disk servers, although it might not save a ton of time. If you're super fancy, your colo network would have different VLANs and one of them would be the PXE setup VLAN: new servers, or servers needing reimaging, could be put into the PXE VLAN, and the setup script could move them into the prod VLAN when they're done. That's fun work, but not really needed, IMHO. Semi-automated setup scales a lot farther than people realize, a couple hundred servers at least. autopw [2] can help a lot!
[1] https://ipxe.org/howto/dhcpd
[2] https://github.com/jschauma/sshscan/blob/master/src/autopw