We're thinking of tackling the challenges of new hardware procurement: from the initial "what do I need?", through generating a BoM with all the boilerplate, to validating the quote, the delivery, etc.
We'd like to know whether you experience pain in this process, and where. For example:
- Who in your org generates the initial spec?
- What's your workflow from spec to quote to delivery?
- What works and what's broken?
- often you are limited to "off the shelf" approved subsets of hardware + RHEL configurations that have been tested by a centralised team. These teams rarely have a clue about what they are doing, are highly inflexible, and their specs are very out of date.
- with colo there has been more flexibility as these are bought for the bank by a 3rd party
- very large number of NICs needed for bonding + separate physical networks for client connectivity/internal connectivity/exchange connectivity.
- driven by business requirements from whatever trading area
- their own technology team
- then massive multilayer procurement sign off process that can take 6+ months
- the available hardware might change or increment during this process requiring a reset if the appetite exists
- usually not bought direct from manufacturer but a "pre approved" vendor who passes bank procurement onboarding procedures
- servers all have iLO for mgmt/failure etc - usually requires interaction with an official systems team via internal bank ticketing/helpdesk system
- for colo you open tickets with e.g. Equinix to receive the hardware and have it put in your cage or whatever, pending a weekend install by the bank or a 3rd party team, who hook them up to switches as per the design agreed with network + systems and the business tech team
- often when live there are 2/3 different teams monitoring the hosts 24/7 using various tools
prob left out a load of stuff.
By far the most painful part of the process is the two ends of the pipeline (procurement for the project, and the datacenter) not knowing anything about the opposite end. The team procuring the hardware would just assume it'd go fine into the datacenter however they dreamed it up, and the datacenter team would be flying blind with little detail as to how the system needed to be set up. It was rare that a procuring team would know which NICs needed to go in which ports, what VLANs would need to be spun up, and stuff like that.
yea. many of the teams would be devs who wouldn't understand that kind of thing (which was a large part of my job, being in such teams and knowing, roughly, how things worked helped get things done from day 1).
We're wondering if turning this headache into a SaaS that offers BoM templating, compatibility validation, interchangeable parts and such could be of interest to such an org. Essentially they'd be left with approvals, and the rest would be stepping through an automated workflow. WDYT?
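To make "compatibility validation" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration (the `Part` shape, the `provides`/`requires` resource names, the SKUs); it is not how any existing product works. The idea is just that each part declares what it offers and what it consumes, and the BoM fails if demand exceeds capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One BoM line item. All field names here are hypothetical."""
    sku: str
    provides: dict = field(default_factory=dict)  # e.g. {"pcie_slots": 3}
    requires: dict = field(default_factory=dict)  # e.g. {"pcie_slots": 1}


def validate_bom(parts):
    """Return a list of problems; an empty list means the BoM passes."""
    capacity, demand = {}, {}
    for part in parts:
        for resource, amount in part.provides.items():
            capacity[resource] = capacity.get(resource, 0) + amount
        for resource, amount in part.requires.items():
            demand[resource] = demand.get(resource, 0) + amount

    problems = []
    for resource, needed in sorted(demand.items()):
        available = capacity.get(resource, 0)
        if needed > available:
            problems.append(f"{resource}: need {needed}, have {available}")
    return problems


# Illustrative BoM: one chassis, four single-slot NICs, eight DIMMs.
chassis = Part("CHASSIS-1U", provides={"pcie_slots": 3, "dimm_slots": 16})
nic = Part("NIC-2x25G", requires={"pcie_slots": 1})
dimm = Part("DIMM-32G", requires={"dimm_slots": 1})

bom = [chassis] + [nic] * 4 + [dimm] * 8
print(validate_bom(bom))
```

With the made-up numbers above, the check flags the PCIe slots (four NICs, three slots) and passes the DIMMs. A real validator would also need vendor compatibility matrices, power and thermal budgets, and so on, which is exactly the undocumented know-how part.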
The problem is finding anyone with experience doing this. It's a skillset that's sorta been lost.
And to top it off, it seems like it's all undocumented know-how. You have to be a practitioner to pick it up.
We find that the tools are focused on all sorts of inventory (DCIM) and provisioning. Processes are homegrown based on whatever ticketing system you have in place. That's what we're looking to change.
So far it feels like the colo side isn't excited about introducing tools to automate things. It's like their jobs are at risk. And the customer side is bent on building their own tooling from scratch.
We have found some early adopters, but they're hard to come by, even when we can quantify savings in money and time to market, not to mention engineering frustration.
Would love to have a conversation around this!
It might make sense to add an Ansible type of reaction as well, though I'm unsure of the scenario.
I think abstracting ipmitool is nice but it's just a building block. The real killer is when you put it all together - recognizing a system is down, and bringing it back up automatically despite having no working OS. It might require using ipmitool. But it might also require clearing state in a provisioning system via an HTTP call. Or sending some keystrokes.
It's about moving beyond manual recovery when no OS is present.
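A rough sketch of what "putting it all together" could look like, in Python. It only builds a plan (a list of steps) from an observed state; the decision logic, the `provisioner_url` endpoint and the step names are assumptions made up for illustration. The `ipmitool` arguments used (`-I lanplus`, `-H`, `-U`, `-E`, `chassis power on|cycle`, `chassis bootdev pxe`) are standard ones, with `-E` reading the password from the `IPMI_PASSWORD` environment variable.

```python
import subprocess
import urllib.request


def ipmi_cmd(host, user, *args):
    """Build an ipmitool command line; -E takes the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]


def recovery_plan(host, user, power_state, os_responding, provisioner_url=None):
    """Decide which steps to take for a host. Returns a list of (kind, payload)."""
    if os_responding:
        return []  # nothing to recover

    steps = []
    if provisioner_url:
        # Hypothetical endpoint: clear stale state so the host can be re-imaged.
        steps.append(("http_delete", provisioner_url))
    steps.append(("exec", ipmi_cmd(host, user, "chassis", "bootdev", "pxe")))
    if power_state == "off":
        steps.append(("exec", ipmi_cmd(host, user, "chassis", "power", "on")))
    else:
        steps.append(("exec", ipmi_cmd(host, user, "chassis", "power", "cycle")))
    return steps


def execute(steps):
    """Run a plan for real. Not called below; shown only for completeness."""
    for kind, payload in steps:
        if kind == "exec":
            subprocess.run(payload, check=True)
        elif kind == "http_delete":
            request = urllib.request.Request(payload, method="DELETE")
            urllib.request.urlopen(request, timeout=10)


plan = recovery_plan(
    "10.0.0.5", "admin", power_state="on", os_responding=False,
    provisioner_url="http://provisioner.example/hosts/web01/state",
)
for step in plan:
    print(step)
```

Keeping planning separate from execution means the plan can be logged, reviewed or dry-run before anything touches a BMC. Sending keystrokes over a serial or KVM console would just be another step kind.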
On the other hand, there is also that growth tipping point where cloud is not cost effective anymore and you need to switch to physical servers, or at least doing so would save money.
That's quite a while later, though, save for some niche cases.
One interesting lane might be to combine these two and offer a one stop service for getting cloud servers and hardware. In particular you could offer guidance (automated or otherwise) about when it makes the most sense to use cloud or onprem at what scale.
Scope could also include things like colo vs self hosting, what datacenters to use, where to buy land to build new datacenters, region and CDN considerations, etc.
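The simplest possible version of that guidance is a break-even calculation. This is a toy sketch in Python with invented numbers; it assumes flat monthly costs and ignores things like hardware refresh, staffing changes and discounts, so treat it as the shape of the question rather than an answer.

```python
import math


def break_even_months(capex, onprem_monthly, cloud_monthly):
    """First month at which cumulative on-prem cost is no higher than cloud.

    Returns None if on-prem never catches up (cloud is cheaper per month).
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


# Made-up figures: 120k up front, 2k/month to run, versus 7k/month in cloud.
print(break_even_months(capex=120_000, onprem_monthly=2_000, cloud_monthly=7_000))
```

With those figures the up-front spend is recovered after 24 months. A 6+ month procurement cycle like the one described earlier in the thread pushes the real date out further.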
The full workflow was: the customer generated and provided the initial build spec; her program (called "Palinode") validated this spec, made the purchase, received equipment from the vendor, assembled it in test racks at a contractor facility, performed a physical inspection and validation of the build, then shipped it to the production data center, installed it, and performed final validation on-site, usually with a customer witness. For that, they used a suite of custom tools that matched the functionality your linked toolkit seems to be providing, and also had management and alerting tools all developed in-house.
I'd say the biggest pain points were not really something any third party or software provider could alleviate: the locked-down firmware and the NDAs with the network appliance vendors. They experienced so many bizarre appliance failure modes that could not even be debugged without chip-level log access, which required vendor reps to come on-site just to read the logs. And there was a major fire-drill level issue with defective memory that they couldn't even disclose to their customers because of the vendor NDA, and it took a really long time to address. I was honestly pretty surprised by that, as the vendor is an enormous publicly traded company that you'd think would be legally required to disclose product defects like this in components that could potentially power safety-critical devices, but for whatever reason that wasn't the case, and remediation was seriously slowed down by all the need for secrecy.
The only thing I can see changing this is competition, but it largely doesn't exist at the chip level or for specialized hardware. Many components and appliances only have two or three vendors, and the levels of compliance you have to go through to host classified data combined with patent law make it nearly undisruptable as an industry.