Having used both AWS Spot Instances and HPC resources (Slurm) for compute jobs, I'd say the amount of work to get things running is comparable... except that the AWS stuff quickly gets expensive when you need thousands of CPU hours per "experiment", and money is usually tight at universities ;-). Some Ansible scripts automate basically everything away for me. But sticking with the tool you know is not necessarily a bad idea.
I can see an alternate timeline where I set up Buildkite in AWS to run Slurm jobs on the HPC cluster, rather than using EC2 spot instances. Adopting Buildkite + S3 was probably the more important infra change.
> That is my biggest hurdle working at a university right now.
We took the opposite approach and assembled our own equipment, which we self-host and manage. The cost savings were significant, and what we lose to admin and maintenance is made up for by the flexibility and freedom to code and run what we want, when we want.
Probably because they are working with publicly available data, so they don't need to worry about HIPAA/restricted-data compliance. One of the main reasons for using the HPC/Slurm cluster is that IT (or the data compliance office) guarantees compliance for restricted data.