What are you running your big jobs on? I'm currently using Batch, but you've got to wait for the compute environment/VM to start up (if it's not already running), and that's a pain because startup takes forever.
I wish I could just run containers on large hardware the same way we run Lambdas: press the button and it just runs. I don't really care about having my own full compute environment; I just need enough memory and CPU to run the job.
The way we have it set up is:
- simple job queue with RQ (redis)
- monitoring watches the queue and pumps a metric into CloudWatch (there are a few different job types, and it calculates a single aggregate "queue pressure" value across them)
- autoscale then sets the desired capacity for a fleet of r4.2xlarge machines (somewhere between 1 and 20)
- the auto scaling config protects all those machines from scale-in, so they have to be shut down externally
- each of those machines records its boot time via a cron that runs at startup
- that lets another cron run just before the end of each billing hour: if the machine isn't doing anything at that point, it shuts itself down
- the machines are set to terminate on shutdown so they die completely
- additionally, we've hacked RQ so that workers closer to death move themselves to the back of the worker queue more frequently. This increases the chance they're idle (and can shut themselves down) at the end of the hour.
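The "queue pressure" aggregation in the second step can be sketched roughly like this. The weights and queue names are hypothetical (the post doesn't say how the aggregate is computed), and the actual monitor would read backlogs from RQ and push the value to CloudWatch with boto3's `put_metric_data`; that part is omitted to keep the sketch dependency-free:

```python
# Hypothetical relative cost per job type -- an assumption for illustration,
# not the real weighting used in the setup described above.
QUEUE_WEIGHTS = {"default": 1.0, "heavy": 4.0}

def queue_pressure(backlogs: dict[str, int]) -> float:
    """Collapse per-queue backlogs into the single scalar the autoscaler reads."""
    return sum(count * QUEUE_WEIGHTS.get(name, 1.0)
               for name, count in backlogs.items())

# Example: 10 cheap jobs and 2 heavy jobs -> one "pressure" number.
pressure = queue_pressure({"default": 10, "heavy": 2})
```

The autoscaling policy then only has to target a single metric, which is why collapsing the job types into one number is worth doing even if the weighting is crude.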
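The end-of-hour self-shutdown check (steps 4–6) could look something like the sketch below. All names here are assumptions: the boot-time file path, the grace window, and the `worker_is_busy` stub (which in practice would ask RQ whether this host's worker is in the `busy` state). It assumes hourly billing, as the setup above describes, and relies on the instance being configured to terminate on shutdown:

```python
import subprocess
import time

BOOT_FILE = "/var/run/worker-boot-time"   # hypothetical: written by the boot cron
IDLE_GRACE_SECONDS = 300                  # act only in the last ~5 min of the hour

def seconds_into_billing_hour(boot_time: float, now: float) -> float:
    """How far we are into the current boot-relative billing hour."""
    return (now - boot_time) % 3600

def worker_is_busy() -> bool:
    """Stub: in practice, check the local RQ worker's state via redis."""
    ...

def maybe_shutdown():
    with open(BOOT_FILE) as f:
        boot_time = float(f.read().strip())
    into_hour = seconds_into_billing_hour(boot_time, time.time())
    # Shut down only near the hour boundary, and only if idle; termination
    # on shutdown (configured on the instance) does the rest.
    if into_hour >= 3600 - IDLE_GRACE_SECONDS and not worker_is_busy():
        subprocess.run(["sudo", "shutdown", "-h", "now"])
```

Keying the check off boot time rather than wall-clock time matters: the billing hour starts when the instance launches, not on the hour.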