Ours are actually user-generated, and the running time of each task is variable (a few minutes to an hour). Users can dump anywhere between 1 and 200 tasks on us at a time.
The way we have it set up is:
- simple job queue with RQ (redis)
- monitoring watches the queue and pumps a metric into CloudWatch (there are a few different types of jobs and it calculates a single aggregate value for "queue pressure")
- autoscale then sets the desired capacity for a fleet of r4.2xlarge machines (somewhere between 1 and 20)
- the autoscale config protects all those machines from scale-in, so they have to be shut down externally
- each of those machines runs a cron job on boot that records the start time
- this lets another cron job run just before the end of each hour, counted from that start time. If the machine isn't doing anything at that point, it shuts itself down
- the machines are set to terminate on shutdown so they die completely
- additionally, we've hacked RQ so that workers that are closer to death will move themselves to the back of the queue more frequently. This gives them a higher chance of being idle, and so of shutting down, at the end of the hour.
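
A rough sketch of how two of those pieces might look: the aggregate "queue pressure" value and the end-of-hour shutdown check. The job type names, the weights, the function names, and the five-minute window are all assumptions for illustration, not the actual setup. Reading the queue lengths from RQ, publishing the metric to CloudWatch, and the shutdown call itself are left out.

```python
# Hypothetical weights: rough relative cost of one queued job of each type.
JOB_WEIGHTS = {"short": 1.0, "medium": 5.0, "long": 20.0}


def queue_pressure(queue_lengths, weights=JOB_WEIGHTS):
    """Collapse per-type queue lengths into one aggregate number.

    Unknown job types fall back to a weight of 1.0 (an assumption).
    """
    return sum(weights.get(job_type, 1.0) * count
               for job_type, count in queue_lengths.items())


def minutes_into_hour(boot_time, now):
    """Minutes elapsed in the machine's current hour, counted from boot.

    Both arguments are Unix timestamps in seconds.
    """
    return ((now - boot_time) % 3600) / 60.0


def should_shut_down(boot_time, now, is_busy, window_minutes=5):
    """True if the machine is idle and in the last few minutes of its hour."""
    near_end = minutes_into_hour(boot_time, now) >= 60 - window_minutes
    return near_end and not is_busy


if __name__ == "__main__":
    # Three short jobs and two long ones waiting.
    print(queue_pressure({"short": 3, "long": 2}))
    # Idle machine, 57 minutes after boot.
    print(should_shut_down(boot_time=0, now=57 * 60, is_busy=False))
```

The hourly cron would call something like `should_shut_down` with the recorded boot time and only power off when it returns true; a busy machine just waits for the next hour boundary.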