What are you running your big jobs on? I'm currently using Batch, but you've got to wait for the compute environment/VM to start up (if it's not already running), and that's a pain because startup takes forever.
I wish I could just run containers on large hardware the same way we run Lambdas: press the button and it just runs. I don't really care about having my own full compute environment; I just need enough memory and CPU to run the job.
The way we have it set up is:
- simple job queue with RQ (redis)
- monitoring watches the queue and pumps a metric into CloudWatch (there are a few different job types, and it calculates a single aggregate "queue pressure" value across them)
- autoscale then sets the desired capacity for a fleet of r4.2xlarge machines (somewhere between 1 and 20)
- the auto scaling config protects all those machines from scale-in, so they have to be shut down externally
- each of those machines records its boot time via a cron that runs at startup
- that lets another cron run just before the end of each billing hour: if the machine isn't doing anything at that point, it shuts itself down
- the machines are set to terminate on shutdown so they die completely
- additionally, we've hacked RQ so that workers closer to death move themselves to the back of the worker queue more frequently. This increases the chance they're idle (and can shut themselves down) at the end of the hour.
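The "queue pressure" aggregation in the second step can be sketched roughly like this. The weights and queue names are hypothetical (the post doesn't say how the aggregate is computed), and the actual monitor would read backlogs from RQ and push the value to CloudWatch with boto3's `put_metric_data`; that part is omitted to keep the sketch dependency-free:

```python
# Hypothetical relative cost per job type -- an assumption for illustration,
# not the real weighting used in the setup described above.
QUEUE_WEIGHTS = {"default": 1.0, "heavy": 4.0}

def queue_pressure(backlogs: dict[str, int]) -> float:
    """Collapse per-queue backlogs into the single scalar the autoscaler reads."""
    return sum(count * QUEUE_WEIGHTS.get(name, 1.0)
               for name, count in backlogs.items())

# Example: 10 cheap jobs and 2 heavy jobs -> one "pressure" number.
pressure = queue_pressure({"default": 10, "heavy": 2})
```

The autoscaling policy then only has to target a single metric, which is why collapsing the job types into one number is worth doing even if the weighting is crude.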
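The end-of-hour self-shutdown check (steps 4–6) could look something like the sketch below. All names here are assumptions: the boot-time file path, the grace window, and the `worker_is_busy` stub (which in practice would ask RQ whether this host's worker is in the `busy` state). It assumes hourly billing, as the setup above describes, and relies on the instance being configured to terminate on shutdown:

```python
import subprocess
import time

BOOT_FILE = "/var/run/worker-boot-time"   # hypothetical: written by the boot cron
IDLE_GRACE_SECONDS = 300                  # act only in the last ~5 min of the hour

def seconds_into_billing_hour(boot_time: float, now: float) -> float:
    """How far we are into the current boot-relative billing hour."""
    return (now - boot_time) % 3600

def worker_is_busy() -> bool:
    """Stub: in practice, check the local RQ worker's state via redis."""
    ...

def maybe_shutdown():
    with open(BOOT_FILE) as f:
        boot_time = float(f.read().strip())
    into_hour = seconds_into_billing_hour(boot_time, time.time())
    # Shut down only near the hour boundary, and only if idle; termination
    # on shutdown (configured on the instance) does the rest.
    if into_hour >= 3600 - IDLE_GRACE_SECONDS and not worker_is_busy():
        subprocess.run(["sudo", "shutdown", "-h", "now"])
```

Keying the check off boot time rather than wall-clock time matters: the billing hour starts when the instance launches, not on the hour.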