Ask HN: How would you queue and process 10K+ long running jobs

14 points · bballer · 7y ago · 28 comments

Hey guys, I wanted to ask which technologies & methodologies you would architect together if you needed to constantly queue up 10K+ jobs, distribute the work out, and then report that it was completed. It must never be possible to schedule a duplicate job while one is already queued/running, and each job must get picked up by only 1 worker and run once. These jobs could last anywhere from 30 seconds to 1 hour in run time.

I've tried googling but my fu is failing me. Would love to hear the thoughts from people who have maybe solved similar problems.

Thanks!

27 comments · 16 top-level

mtmail 7y ago · 2 in thread

Search for "job scheduling software" plus the name of your programming language of choice, or for "background processing". https://sidekiq.org/ is one for Ruby, for example, and https://aws.amazon.com/sqs/ is one that runs in the cloud. Those pages should give you more words to search for, as "job scheduling" and "queue" give too many non-software-related results.

wallflower 7y ago

Amazon SQS is not an ideal choice because it is not designed to be a “forever” queue. Messages expire after at most two weeks (14 days is the maximum retention period).

rubenhak 7y ago

He doesn't seem to need a forever queue. Processing might take a while, but he just needs each job processed once (and exactly once) before moving on. If he wanted to do event sourcing, then yes, SQS would not suffice, but I don't think he needs that. SQS seems to be a great choice.

rubenhak 7y ago · 2 in thread

You should provide more info about your environment. If you're running this in a public cloud, tell us which one. Every provider has several native queue services for different needs, which makes things easier to work with and lets you worry less about setting things up.

bballer (OP) 7y ago

For workers we run only Java, and we run everything on AWS. We have an existing production system that uses a combination of SQS, DynamoDB, Postgres, and EC2 instances to achieve something very similar to this. We just want to check all the boxes before we dive into building something for a new system coming into production that shares many of the same requirements.

rubenhak 7y ago

SQS FIFO queues should let you process each task once & only once. Just make sure you configure the timing parameters correctly, in particular the visibility timeout, which has to outlast your longest job (or be extended while the job runs).

DynamoDB has triggers that get fired upon changes. That would strongly help with an eventual consistency implementation (which I strongly recommend). But for this you would have to write Lambdas, so check how well Lambda supports Java.

Are you sure you want to use EC2 directly? Why not use ECS? This would let you focus more on the business and less on infrastructure.

blcArmadillo 7y ago · 2 in thread

Depending on what a job is, it seems like this could all be done with Jenkins.

bballer (OP) 7y ago

Sorry, but Jenkins doesn't achieve anything related to my question.

aprdm 7y ago

How so? Jenkins is just a job scheduler; it has lots of options for scheduling and dispatching jobs.

saluki 7y ago · 1 in thread

Are you choosing a tech stack to do this?

Laravel has this built in with queues.

https://laravel.com/docs/5.7/queues

You can run multiple workers and it will distribute the jobs among them, and there is a Laravel Horizon package that can handle monitoring of the queues/jobs.

I expect Rails would have something similar, but I haven't used queues in Rails.

bballer (OP) 7y ago

We run all Java, so that is a no-go, and at the scale I'm talking about I don't think I would trust those solutions. Thanks for the input though!

Sahhaese 7y ago · 1 in thread

Any message queue or similar could help with this.

Popular solutions:

* RabbitMQ

* Service Bus

* Kafka

RabbitMQ would be well suited: you define a producer and can then spin up as many consumers as you like, each consuming from the same queue.

Preventing duplicate queuing should be done on the producer before it is queued.

It depends on the nature of the scaling and how much durability you want, though. You may wish to simply maintain an atomic queue of work to be done, in which case any thread-safe list would suffice as long as changes were made atomically.

10k isn't that much in the grand scheme of things. A simple database could easily store such a queue if you didn't have that many nodes trying to consume the same table at once.

You would need to write stored procedures for transactional read & delete, and for inserts that prevent duplication of jobs.

In that case something like Redis might be good, which itself can also act as pub/sub and be used for messaging.

Would you look to scale up more consumers as the queue lengthened, or would there still be a fixed number of consumers?
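
The "thread-safe list with atomic changes" idea above, combined with duplicate prevention on the producer side, can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a durable queue: the `DedupQueue` class and its method names are made up for the example, and a real system would keep this state in a broker or database.

```python
import threading
from collections import deque


class DedupQueue:
    """In-memory job queue that refuses a job ID that is already queued or running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()  # job IDs waiting for a worker
        self._active = set()     # job IDs that are queued OR running

    def enqueue(self, job_id):
        """Return True if the job was queued, False if it is a duplicate."""
        with self._lock:
            if job_id in self._active:
                return False
            self._active.add(job_id)
            self._pending.append(job_id)
            return True

    def claim(self):
        """Hand the next job to exactly one caller, or None if nothing is waiting."""
        with self._lock:
            return self._pending.popleft() if self._pending else None

    def complete(self, job_id):
        """Mark a job finished so the same ID may be scheduled again."""
        with self._lock:
            self._active.discard(job_id)
```

Because a job ID stays in `_active` from `enqueue` until `complete`, a second `enqueue` of the same ID returns `False` while the first is queued or running, and `claim` hands each ID to a single caller.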

bballer (OP) 7y ago

Thanks for your comment! Don't have enough time to fully respond to it right now but will get back to you in the morning.

lfx 7y ago · 1 in thread

This would be easy with AWS SQS: you can put as many elements on the queue as you want (some limits apply, but they can easily be lifted) and then remove items from the queue when the job is done. By tuning the visibility timeout you can make it fit your requirements. You can use Lambdas (too short for your case), EC2, or Fargate as your workers, and they can scale up or down depending on load. What is cool is that you can create multiple shards if you can predict how long jobs will take, so some could be done in Lambda and others on EC2, thus reducing costs.
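
To make the visibility timeout concrete, here is a toy in-memory model of the behaviour described above. It is not the AWS API: `VisibilityQueue` and the injectable `clock` parameter are assumptions for illustration only. A received message is hidden for the timeout and becomes receivable again unless the worker deletes it first.

```python
import time


class VisibilityQueue:
    """Toy model of a queue with an SQS-style visibility timeout (not the AWS API)."""

    def __init__(self, visibility_timeout, clock=time.monotonic):
        self._timeout = visibility_timeout
        self._clock = clock
        self._visible_at = {}  # message ID -> time at which it becomes receivable

    def send(self, message_id):
        """Add a message that is immediately visible."""
        self._visible_at[message_id] = self._clock()

    def receive(self):
        """Return one visible message and hide it for the timeout, or None."""
        now = self._clock()
        for message_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[message_id] = now + self._timeout
                return message_id
        return None

    def delete(self, message_id):
        """Called by the worker once the job is done."""
        self._visible_at.pop(message_id, None)
```

For jobs that can run up to an hour, the timeout would need to exceed the longest expected run or be extended while the job is in progress; otherwise a second worker may receive the same message.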

bballer (OP) 7y ago

Thanks for your insight. This is currently close to what we have in production for one of our systems and we are building a brand new system that shares a lot of the same requirements, just fishing to make sure we aren't missing anything :]

superasn 7y ago · 1 in thread

We did something similar on our site, but the max duration of each job was 5 minutes, so this solution may not be totally relevant to you. What we did was create an AWS Lambda function that is triggered when a file is written to S3.

So instead of traditional SQS we just write unique files on S3 with job data and that triggers the lambda function to process the job and notify a URL upon completion.

bballer (OP) 7y ago

Yeah, Lambda functions won't suffice for the kind of jobs we run. Plus we are a full Java shop and don't want to pay for the cold start times on Lambda for the jobs that do end up taking less than 1 minute.

Thanks for your thoughts!

dmarlow 7y ago · 1 in thread

Can you elaborate on what the "job" is or does? Can things be batched?

bballer (OP) 7y ago

I won't fully elaborate on what a job is, but I'll give you a couple of examples:

Updating anywhere from 500 to 5M items over APIs that are rate limited.

Dumping datasets that have to be normalized and massaged into files and dropped off at third-party servers anywhere from every 15 minutes to once a month. These files could contain anywhere from 500 lines to 5M lines.

Ingesting datasets just as large as described above, but massaged and saved into our caches and DB.

sethammons 7y ago

You have an interesting requirement: each job only gets picked up by 1 worker and run once (with up to an hour long job).

I'll contend that you can't do that. You could get at least once or at most once. Let's say that you go with at least once.

Just use a queue like RabbitMQ. Workers connect, request work, ack that they are done, and you should be good to go. Done. You can set this up today if you want.

If you need more thorough duplicate detection, you could sprinkle in some Redis to store job state (in progress / complete). Using atomic operations like INCR/DECR on your key, you could pull a job from Rabbit (or your queue of choice), hit Redis to ensure that the job is not already in progress or complete (e.g. after a network error between Rabbit and your workers), and then proceed appropriately.

The key problem here is that the network could drop requests. You could pull from your queue, complete the work, and think you ack'd, but the queue never gets the ack, so it hands out the work again to a new worker after the lock expires on it. So you could mitigate that with an additional layer of a distributed KV store. But that could have the same problem.

I run a system that processes billions of events a day, and we use something very similar to what I described above (though we have a custom queue solution and a pool of Redis nodes with some custom quorum logic around them). We seem to duplicate hardly any jobs (maybe a handful a week).

If you use Kafka, and only use the Java clients, they say you can get exactly-once delivery. See https://www.confluent.io/blog/exactly-once-semantics-are-pos....

The way they do it is by controlling both the client and the server, sprinkling in some write-ahead logs with logical clocks, and using a formal consensus protocol under the hood. Even with all that, I'm skeptical of the exactly-once claim.
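
The duplicate check described above can be sketched as follows, with an in-memory class standing in for the shared store. `JobStateStore`, `handle_delivery`, and `run_job` are hypothetical names, and the sketch uses an atomic set-if-absent claim (what Redis offers via `SET` with the `NX` option) rather than INCR/DECR.

```python
import threading


class JobStateStore:
    """Stand-in for a shared KV store; claim() mimics an atomic set-if-absent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # job ID -> "in_progress" or "complete"

    def claim(self, job_id):
        """Atomically mark the job in progress; False if it was already claimed."""
        with self._lock:
            if job_id in self._state:
                return False
            self._state[job_id] = "in_progress"
            return True

    def mark_complete(self, job_id):
        with self._lock:
            self._state[job_id] = "complete"


def handle_delivery(store, job_id, run_job):
    """Run the job unless an earlier delivery already claimed it.

    Returns True if the job ran, False if the delivery was a duplicate.
    The caller acks the queue message in both cases.
    """
    if not store.claim(job_id):
        return False
    run_job(job_id)
    store.mark_complete(job_id)
    return True
```

A redelivered message for a job that is already in progress or complete is acknowledged without running the work again. As the comment notes, the store itself can fail, and a worker that crashes after claiming would leave the job stuck in progress unless the marker expires, so this narrows the duplicate window rather than closing it.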

kwillets 7y ago

I did a lot of troubleshooting on a system like this a year or two ago, and most of it came down to making sure that global state transitions are atomic, and making communications as robust as possible.

We had the basics of execute-once using a leasing pattern, but we had a number of bugs related to multiple instances of a task existing in different threads (the executor would load the task object and then fork, leaving two instances in possibly stale states, and I also found failure paths that left multiple instances running), and we also saw a number of daily double-executions related to the lease-renewal process freezing, or non-transactional state transition.

We added a lot of state-transition auditing, including a pid/thread ID to find out where updates were coming from.

IIRC I eventually settled on having the executor (queue listener) do every possible check prior to execution (checking resource limits, process count limits, etc.) without loading the task instance itself (just the ID from the queue message). After the fork the child loads the task and does a single transaction that deletes the queue message and creates the execution record (the one-and-only-one run, basically). Every failure up to that point will requeue, but once the run is created, the queue message has to be deleted. We then transition to leasing the execution, and mark it failed if the lease expires.

We also created a centralized service to renew the leases on the execution objects after we found that to be a failure point. Long-running processes just have a lot of problems keeping connections open, etc.

shoo 7y ago

it might be relevant to say how much throughput you need, or other factors such as whether your problem is inherently concurrent (reacting to events outside your control) or whether you actually just want to do parallel processing.

for example, "10k+ jobs" sounds like a large number, but depending on throughput perhaps it is trivial.

i have a hobby project to fetch data from external sources and store the results in a database. this has about 70k different jobs defined. each job is scheduled to be run at some frequency. i run everything on a single physical box with 4gb of ram and a low energy CPU. The worker processes are python scripts, the job queue state is stored in the same postgres database i use to store results. My throughput is low, i only need to process a job every few seconds. The workers run on the same box as the database as I am too lazy to maintain more machines and too cheap to rent cloud servers. Running costs are about $15 / year for energy.

The queue implementation is based on this: https://blog.2ndquadrant.com/what-is-select-skip-locked-for-...

from memory i think I am using primitives offered by database to prevent multiple workers from acquiring the same job (transactions, transaction isolation). this might not be very scalable but i only have two worker processes and each job takes seconds to process.

do you want to do what I'm doing? Probably not. but perhaps what you are doing is easy.
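
For reference, the `SKIP LOCKED` approach from the linked article usually looks something like the sketch below. It assumes PostgreSQL 9.5 or later, a DB-API connection such as one from psycopg2, and a hypothetical `jobs` table with `id`, `status`, and `started_at` columns; none of these details come from the comment.

```python
# Hypothetical schema: jobs(id, status, started_at). PostgreSQL-specific SQL.
CLAIM_SQL = """
UPDATE jobs
   SET status = 'running', started_at = now()
 WHERE id = (
        SELECT id
          FROM jobs
         WHERE status = 'queued'
         ORDER BY id
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id
"""


def claim_next_job(conn):
    """Claim one queued job on a PostgreSQL DB-API connection, or return None."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()  # None when no queued job was available
    conn.commit()
    return row[0] if row else None
```

Concurrent workers running this statement skip rows that another transaction has locked, so two workers do not claim the same job. A worker that dies after committing leaves its row marked `running`, so some timeout or cleanup pass would still be needed.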

ecesena 7y ago

There are great comments on the technology itself, including SQS/PubSub/RabbitMQ, or Celery.

However, if you need 1) management, i.e. an easy way to look at what went wrong and retry, or 2) dependencies between jobs, you should look into something like Airflow (instead of building your own): https://airflow.apache.org

It's also a very good example of architecture, in case you decide it's not good for you and you really want to build your own.

linksnapzz 7y ago

Have you considered...a traditional batch system, like OpenPBS, SLURM, or Gridengine? That sorta sounds like the tasks they were meant to solve...

iAm25626 7y ago

The following comes to mind

Python based:

http://www.celeryproject.org/ <-- async task queue

https://github.com/spotify/luigi <-- more pipeline-centric

There are many like it, e.g. RQ (more bare-bones).

atomashpolskiy 7y ago

I've done this for running long-lived subscriptions in a trading platform back-end service. The load is similar. Basically, you need cluster software with a sharding option. I personally used Akka Cluster Sharding (a Scala library, which also happens to have Java bindings).

It starts a network node in each instance of the service and binds all nodes into a cluster (discovery of other nodes is left for the service developer; simplest solution is to have a static list of node addresses). The sharding mechanism allows you to distribute arbitrary data objects among the nodes according to some rules (e.g. based on the value "object's hash modulo number of nodes in cluster", which produces a typical hashring). Data objects may originate on any node (e.g. on schedule or on some external event). They also need to be serializable to be passed between nodes and, obviously, the binary representation should not be too big. So, depending on the nature of jobs in your case, you may want to pass only the job ID as the data object and store the actual job definition and/or arguments in a separate place (e.g. a database).

Now, Akka will guarantee that the job will be run only by 1 node and run once (until completion that is; the job may be started more than once, see tips below). But if there are many jobs coming in, you risk overwhelming the cluster (because passing messages between nodes takes time, jobs themselves take time, etc.) So to defend your cluster against load, which it might not be able to handle, you may put a message queue in front of it. This will let you set up a max number of jobs that concurrently run in the cluster, and nodes will take new jobs from the queue only when some of the currently running jobs complete. Most MQs have persistence, so the jobs will be safe even if they need to wait for a while sitting in the queue.

A few tips:

1) Akka persists a node's data, so the jobs that are taken from the queue are going to be safe in case the node they are located on fails

2) If a node leaves the cluster, all its jobs will be moved to other nodes (according to the sharding rules mentioned above) and started over, so you may need to introduce some kind of transactions (same goes for cluster restart)

3) If the whole cluster is restarted, unfinished jobs will be distributed among the nodes according to the sharding rules, so it might be a good idea to make these rules "sticky", so that each individual job is always assigned to the same node (and each node will just have to load its jobs from its own persistent store). Otherwise there might be a lot of message passing, which will slow down the startup.
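
The sharding rule mentioned above ("object's hash modulo number of nodes in cluster") can be sketched in plain Python. This is not Akka's API; `node_for_job` and the static node list are assumptions for illustration. A cryptographic digest is used instead of Python's built-in `hash()`, which is randomized per process for strings and would therefore not give the same assignment after a restart.

```python
import hashlib


def node_for_job(job_id, nodes):
    """Map a job ID to one node using a stable hash modulo the node count."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c"]  # hypothetical static node list
    for job in ("job-1", "job-2", "job-3"):
        print(job, "->", node_for_job(job, cluster))
```

With a fixed node list the mapping is "sticky" in the sense of tip 3: the same job ID always lands on the same node. Changing the number of nodes changes the modulus and moves many jobs.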

aprdm 7y ago

The visual effects industry has been doing that forever to render frames.

Have a look at Tractor, Qube, Rush, and Flamenco (by Blender).
