I've tried googling but my fu is failing me. Would love to hear the thoughts from people who have maybe solved similar problems.
Thanks!
DynamoDB has triggers that get fired upon changes. That would strongly help with eventual consistency implementation (which i strongly recommend). But with this you should write Lambdas. Check how well is Lambda Java supported.
Are you sure you want to use EC2 directly? Why not to use ECS? This would let you focus more on the business and less on infrastructure
Laravel has this built in with queues.
https://laravel.com/docs/5.7/queues
You can run multiple workers, it will intelligently distribute the jobs, and there is a Laravel Horizon package that can handle monitoring of the queues/jobs.
I expect Rails would have something similar, but I haven't used queues in Rails.
Popular solutions:
* RabbitMQ
* Service Bus
* Kafka
RabbitMQ would be well suited, you define a producer and can then spin up as many consumers as you would like, each consuming from the same queue.
Preventing duplicate queuing should be done on the producer before it is queued.
It depends on the nature of the scaling and how much durability you want though, you may wish to simply maintain an atomic queue of work to be done, in which case any thread-safe list would suffice as long as changes were done atomically.
10k isn't that much in the grand scheme of things, a simple database could easily store such a queue if you didn't have that many nodes trying to consume the same table at once.
You would need to write stored procedures for transactional read & delete and inserts to prevent duplication of jobs.
In that case something like Redis might be good, which itself can also act as pub/sub and used for messaging.
Would you look to scale up more consumers as the queue lengthened, or would there still be a fixed number of consumers?
So instead of traditional SQS we just write unique files on S3 with job data and that triggers the lambda function to process the job and notify a URL upon completion.
Thanks for your thoughts!
Updating anywhere from 500-5M items over APIs that are rate limited.
Dumping datasets that have to be normalized and massaged into files/ and dropped off at third party servers anywhere from every 15 minutes to once a month. These files could contain anywhere from 500 lines to 5M lines.
Injesting datasets just as large as described above but massaged and saved into our caches and DB.
I'll contend that you can't do that. You could get at least once or at most once. Let's say that you go with at least once.
Just use a queue like RabbitMQ. Workers connect, request work, ack that they are done, and you should be good to go. Done. You can set this up today if you want.
If you need more thorough duplicate detection, you could sprinkle in some redis to store job state (in progress / complete). Using atomic operations like INCR/DECR on your key, you could pull a job from Rabbit (or your queue of choice), hit redis to ensure that the job is not in progress or already complete due to a network error between Rabbit and your workers, and then proceed appropriately.
The key problem here is that the network could drop requests. You could pull from your queue, complete the work, and think you ack'd, but the queue never gets the ack, so it hands out the work again to a new worker after the lock expires on it. So you could mitigate that with an additional layer of a distributed KV store. But that could have the same problem.
I run a system that processes billions of events a day and we use a system very similar to what I described above (though we have a custom queue solution and a pool of redis nodes that we have some custom quorum logic around). We don't seem to be duplicating hardly any jobs (maybe a handful a week).
If you use kafka, and only use the java clients, they say you can get exactly once delivery. See https://www.confluent.io/blog/exactly-once-semantics-are-pos....
They way they do it is by controlling the client and the server and sprinkling some write-ahead-logs with logical clocks and using a formal consensus protocol (paxos) under the hood. Even with all that, I'm skeptical of the exactly once claim.
We had the basics of execute-once using a leasing pattern, but we had a number of bugs related to multiple instances of a task existing in different threads (the executor would load the task object and then fork, leaving two instances in possibly stale states, and I also found failure paths that left multiple instances running), and we also saw a number of daily double-executions related to the lease-renewal process freezing, or non-transactional state transition.
We added a lot of state-transition auditing, including a pid/thread ID to find out where updates were coming from.
IIRC I eventually settled on having the executor (queue listener) do every possible check prior to execution (checking resource limits, process count limits, etc.) without loading the task instance itself (just the ID from the queue message). After the fork the child loads the task and does a single transaction that deletes the queue message and creates the execution record (the one-and-only-one run, basically). Every failure up to that point will requeue, but once the run is created, the queue message has to be deleted. We then transition to leasing the execution, and mark it failed if the lease expires.
We also created a centralized service to renew the leases on the execution objects after we found that to be a failure point. Long-running processes just have a lot of problems keeping connections open, etc.
for example, "10k+ jobs" sounds like a large number, but depending on throughout perhaps it is trivial.
i have a hobby project to fetch data from external sources and store the results in a database. this has about 70k different jobs defined. each job is scheduled to be run at some frequency. i run everything on a single physical box with 4gb of ram and a low energy CPU. The worker processes are python scripts, the job queue state is stored in the same postgres database i use to store results. My throughout is low, i only need to process a job every few seconds. The workers run on the same box as the database as I am too lazy to maintain more machines and too cheap to rent cloud servers. Running costs are about $15 / year for energy.
The queue implementation is based on this: https://blog.2ndquadrant.com/what-is-select-skip-locked-for-...
from memory i think I am using primitives offered by database to prevent multiple workers from acquiring the same job (transactions, transaction isolation). this might not be very scalable but i only have two worker processes and each job takes seconds to process.
do you want to do what I'm doing? Probably not. but perhaps what you are doing is easy.
However if you need 1) management, i.e. easy way to look at what went wrong and retry 2) dependencies between jobs, you should look into something like Airflow (instead of building your own): https://airflow.apache.org
It's also a very good example of architecture, in case you decide it's not good for you and you really want to build your own.
Python based:
http://www.celeryproject.org/ <-- async task queue
https://github.com/spotify/luigi <-- more pipe line centric
There are many like it. RQ(more bare bone)
It starts a network node in each instance of the service and binds all nodes into a cluster (discovery of other nodes is left for the service developer; simplest solution is to have a static list of node addresses). The sharding mechanism allows you to distribute arbitrary data objects among the nodes according to some rules (e.g. based on the value "object's hash modulo number of nodes in cluster", which produces a typical hashring). Data objects may originate on any node (e.g. on schedule or on some external event). They also need to be serializable to be passed between nodes and, obviously, the binary representation should not be too big. So, depending on the nature of jobs in your case, you may want to pass only the job ID as the data object and store the actual job definition and/or arguments in a separate place (e.g. a database).
Now, Akka will guarantee that the job will be run only by 1 node and run once (until completion that is; the job may be started more than once, see tips below). But if there are many jobs coming in, you risk overwhelming the cluster (because passing messages between nodes takes time, jobs themselves take time, etc.) So to defend your cluster against load, which it might not be able to handle, you may put a message queue in front of it. This will let you set up a max number of jobs that concurrently run in the cluster, and nodes will take new jobs from the queue only when some of the currently running jobs complete. Most MQs have persistence, so the jobs will be safe even if they need to wait for a while sitting in the queue.
Few tips:
1) Akka persists node's data, so the jobs that are taken from the queue are going to be safe in case the node they are located at fails
2) If a node leaves the cluster, all its' jobs will be moved to other nodes (according to the sharding rules mentioned above) and started over, so you may need to introduce some kind of transactions (same goes for cluster restart)
3) If the whole cluster is restarted, unfinished jobs will be distributed among the nodes according to the sharding rules, so it might be a good idea to make this rules "sticky", so that each individual job is always assigned to the same node (and each node will just have to load its' jobs from its' own persistent store). Otherwise there might be a lot of message passing which will slow down the startup.
Have a look on Tractor, Qube, Rush, Flamenco by blender