Yeah, everyone wants to process their queue as fast as possible, but "as fast as possible" in practice means a cap on the maximum allowed delay. Otherwise, why stop at 30 workers? Go for 300. Or 3000?
Also, if the workers shared all the code, you could have used unicorn to fork the processes after code loading was complete. Thanks to copy-on-write, the 400MB per process would then instantly come down to something like ~10MB of unshared memory per process, at which point the rewrite could have been delayed for another year or so.
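The fork-after-load setup is just a couple of lines in a unicorn config. A minimal sketch (worker count and the ActiveRecord hooks are illustrative assumptions, not the poster's actual setup):

```ruby
# config/unicorn.rb -- illustrative sketch
worker_processes 30   # hypothetical count, matching the comment above

# Load the whole app in the master process *before* forking, so all
# workers share the loaded code pages via copy-on-write.
preload_app true

before_fork do |server, worker|
  # Sockets and DB connections must not be shared across the fork;
  # disconnect in the master so each worker opens its own.
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |server, worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```

With `preload_app true`, each worker's private memory is only what it writes to after the fork; the bulk of the loaded code stays in shared pages.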