The big question is: why does it need to run in 10 s? The main reason I can see is being able to run this analysis very frequently, but at that point your workload approaches a constant load anyway.
The total amount of data is 150 GB; that would easily fit into memory on a single powerful two-socket server with 20 cores, and the job would then run in less than 15 minutes. Hardware like that costs ~$6000 from Dell; assuming a five-year system lifetime and assuming (as you do) that the cost can be amortized across many jobs, it works out to roughly the same as the cloud, about $0.036 per analysis.
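The amortized figure is simple to check. A quick sketch of the arithmetic, using the assumptions above (~$6000 server, five-year lifetime, ~15-minute runtime, and enough jobs queued to keep the machine busy around the clock):

```python
# Back-of-the-envelope cost per analysis for a dedicated server.
# All inputs are assumptions from the comment above, not measurements.
server_cost_usd = 6000      # assumed Dell 2-socket, 20-core box
lifetime_years = 5          # assumed system lifetime
runtime_minutes = 15        # assumed per-analysis runtime

minutes_per_lifetime = lifetime_years * 365 * 24 * 60
analyses_per_lifetime = minutes_per_lifetime / runtime_minutes
cost_per_analysis = server_cost_usd / analyses_per_lifetime

print(f"${cost_per_analysis:.3f} per analysis")  # ≈ $0.034 at full utilization
```

That lands at about $0.034, in the same ballpark as the ~$0.036 cloud figure; power and hosting costs would close the rest of the gap.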
I'm fairly certain that, in the end, it's not more expensive for the customer to just buy a server to run the analysis on.
Edit: I see OP says 80% of the time is spent reading data into memory, at about 100 MB/s. Add $500 worth of SSDs to the example server above and the application runtime drops by >70%, which makes the dedicated hardware significantly cheaper still.
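The >70% claim follows from an Amdahl-style estimate. The 80% I/O fraction and the 100 MB/s baseline come from OP; the ~2 GB/s sequential read speed is my assumption for a modern NVMe SSD:

```python
# Amdahl-style estimate of the speedup from moving the I/O phase
# off a ~100 MB/s disk. NVMe throughput is an assumption, not a spec.
io_fraction = 0.80    # share of runtime spent reading data (from OP)
hdd_mb_s = 100        # current read throughput (from OP)
ssd_mb_s = 2000       # assumed NVMe sequential read throughput

io_speedup = ssd_mb_s / hdd_mb_s
new_runtime = (1 - io_fraction) + io_fraction / io_speedup

print(f"runtime reduced by {1 - new_runtime:.0%}")  # → 76%
```

Even with a slower SATA SSD at ~500 MB/s the reduction is still ~64%, so the conclusion doesn't hinge on the exact drive.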