I just had to implement GPU clustering in my inference stack to support Llama 3.1 70B, and even then I needed 2x A100 80GB SXMs.
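For anyone wondering why 70B needs two 80GB cards, here's a rough back-of-envelope in Python (a sketch only; the bytes-per-parameter figures are assumptions, and KV cache plus runtime overhead add more on top):

    # Rough VRAM needed just to hold a dense model's weights, ignoring KV cache
    # and framework overhead. Bytes-per-parameter figures are approximations.
    def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
        return params_billion * 1e9 * bytes_per_param / 1e9

    print(weight_vram_gb(70, 2.0))   # ~140 GB at fp16/bf16 -> doesn't fit on one 80GB A100
    print(weight_vram_gb(70, 0.5))   # ~35 GB at int4 -> roughly fits on a single 48GB card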
I was initially running my inference servers on fly.io because they were so easy to get started with, but I eventually moved elsewhere because the prices were so high. I pointed out to someone there who e-mailed me that it was really expensive compared to other providers, and they basically just waved me away.
For reference, you can get an A100 SXM 80GB spot instance on Google Cloud right now for $2.04/hr ($5.07 on-demand).
An H100 will also be much faster, especially if you are willing to use fp8. Maybe 3-4x.
Savage.
I wonder if we’ll see a resurgence of cloud game streaming.
Amazon’s g6 instances are L4-based with 24GB of VRAM, half the capacity of the L40S, with SageMaker on-demand prices around this rate. Vast.ai is cheaper, though it works a bit more like bidding and availability varies.
That's the medium Llama. Does anyone know if an L40S would run the 405B version?
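Back-of-envelope, it looks unlikely on a single card (a sketch; the bytes-per-parameter figures are rough assumptions, and KV cache adds more): even heavily quantized, the 405B weights alone exceed an L40S's 48GB several times over, so it would need a multi-GPU setup.

    # Approximate weight-only memory for a 405B-parameter model at different precisions.
    params = 405e9
    for label, bytes_per_param in [("fp16", 2.0), ("fp8", 1.0), ("int4", 0.5)]:
        gb = params * bytes_per_param / 1e9
        print(f"{label}: ~{gb:.0f} GB of weights")  # ~810 / ~405 / ~202 GB, all >> 48 GB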
One cost factor we have that other providers might not (I'd love to know): we have to dedicate individual racked physical hosts to each group of GPUs we deploy, because we don't (or can't, depending on how you think about systems security) allow GPU-enabled workloads to share hardware with non-GPU workloads, and we don't allow anyone to share kernels.
But like we said in the post: we're still figuring this stuff out. What we know is: at the same price level, we're consistently sold out of A10 inventory.
Ya, that's a no from me.
They run on literally anything someone installs their agent on.
This all happened because we were having internal meetings about trying to find A10s to rack, and Kurt stopped and said "wtf are we doing?"
If it'll make you feel better, we'll continue to charge you the previous list price for L40S GPU hours.
Nice business to be in, I guess.