You're forgetting a critical factor: concurrency. Hardware that serves a single request at 150 tokens/s can also serve 20-30 concurrent requests at 100 tokens/s each. Suddenly your $5K/month becomes $100K/month, enough to recoup the cost of the hardware in a year or so.
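For concreteness, here's a minimal sketch of that arithmetic. The token rates, the 20-30 request range, and the $5K/month baseline are the figures quoted above; they're illustrative, not measurements of any particular GPU.

```python
# Back-of-the-envelope: aggregate throughput and revenue when batching,
# using the illustrative figures from the comment above.

single_req_tok_s = 150          # one request, full speed
batched_req_tok_s = 100         # per-request speed when batching
baseline_monthly_usd = 5_000    # assumed revenue serving one request at a time

for concurrent_reqs in (20, 30):
    aggregate_tok_s = batched_req_tok_s * concurrent_reqs
    speedup = aggregate_tok_s / single_req_tok_s
    print(f"{concurrent_reqs} reqs -> {aggregate_tok_s} tok/s total, "
          f"{speedup:.1f}x, ~${baseline_monthly_usd * speedup:,.0f}/month")
```

At the upper end (30 requests) that's a 20x throughput multiplier, which is where the $100K/month figure comes from.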
The reason this works: each time you read the model weights from memory (the memory-bound part) to calculate the next token, you can compute that token for many requests at once (the compute-bound part) at little extra cost. It's also much more energy-efficient per token.
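A rough roofline-style sketch of why the extra requests are nearly free, using assumed round numbers for a hypothetical 7B-parameter model and GPU (not the specs of any particular device):

```python
# Per decode step you must read every weight once (memory bound); the
# matmul work scales with batch size (compute bound). Whichever is
# slower sets the step time. All numbers below are assumptions.

weights_bytes = 14e9        # ~7B parameters in fp16
mem_bandwidth = 2e12        # bytes/s of memory bandwidth (assumed)
compute = 300e12            # FLOP/s available for matmuls (assumed)
flops_per_token = 2 * 7e9   # ~2 FLOPs per weight per generated token

def decode_step_time(batch_size: int) -> float:
    """Time for one decode step across the whole batch."""
    memory_time = weights_bytes / mem_bandwidth
    compute_time = batch_size * flops_per_token / compute
    return max(memory_time, compute_time)

for b in (1, 8, 32, 128):
    t = decode_step_time(b)
    print(f"batch {b:3d}: {1/t:6.0f} steps/s -> {b/t:8.0f} tok/s total")
```

With these numbers the weight read (~7 ms) dominates up to a batch of roughly 150, so per-request speed barely drops while total tokens/s scales almost linearly with the batch.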