The big, impressive models all scale well for multi-customer setups because of the efficiency batching provides, but the base cost to run models like that, even as a small business, is incredibly high. If you can't saturate your LLM hardware almost 24/7, it takes a long time to earn back your investment, unless you settle for weaker models that are worse at the job.
But also the Strix Halo 128 is pretty hard to beat.
At the moment, LLM vendors are in market-grab mode and take a loss on heavy subscription users. They are starting to move toward profitability, but they have to do it carefully so a competitor doesn't steal their users, so we will still have "cheap" tokens for a while.
Even if prices go up a bit, they have scale in their favor to optimize costs.
If commercial model providers price themselves into "not competitive" territory compared to open models, wouldn't it always be cheaper to use an inference provider that serves open models? Those providers can take advantage of scale as well, and with no model moat, competition should keep prices honest.
And as a last resort, renting GPU time in the cloud seems like a safer bet to me than buying a GPU?
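For anyone who wants to plug in their own numbers, here is a rough back-of-the-envelope sketch of the two comparisons above: how long owned hardware takes to pay for itself at a given utilization, and how many hours of use it takes before buying beats renting. The function names and parameters are made up for illustration, and the figures in the example are placeholders, not real prices.

```python
import math


def payback_months(hardware_cost, monthly_running_cost,
                   monthly_value_at_full_load, utilization):
    """Months until the value of the work served covers the purchase price.

    utilization is the fraction of the month the hardware is busy (0.0 to 1.0).
    All inputs are assumptions you supply; nothing here is a real price.
    """
    monthly_net = monthly_value_at_full_load * utilization - monthly_running_cost
    if monthly_net <= 0:
        # Running costs eat everything: the hardware never pays for itself.
        return math.inf
    return hardware_cost / monthly_net


def rent_vs_buy_breakeven_hours(hardware_cost, owned_cost_per_hour,
                                rental_price_per_hour):
    """Hours of actual use after which owning is cheaper than renting."""
    saving_per_hour = rental_price_per_hour - owned_cost_per_hour
    if saving_per_hour <= 0:
        # Renting is already cheaper per hour, so buying never breaks even.
        return math.inf
    return hardware_cost / saving_per_hour


if __name__ == "__main__":
    # Placeholder numbers only, to show how utilization drives the result.
    for utilization in (1.0, 0.5, 0.1, 0.05):
        months = payback_months(10_000, 100, 1_100, utilization)
        print(f"utilization {utilization:.0%}: payback in {months:.1f} months")

    hours = rent_vs_buy_breakeven_hours(10_000, 0.5, 2.0)
    print(f"buying beats renting after about {hours:.0f} hours of use")
```

The point is just that payback time blows up as utilization drops, while renting only bills you for the hours you actually use.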