Everything we’re using now is the equivalent of building a GPU on an FPGA: the hardware is general purpose at one abstraction level, and that comes with inefficiency at the next layer up. Collapse the levels, gain efficiency at the cost of generality.
To answer my own question, I bet they could figure out a way to still bill you per-token, if they wanted to.
And of course they could bill per-token, the same way cable PPV worked (the bits were already in your house). But the cost structure of weights in silicon means that competitors would be encouraged to undercut each other on that per-token price, since their marginal cost per token would be essentially zero.
I don’t see that being a durable business model, but I guess the counterargument is that it’s also similar to game consoles, where the initial hardware is subsidized and the business model assumes ongoing payment for bits.