cache hit rate alone would stand out
For a specific harness, they've all found ways to optimize to get higher cache hit rates with their harness. Common system prompts and all, and more and more users hitting cache really makes the cost of inference go down dramatically.
What bothers me about a lot of the discussion about providers disallowing other harnesses with the subscription plans around here is the complete lack of awareness of how economies of scale from common caching practices across more users can enable the higher, cheaper quotas subscriptions give you.
I’m sure they went with reusing the generic completions API to iterate faster and make it easier to support both subscription and pay-per-token users in the same client, but it feels like they’re burning trust/goodwill when a technical solution could at least be attempted.