A single user action can trigger anywhere from a few to dozens of LLM calls (tool use, retries, reasoning steps), and with token-based pricing the cost can vary a lot.
How are builders here planning for this when pricing their SaaS?
Are you just padding margins, limiting usage, or building internal cost tracking? Also curious: would a service that offers predictable pricing for AI APIs (like a fixed subscription cost) actually be useful for people building agentic workflows?
The unpredictability is worse than the absolute cost. Our billing model broke several times not because costs were high, but because we couldn't bound them. One approach that helped: define a 'token budget' per user action at design time - cap total tokens per session and treat hitting the cap as a first-class outcome your product handles gracefully, not an error.
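To make the idea concrete, here is a minimal sketch of a per-session token budget where hitting the cap is a normal return value rather than an exception. The class name, cap value, and step costs are all hypothetical, not from any particular product:

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    """Per-session token budget; exhausting it is a first-class outcome."""
    cap: int
    used: int = 0

    def try_spend(self, tokens: int) -> bool:
        """Record usage; return False once the cap would be exceeded."""
        if self.used + tokens > self.cap:
            return False
        self.used += tokens
        return True

# Hypothetical session: four agent steps against a 50k-token cap.
budget = TokenBudget(cap=50_000)
for step_tokens in [12_000, 18_000, 15_000, 9_000]:
    if not budget.try_spend(step_tokens):
        # Handle gracefully: summarize partial progress, don't raise.
        print(f"budget exhausted after {budget.used} tokens")
        break
```

The point of returning `False` instead of raising is that the product layer can branch into a "here's what we got so far" path instead of surfacing an error.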
On the forecasting side, we track cost per workflow step rather than per request. Step-level cost is much more stable than request-level because it absorbs the variance in tool calls and retries. Once you have step costs, you can forecast by expected workflow composition.
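A rough sketch of forecasting from step-level costs. The step names, per-step costs, and composition multipliers below are made up for illustration; the idea is that retries and tool loops get absorbed into the expected-executions multiplier per step:

```python
# Hypothetical per-step average costs (USD), measured from production.
step_cost = {"classify": 0.002, "retrieve": 0.004, "draft": 0.015, "review": 0.006}

# Expected workflow composition: step -> expected executions per run
# (retry and tool-loop variance is absorbed into these multipliers).
composition = {"classify": 1.0, "retrieve": 2.3, "draft": 1.0, "review": 1.4}

expected_run_cost = sum(step_cost[s] * composition[s] for s in composition)
print(f"expected cost per run: ${expected_run_cost:.4f}")
```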
On fixed subscription pricing for AI APIs - I'd actually pay a premium for that. The unpredictability creates a hidden cost: you over-provision margins and add complexity to your pricing tier design. A flat rate for a capacity bucket would eliminate both.
The question I'd ask about any such service: how do they handle the tail cases where agents go off the rails and rack up 10x normal token usage? That's where the cost risk actually lives.
most observability tools show the LLM call as one flat span. you can see it cost X tokens, but you can't correlate it with the API request that triggered it, or see that the agent looped 4 times because the first 3 outputs failed validation. so you end up building custom logging and hoping the numbers add up.
we've been building an APM (immersivefusion.com) where cost is a first-class dimension on every trace. so you can see one request flow from the UI, through your backend, into the agent workflow, and each span carries its token cost. the idea is you should be able to answer "what does a checkout cost when the recommendation agent is in the loop" without stitching together 3 different tools.
for the forecasting question specifically, i think the answer is you need a few weeks of production data with good instrumentation, and then you can build a distribution. the variance is real, but it's not random: it's usually a few specific flows that blow up (retries on bad structured output like @hkonte mentioned, or RAG queries that hit the wrong chunk size). once you can see which flows are expensive, the guardrails become obvious.
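one way to use that distribution once you have it: price against a mid-tail percentile and put guardrails against the far tail. the samples here are synthetic (a seeded lognormal standing in for a few weeks of per-request cost data), so the numbers are purely illustrative:

```python
import random
import statistics

random.seed(42)
# Synthetic stand-in for observed cost-per-request samples (USD):
# mostly cheap, with a heavy tail from retry loops and oversized contexts.
samples = [random.lognormvariate(-4.5, 0.8) for _ in range(5000)]

# quantiles(n=100) returns the 99 percentile cut points.
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50=${p50:.4f}  p95=${p95:.4f}  p99=${p99:.4f}")
# Price tiers against p95; aim budget caps at the p99+ tail.
```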
also wrote a longer piece on this if anyone's interested: immersivefusion.com/blog/end-to-end-observability-from-ui-to-ai-agent-to-invoice
We ran into the same issue and ended up building https://oxlo.ai to make the cost side more predictable for agent workloads.
We usually look at cost per workflow run, runs per active account, and the heavier paths separately, then keep retries and tool calls as their own line items. That makes the pricing side easier to reason about.
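a tiny sketch of what keeping retries and tool calls as separate line items can look like. the ledger entries and costs are hypothetical:

```python
from collections import Counter

# Hypothetical per-run ledger: (line_item, usd_cost) per billable event.
ledger = [
    ("llm_step", 0.012),
    ("tool_call", 0.001),
    ("retry", 0.004),
    ("llm_step", 0.020),
    ("tool_call", 0.001),
]

# Roll up by line item so retries surface separately in reporting.
totals = Counter()
for item, usd in ledger:
    totals[item] += usd

run_cost = sum(totals.values())
print(dict(totals), f"total=${run_cost:.3f}")
```

keeping retries as their own line item means a spike in retry spend shows up directly instead of hiding inside an averaged per-run cost.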
This takes a couple of hours at most.
and this topic actually inspires me to add a built-in gas meter for tokens