The per-token price war is obscuring the real opportunity – and the CIOs who see it first will pull ahead of those who don’t.
At Google I/O yesterday, Sundar Pichai mentioned a figure that should have made every strategist in the audience reach for a notepad. Google is now processing 3.2 quadrillion tokens a month, up from 480 trillion a year ago. That is close to a seven-fold increase in twelve months.
The day before, on Michael Dell’s keynote stage in Las Vegas, Jensen Huang told the audience that “we now have, for the very first time, useful AI” – and that demand was “going parabolic”.
Both statements are true. Together, they describe an opportunity that most organisations are currently misreading – and that is precisely where the gap opens up.
The dominant story in AI right now is that per-token prices keep falling. Google’s new Gemini 3.5 Flash, announced yesterday, was positioned at less than half the price of comparable frontier models. OpenAI, Anthropic and the rest have been making similar moves on cost all year. This is the line every vendor wants you to take away.
But the per-token price is the wrong number to track. The number that matters is tokens per business outcome – and the organisations that instrument that metric now are the ones that will scale AI profitably while their competitors are still arguing about invoices.

The compounding dynamic most teams haven’t priced in
An analyst asks your data warehouse a question and gets an answer. An agent doing the same job runs a retrieval step, a reasoning step, a tool call, a verification check, and probably a retry or two when something doesn’t quite work first time. Each of those steps can mean several model calls, and each model call burns tokens.
Dell put this in plain language at its conference this week: agentic architectures cause “token usage [to] compound at an accelerating rate, driving cloud costs that can quickly become unsustainable despite falling token prices.” That is a vendor admitting, in writing, that the raw economics of cloud-only AI reward the teams who architect thoughtfully over those who don’t.
The pattern is straightforward enough to sketch on a napkin. If your per-token cost halves but your per-task token count goes up tenfold because you’ve moved from “user asks chatbot a question” to “agent completes a piece of work end-to-end”, your bill goes up by a factor of five. The agentic shift is exactly that move. The teams building it carelessly will hit a wall. The teams building it well will have a structural cost advantage that compounds over time.
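The napkin arithmetic above can be written down in a couple of lines. This is a minimal sketch: the function name and parameters are illustrative, and it assumes spend per task is simply price per token multiplied by tokens per task.

```python
def bill_multiplier(price_factor: float, tokens_per_task_factor: float) -> float:
    """Relative change in spend per task.

    price_factor: new per-token price divided by the old one (0.5 = price halves).
    tokens_per_task_factor: new tokens per task divided by the old count.
    """
    return price_factor * tokens_per_task_factor


# The scenario from the text: price halves, tokens per task rise tenfold.
print(bill_multiplier(0.5, 10))  # 5.0
```

The same function makes the break-even point obvious: a halved price only holds the bill flat if tokens per task no more than double.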
Why the headline price obscures the real opportunity
Talk to any vendor and you’ll hear the same line: prices are dropping, the models are getting cheaper, this is a good time to expand. What they are less keen to discuss is that the cost per useful output – the actual finance question – is something the buyer has to calculate themselves. That gap between what vendors surface and what actually drives value is exactly where good architecture earns its keep.
There are three dynamics at play that the best teams are already turning to their advantage.
The first is reasoning depth. The newer “thinking” models produce better outputs by spending more tokens on internal reasoning before they answer. Understanding where that reasoning spend is justified – and where it isn’t – is one of the clearest levers available for improving the economics of any AI workflow.
The second is context bloat. Agents need context to make sensible decisions, and the path of least resistance is to give them more of it – whole document collections, full database schemas, every email in a thread. This is the same instinct that gave us 600-line PowerPoint decks. Organisations that treat the context window as a scarce resource rather than a buffer consistently get better outputs at a fraction of the token cost.
The third is loop cost. When an agent has to retry, double-check or correct itself – all things you actively want it to do – those attempts are billed individually. A small improvement in first-pass reliability can produce an outsized reduction in overall spend. That’s a quality and cost win at the same time.
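The loop-cost point can be made concrete with a small model. The sketch below rests on simplifying assumptions that are not in the text above: every attempt succeeds independently with the same probability, every attempt costs the same number of tokens, and retries stop at a fixed cap. The function names are hypothetical.

```python
def expected_attempts(first_pass_success: float, max_attempts: int) -> float:
    """Expected number of billed attempts per task.

    Assumes each attempt succeeds independently with the same probability
    and that the agent gives up after max_attempts.
    """
    if not 0.0 < first_pass_success <= 1.0:
        raise ValueError("first_pass_success must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    failure = 1.0 - first_pass_success
    # Attempt k+1 is made, and billed, only if the previous k attempts all failed.
    return sum(failure ** k for k in range(max_attempts))


def expected_tokens_per_task(
    tokens_per_attempt: float, first_pass_success: float, max_attempts: int
) -> float:
    """Expected token consumption per task, including retries."""
    return tokens_per_attempt * expected_attempts(first_pass_success, max_attempts)


# Compare two reliability levels for the same hypothetical 1,000-token attempt.
for success_rate in (0.7, 0.9):
    print(success_rate, expected_tokens_per_task(1_000, success_rate, max_attempts=4))
```

Real agents rarely behave this tidily, since a failed attempt often changes the odds of the next one, but even this rough model shows reliability and spend moving together.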

The metric that separates the programmes that scale from the ones that stall
If you’re a CIO right now and you want your AI investment to compound rather than just accumulate, the first question to ask isn’t “how do we cut costs?” – it’s “what does each pound of AI spend actually buy us?”
The right unit is tokens per resolved outcome. For a customer service agent, that means tokens per resolved ticket. For a sales assistant, tokens per qualified lead or closed deal. For an internal research agent, tokens per usable briefing document. For a coding assistant, tokens per accepted pull request.
These numbers are not surfaced natively by most AI platforms. You have to instrument them yourself, with a logging layer that ties token consumption back to the business event the workflow has actually completed. It isn’t a heavy engineering lift, but it does have to be a deliberate one. The teams that have this instrumentation in place can make architectural decisions with confidence. Everyone else is optimising by intuition.
This isn’t a finance exercise for the sake of it. The programmes that can articulate token cost per business outcome are the ones that get funded at scale six months from now, when boards start asking harder questions about returns – and the ones still expanding when competitors are explaining why their pilots never grew up.
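As a rough illustration of the logging layer described above, here is a minimal in-memory sketch. Everything in it is hypothetical: the `OutcomeLedger` class, its method names and the `run_id` convention are assumptions for illustration, not any platform's API. It assumes one ledger per workflow, and that every model call reports its input and output token counts.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class OutcomeLedger:
    """Links token usage to completed business events for one workflow."""

    tokens_by_run: dict[str, int] = field(default_factory=lambda: defaultdict(int))
    outcome_by_run: dict[str, str] = field(default_factory=dict)

    def record_call(self, run_id: str, input_tokens: int, output_tokens: int) -> None:
        # Called once per model call, keyed by the workflow run it belongs to.
        self.tokens_by_run[run_id] += input_tokens + output_tokens

    def record_outcome(self, run_id: str, outcome: str) -> None:
        # Called when the run completes a business event, e.g. "ticket_resolved".
        self.outcome_by_run[run_id] = outcome

    def tokens_per_outcome(self, outcome: str) -> float:
        # All tokens count, including runs that never reached the outcome:
        # abandoned and failed runs are still paid for.
        resolved = sum(1 for o in self.outcome_by_run.values() if o == outcome)
        if resolved == 0:
            raise ValueError(f"no runs recorded with outcome {outcome!r}")
        return sum(self.tokens_by_run.values()) / resolved


ledger = OutcomeLedger()
ledger.record_call("ticket-1", input_tokens=1_200, output_tokens=300)
ledger.record_call("ticket-1", input_tokens=400, output_tokens=100)
ledger.record_call("ticket-2", input_tokens=900, output_tokens=100)  # never resolved
ledger.record_outcome("ticket-1", "ticket_resolved")
print(ledger.tokens_per_outcome("ticket_resolved"))  # 3000.0
```

Charging the unresolved run to the numerator is a deliberate choice here: it keeps the metric honest about the cost of work that produced nothing. A production version would persist these records and join them to whichever system owns the business event.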
The architectural moves that change the maths
Once you have the metric, the architectural answers cluster into three areas – each of them a meaningful competitive lever, not just a cost control.
Task decomposition. A workflow doesn’t need the most capable model at every step. Use a small, fast model for routing and classification, and reserve the top-tier model for the steps that genuinely need its reasoning. This single change can drop the token spend on a workflow by 60-80% without measurable quality loss. Most enterprise agentic deployments don’t do it, because the initial proof-of-concept was built on whichever model was easiest to call. The organisations that do it have a durable cost advantage over those that don’t.
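A small sketch shows how that tiering changes the cost of a single workflow. The prices, step names and token counts below are invented for illustration and are not real vendor rates. The sketch also assumes a step consumes the same number of tokens whichever model runs it, which will not always hold.

```python
# Hypothetical prices per million tokens; illustrative only, not real vendor rates.
PRICE_PER_MILLION = {"small": 0.5, "frontier": 10.0}

# Hypothetical workflow steps: (name, tokens consumed, needs deep reasoning?)
STEPS = [
    ("route_request", 2_000, False),
    ("classify_intent", 3_000, False),
    ("retrieve_context", 5_000, False),
    ("draft_answer", 10_000, True),
]


def always_frontier(needs_reasoning: bool) -> str:
    """Baseline policy: every step goes to the top-tier model."""
    return "frontier"


def tiered(needs_reasoning: bool) -> str:
    """Decomposed policy: only reasoning-heavy steps use the top-tier model."""
    return "frontier" if needs_reasoning else "small"


def workflow_cost(steps, choose_model) -> float:
    """Total cost of one workflow run under a given model-selection policy."""
    total = 0.0
    for _name, tokens, needs_reasoning in steps:
        model = choose_model(needs_reasoning)
        total += tokens / 1_000_000 * PRICE_PER_MILLION[model]
    return total


baseline = workflow_cost(STEPS, always_frontier)
decomposed = workflow_cost(STEPS, tiered)
print(f"baseline={baseline:.4f} decomposed={decomposed:.4f}")
```

How much this saves in practice depends on how much of the workflow genuinely needs the larger model, which is a question for an evaluation set rather than a judgment call.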
Context discipline. Give the context window a budget instead of filling it by default. Retrieval should be ranked, filtered and trimmed before it reaches the model. Vector indexes need attention – poorly tuned retrieval feeds the agent too much, and the agent obediently processes all of it. A well-tuned retrieval pipeline pays for itself in tokens, on every call, indefinitely. A sloppy one charges you for the same inefficiency at the same rate, forever.
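The rank, filter and trim sequence can be sketched in a few lines. The `Chunk` type and `trim_context` function are hypothetical, and the sketch assumes the retriever already supplies a relevance score and a token count for each chunk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # relevance score from the retriever; higher is better
    tokens: int   # token count, assumed precomputed with the model's tokenizer


def trim_context(chunks: list[Chunk], min_score: float, token_budget: int) -> list[Chunk]:
    """Filter out weak matches, rank by score, then keep chunks that fit the budget."""
    ranked = sorted(
        (c for c in chunks if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    kept: list[Chunk] = []
    used = 0
    for chunk in ranked:
        if used + chunk.tokens > token_budget:
            continue  # too big for what is left; a smaller, lower-ranked chunk may still fit
        kept.append(chunk)
        used += chunk.tokens
    return kept


candidates = [
    Chunk("pricing policy", score=0.9, tokens=400),
    Chunk("full contract text", score=0.8, tokens=700),
    Chunk("refund clause", score=0.6, tokens=300),
    Chunk("unrelated memo", score=0.2, tokens=100),
]
selected = trim_context(candidates, min_score=0.5, token_budget=800)
print([c.text for c in selected])  # ['pricing policy', 'refund clause']
```

The greedy fill is one design choice among several. Stopping at the first chunk that does not fit is stricter and sometimes preferable when ordering matters more than coverage.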
Evaluation-driven model selection. Without an eval framework, teams pick models based on benchmark rankings, vendor relationships, or whichever one was discussed at the last conference. With a framework, you pick based on which model produces acceptable outputs for your specific task at the lowest cost. The gap between those two approaches, over a year, is the difference between an AI programme that scales profitably and one that gets quietly defunded.
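In code, the selection rule described above is short: among the models that clear your quality bar on your own evaluation set, take the cheapest. The sketch below assumes an eval run has already produced a pass rate and an average cost per task for each candidate. The model names and figures are placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str
    pass_rate: float      # share of eval cases judged acceptable, from 0 to 1
    cost_per_task: float  # average spend per task measured on the eval set


def cheapest_acceptable(results: list[EvalResult], min_pass_rate: float) -> EvalResult:
    """Return the lowest-cost candidate that meets the quality threshold."""
    acceptable = [r for r in results if r.pass_rate >= min_pass_rate]
    if not acceptable:
        raise ValueError("no candidate meets the quality bar")
    return min(acceptable, key=lambda r: r.cost_per_task)


# Placeholder figures for three hypothetical candidates.
results = [
    EvalResult("model-a", pass_rate=0.97, cost_per_task=0.12),
    EvalResult("model-b", pass_rate=0.95, cost_per_task=0.03),
    EvalResult("model-c", pass_rate=0.80, cost_per_task=0.01),
]
print(cheapest_acceptable(results, min_pass_rate=0.90).model)  # model-b
```

The threshold is the business decision here. Everything else follows mechanically once the eval set reflects the task you actually run.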

The procurement angle that most boards are leaving on the table
There’s a strategic observation worth raising at your next leadership meeting. If your organisation signed a three-year cloud commit in 2024, the usage forecasts in that contract were almost certainly built on the assumption that AI workloads would grow at 30-40% a year. The actual rate, going by Pichai’s figures of 3.2 quadrillion tokens a month against 480 trillion a year earlier, is closer to 570%.
This creates real leverage for organisations that move early. Your committed spend will be consumed faster than planned, which means you’ll be back at the negotiating table sooner – and the organisations that arrive with clear workload data and architectural options will negotiate from strength. The Google-Blackstone TPU joint venture announced this week, Anthropic’s $1.8bn deal with Akamai, and Dell’s push for on-premises agentic AI all point in the same direction: the next round of AI compute is going to be procured very differently from the last round. The boards that understand their workload economics now will shape those contracts on their own terms.
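To see how quickly a commit can be consumed under different growth assumptions, a rough burn-down calculation helps. This is a sketch under stated assumptions: spend compounds smoothly month by month, the commit size and starting spend are invented, and applying Google's aggregate token growth to a single organisation's bill is an illustration, not a forecast.

```python
from typing import Optional


def months_until_commit_exhausted(
    commit: float,
    first_month_spend: float,
    annual_growth: float,
    max_months: int = 120,
) -> Optional[int]:
    """Months until cumulative spend reaches the commit.

    Spend compounds monthly at the rate equivalent to annual_growth
    (0.35 means 35% a year). Returns None if the commit outlasts max_months.
    """
    if commit <= 0 or first_month_spend <= 0:
        raise ValueError("commit and first_month_spend must be positive")
    monthly_factor = (1.0 + annual_growth) ** (1.0 / 12.0)
    spent = 0.0
    monthly = first_month_spend
    for month in range(1, max_months + 1):
        spent += monthly
        if spent >= commit:
            return month
        monthly *= monthly_factor
    return None


# Hypothetical contract: 36 units of committed spend, starting at 0.5 units a month.
planned = months_until_commit_exhausted(36.0, 0.5, annual_growth=0.35)
# Growth implied by the token figures quoted earlier: 3.2 quadrillion vs 480 trillion.
observed_growth = 3.2e15 / 480e12 - 1
actual = months_until_commit_exhausted(36.0, 0.5, annual_growth=observed_growth)
print(planned, actual)
```

Running the same function against your own invoices gives a first estimate of when the renegotiation conversation is likely to arrive.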
Q&A: Token Economics for the Agentic Era
Per-token prices keep falling. Why isn’t our AI spend becoming more predictable?
Because the number of tokens consumed per task is growing faster than the per-token price is dropping. A workflow handled by an agent involves multiple model calls – retrieval, reasoning, verification, retries – where a workflow handled by a person with a chatbot involved one or two. Predictability comes from measuring tokens per outcome, not tokens per call. That’s the number that makes agentic economics legible.
How do we actually measure “tokens per business outcome”?
You have to instrument it. Most AI platforms don’t surface this natively, so it requires a logging layer that ties token consumption to the business event the workflow has completed – a closed ticket, a generated report, an accepted recommendation. It isn’t a heavy engineering lift, but it does need to be deliberate. Once you have it, every architectural decision becomes easier to justify – and every conversation about ROI has the right data behind it.
Should we move AI workloads on-premises to improve the economics?
For some workloads, yes – particularly anything where the demand curve is predictable and sustained. The economics shift in favour of on-premises once you have steady, consistent inference load. For bursty workloads, multi-modal generation, or anything still in an experimental phase, the public cloud usually still makes sense. Most enterprises will end up running a mix, and the mix should be driven by workload economics rather than vendor preference. Getting there intentionally is better than arriving by accident.
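A first-pass way to frame that decision is a break-even volume: the monthly token load above which a fixed on-premises cost is cheaper than paying per token in the cloud. The sketch below is deliberately crude. It assumes a flat amortised monthly cost for on-premises capacity, a constant marginal cost per million tokens on each side, and steady demand. All figures are hypothetical.

```python
from typing import Optional


def breakeven_tokens_per_month(
    onprem_fixed_monthly: float,
    onprem_cost_per_million: float,
    cloud_price_per_million: float,
) -> Optional[float]:
    """Monthly token volume above which on-premises is cheaper than cloud.

    Returns None when on-premises marginal cost is not below the cloud price,
    in which case the fixed cost is never recovered.
    """
    margin = cloud_price_per_million - onprem_cost_per_million
    if margin <= 0:
        return None
    return onprem_fixed_monthly / margin * 1_000_000


# Hypothetical figures: 50,000 a month in amortised hardware and operations,
# 1.0 per million tokens to run on-premises, 5.0 per million in the cloud.
print(breakeven_tokens_per_month(50_000, 1.0, 5.0))  # 12500000000.0
```

Bursty workloads break the steady-demand assumption, which is why they tend to stay in the cloud even when the average volume looks high enough on paper.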
We built our first agent on a top-tier frontier model. Where’s the optimisation opportunity?
Almost certainly in task decomposition. The opportunity tends to sit in the routing, classification and basic retrieval steps – the parts of a workflow where a smaller, cheaper model performs just as well. A decomposition review will typically surface 40-60% of token spend that could move to a lower tier without quality degradation. That’s where the quick wins are, and they compound across every workflow you run.
How do we get our finance team genuinely engaged with AI investment decisions?
Speak their language. Build a simple cost-per-outcome model for one workflow, show it alongside the alternative architectures, and let them see the leverage. Finance teams aren’t resistant to AI – they’re resistant to AI conversations that don’t have proper unit economics behind them. The programmes that get scaled are the ones that walk into budget conversations with numbers, not vision slides.
Working through this with Vertex Agility
The shift from per-token thinking to per-outcome thinking is a conversation our AI Consultancy practice is having with technology leaders almost daily right now. Some clients are mapping the architecture for their next agentic deployment and want to get the economics right from the start. Others have a working pilot that performs well but can’t yet make the unit economics convincing enough to justify scaling. And some are approaching cloud renewal conversations and want to understand their workload picture before they negotiate.
Our AI Consultancy works with organisations on AI strategy, custom development, model selection, and the architectural patterns – task decomposition, context discipline, evaluation frameworks – that create durable cost and performance advantages. Our Data Consultancy practice covers the data architecture beneath all of it, because most of the optimisation opportunity traces back to retrieval and context layers that weren’t designed for agentic consumption in the first place.
The combination matters. The architectural decisions that move your token economics in the right direction sit right on the boundary between data and AI, and trying to solve one without the other tends to produce temporary fixes rather than structural improvements.
If you’d like an honest assessment of where your AI programme stands today – and where the clearest opportunities are – we offer a free AI Readiness Mini Audit on our website. For something more in-depth, please feel free to get in touch with us directly below.