June 24, 2026

The AI Token Cost Problem Is a Design Flaw

Anmol Jawandha

Staff AI Engineer

TLDR: Uber burned through its entire 2026 AI coding budget in four months — then its COO admitted he couldn't link rising token consumption to shipped features. This post goes over the different sources of token waste and suggests how to design token-efficient systems that ultimately move business outcomes.

The Uber Problem Is Your Problem Too

Uber burned through its entire 2026 AI coding budget in the first four months of the year, then capped every employee at $1,500 per month per tool. The scale explains how. Uber has about 5,000 engineers, and 84–95% of them used these tools each month. Before the caps, individual engineers were running up $500 to $2,000 a month in tokens. The CTO reportedly spent $1,200 in a single two-hour demo.

The numbers are striking, but the revealing part is a quote from COO Andrew Macdonald. Asked whether all that token usage was actually improving Uber's products, he said: "That link is not there yet, right?" Uber was tracking volume, not value.

That's the real problem: most token spend is unattributed, and most token waste is baked into the architecture. So even though Gartner expects inference costs to drop more than 90% by 2030 as hardware improves and specialized chips come online, cheaper tokens won't fix the waste.

Why Agentic Systems Burn Tokens

Before getting into the fixes, it helps to understand why agentic token consumption is structurally different from chat-based consumption.

In a stateless chat interaction, each request is independent. In an agentic loop, every subsequent turn re-sends the full conversation trajectory — tool calls, results, prior reasoning — as context. This means a bloated tool output is a tax you pay on every turn that follows. 
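
To make that tax concrete, here is a rough arithmetic sketch. It assumes a simplified billing model in which every turn re-reads the whole trajectory as input tokens and nothing is cached; the function name and the token counts are illustrative, not measurements from any real system.

```python
def cumulative_input_tokens(system_tokens: int, turn_outputs: list[int]) -> int:
    """Total input tokens across a loop that re-sends the full history each turn.

    turn_outputs[i] is the number of tokens turn i appends to the trajectory
    (tool call, tool result, and model reasoning combined).
    """
    total = 0
    context = system_tokens
    for added in turn_outputs:
        total += context   # this turn re-reads everything accumulated so far
        context += added   # then the trajectory grows by this turn's output
    return total


# Ten turns that each add 500 tokens on top of a 1,000-token system prompt.
lean = cumulative_input_tokens(1_000, [500] * 10)

# Same loop, but the first tool call returns 5,000 tokens instead of 500.
bloated = cumulative_input_tokens(1_000, [5_000] + [500] * 9)

print(lean, bloated, bloated - lean)
```

Under these assumptions, the 4,500 extra tokens from one oversized tool result are paid for again on each of the nine turns that follow, which is why trimming early outputs matters more than trimming late ones.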

Beyond the context re-sending that the architecture itself requires, these are the other common sources of context bloat in practice:

  • Uncompressed payloads: returning full HTML, raw JSON blobs, or complete file contents when only specific excerpts are needed
  • Token-heavy noise: full URLs instead of compact link identifiers, verbose tool descriptions, repeated boilerplate in schemas
  • Over-engineered tool interfaces: complex multi-parameter schemas that force the model to spend tokens reasoning about which arguments to use and, more importantly, cost the agent valuable iterations
  • Context rot: when a long-running agent loses track of a fact it established earlier, re-researches it, and spends additional tokens recovering from a context management failure that didn't have to happen

Three Engineering Principles That Reduce Token Spend

Principle 1: Simple Abstractions

The model should spend its budget on reasoning about your problem, not on navigating your tool interface. Every parameter you add to a tool schema is a decision surface for the model, and decision surfaces cost tokens — both in prompt construction and in the model's reasoning about how to call the tool correctly.

The design target is tools that are minimal and semantically familiar. Create a small set of well-named primitives with narrow inputs and predictable outputs. Push complexity down into your API layer rather than surfacing it as model-facing parameters.
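
One way to picture this is a thin model-facing wrapper over a wider internal API. The sketch below is hypothetical throughout: `_backend_search`, its parameters, and the tiny in-memory corpus are stand-ins invented for illustration, not any real search API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    title: str
    snippet: str


def _backend_search(query, *, region, safe_mode, freshness_days, rerank, page, page_size):
    """Hypothetical stand-in for an internal API with many knobs."""
    corpus = [
        ("Token budgets", "How to cap agent spend per task."),
        ("Prompt caching", "Reusing a stable prompt prefix."),
        ("Batch inference", "Cheaper offline processing."),
    ]
    words = query.lower().split()
    hits = [
        SearchHit(title, snippet)
        for title, snippet in corpus
        if any(word in f"{title} {snippet}".lower() for word in words)
    ]
    return hits[:page_size]


def search(query: str, max_results: int = 3) -> list[SearchHit]:
    """Model-facing tool: two narrow inputs and one predictable output type.

    The remaining backend parameters are fixed here, in the API layer,
    so the model never has to reason about them.
    """
    return _backend_search(
        query,
        region="us",
        safe_mode=True,
        freshness_days=365,
        rerank=True,
        page=1,
        page_size=max_results,
    )


print(search("caching"))
```

The model only ever sees `search(query, max_results)`. If a backend default turns out to be wrong for some workload, that is a change in the wrapper rather than a new parameter in the tool schema.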

Principle 2: Dense Payloads

Your tool shouldn't return everything it could return. It should return only what's relevant to the current task.

In practice, this means a few things:

  • Right-size your extractors by document type. A structured data source (JSON API, database row) should return a compact, field-selective response. A web page should return a cleaned excerpt, not the full DOM. A PDF should return the relevant section, not the full text, even when headers, footers, and navigation have already been stripped.
  • Encode links compactly. Instead of returning full URLs in tool outputs—which are typically 60–150 characters of noise—assign short identifiers and maintain a citation map in your system prompt or a dedicated context slot. Something like [src:3] instead of https://example.com/very/long/path/to/document.html saves tokens on every subsequent turn that references that source.
  • Weave citation markers into the content itself. If you're building a research or retrieval pipeline, grounding claims in-line with compact markers keeps the context coherent without requiring the model to re-fetch attribution from somewhere else in the trajectory.
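
The link-encoding idea can be sketched in a few lines. This is a minimal illustration, assuming a simple regular expression is good enough to spot URLs in tool output; the `[src:N]` marker follows the example above, while the function name and the shape of the citation map are assumptions.

```python
import re
from typing import Optional

# Deliberately simple URL matcher; real tool output may need something stricter.
URL_PATTERN = re.compile(r"https?://[^\s\)\]>\"']+")


def compact_links(text: str, citations: Optional[dict] = None):
    """Replace each URL with a short [src:N] marker.

    Returns the rewritten text and a marker -> URL map. Passing the map back
    in on later calls keeps identifiers stable across turns.
    """
    citations = {} if citations is None else citations
    by_url = {url: marker for marker, url in citations.items()}

    def swap(match):
        raw = match.group(0)
        url = raw.rstrip(".,;:!?")       # keep sentence punctuation out of the URL
        trailing = raw[len(url):]
        if url not in by_url:
            by_url[url] = f"src:{len(by_url) + 1}"
            citations[by_url[url]] = url
        return f"[{by_url[url]}]{trailing}"

    return URL_PATTERN.sub(swap, text), citations


text = (
    "See https://example.com/very/long/path/to/document.html. "
    "Also https://example.com/a and "
    "https://example.com/very/long/path/to/document.html again."
)
compact, citation_map = compact_links(text)
print(compact)
print(citation_map)
```

The citation map is stored once (for example in a dedicated context slot), and the markers are expanded back into full URLs only when the final answer is rendered for the user.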

Principle 3: Budget as a First-Class Signal

Agentic systems often treat token budget as an infrastructure concern—something the billing system handles after the fact. A better design makes the agent budget-aware from turn zero.

This means passing the remaining token budget (or a proxy like turn count) as an explicit field in the system prompt or as a special context variable the model can read, so the agent can decide when to keep exploring and when to wrap up with what it has.
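
A minimal sketch of such a budget field follows. The `[budget]` line format, the 50% and 20% thresholds, and the guidance wording are all assumptions chosen for illustration; the right values depend on the agent and the task.

```python
def budget_header(token_budget: int, tokens_used: int, max_turns: int, turn: int) -> str:
    """Build a short budget line to prepend to the agent's context each turn."""
    remaining = max(token_budget - tokens_used, 0)
    turns_left = max(max_turns - turn, 0)
    fraction = remaining / token_budget if token_budget else 0.0

    # Illustrative thresholds, not tuned values.
    if fraction < 0.2:
        guidance = "Budget is nearly exhausted: stop exploring and answer with what you have."
    elif fraction < 0.5:
        guidance = "Budget is over half spent: prefer targeted tool calls over broad searches."
    else:
        guidance = "Budget is healthy: explore as needed, but keep tool outputs small."

    return (
        f"[budget] tokens_remaining={remaining} turns_remaining={turns_left}\n"
        f"{guidance}"
    )


print(budget_header(token_budget=100_000, tokens_used=85_000, max_turns=20, turn=17))
```

Because the header is regenerated every turn, it costs a few dozen tokens per call, which is small next to the exploration it can prevent once the budget runs low.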

Other Strategies That Don't Require Architecture Changes

  • Model routing/tiering: Not every task needs a frontier model. Classification, summarization, format normalization, and structured extraction all perform well on smaller, faster, cheaper models. Routing simple agentic sub-tasks to a small model while reserving the large model for complex reasoning steps can cut costs substantially, often with little or no measurable quality loss on those sub-tasks.
  • Prompt caching: If your system prompt is large and stable — detailed instructions, tool schemas, reference documents — most providers offer caching at the prompt level. Cached tokens are typically billed at a fraction of the cost of fresh input tokens.
  • Semantic caching: At the application layer, if a question or sub-task is semantically equivalent to one recently answered, returning the cached result skips the inference call entirely. This requires a similarity search layer, so it is generally worth the investment only at large scale, in high-throughput applications where queries cluster.
  • Batch processing: For non-interactive workloads — nightly analysis jobs, bulk document processing, evaluation runs — batch API endpoints are typically cheaper than synchronous inference. The latency trade-off is usually irrelevant for offline tasks.
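
Of the strategies above, semantic caching is the least obvious to wire up, so here is a toy sketch of the control flow. It assumes a bag-of-words vector as a stand-in for a real embedding model and a linear scan as a stand-in for a vector index; the class name, the 0.9 threshold, and the `call_model` callback are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold   # assumed cutoff; tune against real traffic
        self.entries = []            # list of (vector, answer) pairs

    def lookup(self, query: str) -> Optional[str]:
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


def answer(query: str, cache: SemanticCache, call_model: Callable[[str], str]) -> str:
    """Return a cached answer when one is close enough, else call the model."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    result = call_model(query)
    cache.store(query, result)
    return result


calls = []

def fake_model(query: str) -> str:
    """Placeholder for a real inference call; records how often it is hit."""
    calls.append(query)
    return f"answer-{len(calls)}"

cache = SemanticCache()
first = answer("What is prompt caching?", cache, fake_model)
second = answer("what is PROMPT caching", cache, fake_model)
third = answer("How do batch endpoints work?", cache, fake_model)
print(first, second, third, len(calls))
```

The threshold is the main design choice: set it too low and users get answers to questions they did not ask, set it too high and the cache rarely hits.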

Token Cost Is Not the Only Metric

The number that matters is cost per task — or better, if you can measure it, cost per business outcome. A task is whatever the agent was actually for: a passing test suite, a merged PR, a resolved support ticket, a clean data extraction. Pick the unit that maps to the work, then measure how many tokens it took to get there. 

Cost per task is the easy one to instrument because it's narrow, close to the tools, and it's the number engineers can move week to week — tokens per resolved ticket, per generated test, per document summarized, per successful tool call. 
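
As a rough sketch of that instrumentation, the snippet below rolls per-run token counts up into cost per successful task. The `Run` record, the task name, and the per-million-token prices are placeholders, not real usage data or any provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task: str            # the unit of work, e.g. "resolve_ticket"
    input_tokens: int
    output_tokens: int
    succeeded: bool


def cost_per_successful_task(runs, input_price_per_mtok: float, output_price_per_mtok: float) -> dict:
    """Dollars spent per successful outcome, grouped by task type.

    Failed runs still count toward spend, so retries and dead ends show up
    in the number instead of disappearing from it.
    """
    spend = {}
    wins = {}
    for run in runs:
        cost = (
            run.input_tokens * input_price_per_mtok
            + run.output_tokens * output_price_per_mtok
        ) / 1_000_000
        spend[run.task] = spend.get(run.task, 0.0) + cost
        wins[run.task] = wins.get(run.task, 0) + int(run.succeeded)
    return {task: spend[task] / wins[task] for task in spend if wins[task] > 0}


runs = [
    Run("resolve_ticket", 200_000, 10_000, True),
    Run("resolve_ticket", 400_000, 20_000, False),
    Run("resolve_ticket", 100_000, 10_000, True),
]
# Placeholder prices in dollars per million tokens.
report = cost_per_successful_task(runs, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(report)
```

Dividing by successes rather than by runs is the point of the exercise: an agent that fails cheaply and often can look efficient per call while being expensive per resolved ticket.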

Cost per business outcome is harder to measure and more valuable. Support deflection rate, time-to-merge, weekly impressions for a marketing team, site reliability for an infra team, and revenue per engineer are all hard to tie back to the tokens consumed upstream. The task metrics are your control surface, but the business metrics are what you're accountable for.

Once you measure outcomes, the finding is almost always two-sided: AI is creating real value, and token spend could be cut sharply without touching any of it. 
