//pragmatic leaders

latency and cost

5 min left0%
latency and cost0%
5 min left
A feature that costs $0.08 per query at 1,000 users costs $80,000 per month at 1,000,000. Most AI features are shipped without a cost model. Most AI teams face an uncomfortable conversation with finance six months after launch.
Talvinder Singh, Pragmatic Leaders

Token economics are real. They don't feel real when you're building — the OpenAI dashboard is abstract, the numbers are small, and the early user volume is tiny. They become real when you scale. This page gives you the cost model you should have before you ship, not after you get the finance email.

Reading a model pricing page

Model pricing is almost universally quoted as cost per million tokens — for input (the prompt) and output (the generated response) separately. Output tokens are always more expensive than input tokens because generation is more compute-intensive than reading.

2026 reference pricing (approximate, check current pages):

ModelInput (per 1M tokens)Output (per 1M tokens)Best for
GPT-4o-mini$0.15$0.60High-volume classification, simple extraction
Claude 3.5 Haiku$0.25$1.25High-volume tasks, good reasoning/cost ratio
GPT-4o$2.50$10.00Complex reasoning, generation quality
Claude 4.5 Sonnet$3.00$15.00Complex reasoning, long context
GPT-5$10.00$30.00+Frontier reasoning, complex agent tasks
Claude 4.5 Opus$15.00$75.00Maximum quality, low-volume high-stakes

A token is roughly 0.75 words. 1,000 words ≈ 1,333 tokens.

The cost-per-query calculation:

cost_per_query = (avg_input_tokens / 1M × input_price) + (avg_output_tokens / 1M × output_price)

For a customer support draft generator with a 500-token system prompt, a 200-token user message, and a 300-token output, using GPT-4o:

Input cost: (700 tokens / 1M) × $2.50 = $0.00175
Output cost: (300 tokens / 1M) × $10.00 = $0.003
Total per query: $0.00475

At 100k queries/month: $475/month. At 1M queries/month: $4,750/month.

Now model the same feature at 10M queries/month (a successful scaled product): $47,500/month. This is where you want to have already switched to a cheaper model for the cases that don't require GPT-4o quality.

The trap: teams estimate cost at their current volume and fail to model what happens at 10x. Build the cost model at 1x, 10x, and 100x before committing to an architecture.

The four cost levers

You have four levers for controlling AI inference cost. Use them in order of simplicity.

1. Model tier

The most impactful lever. A 10x cheaper model (GPT-4o-mini vs GPT-4o) that performs equally well on your specific task is 10x cheaper. Many tasks don't need frontier-model quality.

The practical test: benchmark your task on mini-tier and full-tier models against your golden eval set. If the quality difference is < 5% on dimensions users care about, use the cheaper model. "We use GPT-4o because quality matters" is not a cost argument — it's a default that hasn't been tested.

2. Prompt length

Every token in your prompt costs money. Every unnecessary token in a long system prompt is a tax on every query. A 2,000-token system prompt costs $5/1M input tokens on GPT-4o — at 1M queries, that's $5,000/month just in system prompt overhead.

Audit your prompts. Remove instructions that don't change behavior. Move static context into fine-tuning if query volume is high enough. Use prompt compression (summarizing or trimming examples to the minimum that preserves quality).

3. Prompt caching

Prompt caching (available from Anthropic and OpenAI as of 2025) allows you to mark parts of a prompt as cacheable. When a cached segment is sent again — same bytes, same position — the API doesn't re-process it, and charges a reduced rate (typically 50-90% lower than full input pricing).

When caching saves significant money:

  • You have a long system prompt (500+ tokens) that is static across requests
  • You inject large reference documents into every prompt (documentation, product catalog, knowledge base context)
  • You have multi-turn conversations where earlier turns repeat on each API call

When caching saves little:

  • Your prompts are short (< 200 tokens) — the cache minimum is typically 1,024 tokens for Anthropic, 512 for OpenAI
  • Your prompts change significantly per request (high dynamic content)
  • Low query volume — cache benefits scale with volume

A concrete example: if your RAG system injects 3,000 tokens of retrieved context that is the same across a session, caching those tokens at Anthropic's 90% discount drops your input cost from $3.00 to $0.30 per 1M tokens for that segment. At 100k queries/month with 3k-token context, that saves roughly $810/month.

4. Model cascading (routing)

Cascading is the architecture where you route requests to different model tiers based on estimated complexity. Simple requests go to a cheap model; complex requests escalate to an expensive model.

A basic cascade:

  1. Run the request through a cheap classifier or the cheaper model
  2. If the output confidence is high and the task is simple, return that output
  3. If the output is flagged as low-confidence, complex, or a sensitive topic, escalate to the expensive model

Example routing logic for a customer support feature:

  • FAQ / known-answer questions → GPT-4o-mini ($0.00075/query)
  • Complex multi-step questions → GPT-4o ($0.00475/query)
  • Policy / compliance questions → GPT-4o with explicit grounding + human review

If 70% of queries are FAQ-level, this cascade cuts average cost by roughly 60% vs. running everything through GPT-4o.

The engineering cost: cascading adds a classification step, increases system complexity, and requires eval infrastructure for both tiers. It's justified once your monthly AI cost exceeds ~$5,000 and you have a large, heterogeneous query distribution. Below that threshold, optimize prompt length first.

Real latency targets

Latency is the other side of the cost/quality triangle. Here are real-world P50/P95 targets for common patterns in 2026:

PatternP50 targetP95 targetNotes
Single-call chat response (GPT-4o, ~300 output tokens, streaming)1.2s to first token, 3-4s complete2s / 8sStreaming first-token latency is what users feel
RAG pipeline (retrieval + generation)2-3s to first token4-6sRetrieval adds ~500ms-1s depending on vector store
Single-step agent tool call5-8s15sEach additional tool call adds ~2-4s
5-step agent task15-25s45sPush to async UX at this range
Document analysis (20-page PDF, single call)8-15s25sAcceptable for async; too slow for interactive

First-token latency is what users perceive as "response time" in streaming interfaces. The total generation time matters for how long users wait overall, but the first-token latency determines whether the interface feels responsive. A response that starts streaming in 0.8 seconds and takes 6 seconds to complete feels faster than one that starts streaming in 3 seconds and completes in 4 seconds.

The latency-cost-quality tradeoff. Smaller models are cheaper AND faster. The decision is usually: does the quality loss from a cheaper, faster model matter for this use case? Test empirically against your golden set rather than assuming quality requires the expensive model.

Building a real cost model

Before you ship an AI feature, build this table:

ScenarioMonthly query volumeAvg input tokensAvg output tokensModelCost/queryMonthly cost
Current (MVP)10,000700300GPT-4o$0.005$50
10x growth100,000700300GPT-4o$0.005$500
100x growth1,000,000700300GPT-4o$0.005$5,000
100x with cascade1,000,000variesvaries70% mini / 30% GPT-4o~$0.002~$2,000

Add: embedding costs (if RAG), reranker costs, external tool API costs, your vector DB hosting cost. AI features often have a stack of costs beyond the LLM call itself.

The business model check: at your target scale, does the AI feature cost fit within your unit economics? If you're charging ₹499/month for a product and the AI feature costs ₹120/user/month at 10x growth, you have a margin problem that price increases or cascade architecture must solve before you hit that scale.

What to do this week

  1. Run the cost model for one AI feature. Measure your actual average input and output tokens from logs or test runs. Apply current pricing. Project at 10x and 100x current volume. Write down the number.

  2. Check whether you have any prompt caching implemented. If you have system prompts over 1,024 tokens and you're running on Anthropic or OpenAI, caching is likely available and not enabled. Enabling it is usually a one-line change.

  3. Run a quality comparison: mini vs. full model. Take 20 examples from your golden set. Run both models. Score quality on your dimensions. If the quality delta is < 5%, you have a case for switching or cascading.

Where to go next