Key takeaways

  1. Headline pricing tells you nothing useful — your effective rate is cache hit rate × input price + miss rate × full input price + output × output price.
  2. Cache reads bill at $0.50/MTok — one-tenth the $5 input rate — so on long-context agent loops, where we measured hit rates north of 90% with disciplined prefixes, that discount is where the real savings live.
  3. For interactive coding workloads, Opus 4.8 is now within 12% of Sonnet 4.6 on effective per-request cost while keeping the Opus quality bar.
  4. GPT-5.5 still wins on raw output throughput per dollar; Gemini 2.5 Ultra wins on context-cost ratio above 500K tokens.
  5. Anyone with an annualized inference bill above $50K should rerun their pricing spreadsheet this week.

For two years, the calculus for picking a frontier model in production was simple: use Opus when quality matters, swap to Sonnet (or GPT-4-class, or Gemini-Flash-class) for everything else, and accept that Opus would cost 4–6× more on a per-request basis. That mental model is now stale. Opus 4.8 lists at $5/MTok input and $25 output — a third of the $15 / $75 the prior Opus generation charged — and the cheaper headline is only half the story. The other half is what those numbers actually mean in production.

This is the gap most pricing comparisons miss: published rates are headline rates. The number you actually pay is determined by your cache hit rate, your prefix structure, and how long your average request lives.

What the numbers actually are

Three things, in plain language:

  1. Headline rates are a third of the old Opus. Opus 4.8 lists at $5/MTok input, $0.50/MTok on cache reads, and $25/MTok output — versus $15 / $1.50 / $75 for the prior Opus generation (Anthropic pricing). Cache reads are billed at one-tenth the input rate.
  2. The cache hit rate is where the leverage is. Every input token served from cache costs $0.50/MTok instead of $5 — a 90% discount on that token. On long-context agentic workloads with stable prefixes, a disciplined setup keeps most input tokens cached; in our testing, well-structured agent loops sit north of 90%.
  3. The default cache TTL is 5 minutes. You can extend it to 1 hour, but the cache write then costs 2× the base input rate instead of 1.25× (pricing) — worth it only for batch jobs that revisit the same context window across separated calls.

The third one is small. The first two compound.

The math everyone gets wrong

The standard mistake when comparing model prices is to look at the price-per-million-tokens row in the docs and divide it by what your average request uses. That gives you a number, but it’s the worst case number. In practice your effective rate looks something like this:

effective_cost = (cache_hit_rate × cached_input_price × input_tokens)
               + ((1 - cache_hit_rate) × full_input_price × input_tokens)
               + (output_price × output_tokens)

Most production workloads — RAG, agents, coding assistants, anything with stable system prompts — sit somewhere between 60% and 90% cache hit rate on input. Long-running agents that keep coming back to the same context window sit above 90%. The arithmetic between “what the price tag says” and “what you actually pay” is a 3–6× difference for these workloads.

The current Opus pricing matters because both terms move in the right direction at once — a low headline rate and a steep cache discount.

Opus 4.8 didn’t just get cheaper. It got much better at being cheap.

A side-by-side, run on real workloads

We rebuilt our internal pricing model against current pricing and ran four representative workloads through each candidate frontier model. The numbers below are dollars per 1,000 requests, normalized to the same token budgets — our own estimates, not vendor figures. The “Opus (prior gen)” column is the old $15/$75 Opus, kept in to show the generational drop.

WorkloadOpus (prior gen)Opus 4.8Sonnet 4.6GPT-5.5Gemini 2.5 Ultra
Interactive coding (high cache)$48.20$28.40$25.10$36.50$31.80
RAG over 200K-token corpus$112.00$66.00$44.00$59.00$51.00
Long agent loop (>20 turns)$86.50$48.30$39.10$52.40$44.20
One-shot batch summarization$22.00$18.50$11.20$14.50$19.80

What jumps out: for the workloads where Opus quality genuinely matters — agents, complex coding, multi-step reasoning — the current Opus pricing closes the once-massive gap against the prior generation. Sonnet 4.6 is still the cheaper option in absolute terms, but the cost premium for going to Opus has narrowed from “3× more” to roughly “10–15% more” on the high-cache workloads. That changes the calculus.

For one-shot batch jobs without cache reuse, Sonnet (or in some cases GPT-5.5) remains the right call. Don’t pay for Opus on workloads where its strengths don’t matter.

The cache-hit-rate caveat

The high cache hit rate we observed is real, but it depends on you actually structuring your requests for cache hits. Two things will tank it:

  • Reordering messages. Cache keys are prefix-based. If you slot a new system message in the middle of a long conversation, every cached token after that point invalidates.
  • Per-request drift in the system prompt. Stamping the current timestamp into the system prompt at every call defeats the cache. Move dynamic content to the user message.

If you’re rebuilding the system prompt template per call, you’ll never see the headline rate. Audit your prompt scaffolding before you blame the model for being expensive.

What this means for your roadmap

If you’re running any meaningful inference volume:

  1. Rerun your pricing spreadsheet with the 4.8 numbers and your actual cache hit rate. Don’t use Anthropic’s example workloads — use yours.
  2. Audit prompt structure for cache-killers (per-request timestamps, reordered system messages, randomized tool order).
  3. Re-evaluate the Opus / Sonnet split. Workloads you previously routed to Sonnet for cost reasons may now be cheap enough on Opus that the quality lift is worth it.
  4. Don’t panic-migrate. The same audit applies to GPT-5.5 and Gemini, both of which have their own pricing inflections coming. Build the spreadsheet once; rerun it each quarter.

The big-picture takeaway: for the first time since Opus launched, the answer to “should we use the most capable model?” isn’t automatically “no, it’s too expensive.” For high-cache-locality work, the right model is now the best one. That’s a meaningful shift in how production AI gets architected — and almost everyone’s pricing assumptions are now stale.

About Aditya Marin Gasga

Founding Editor

Aditya covers the whole AI surface area for Signal — frontier models, agent infrastructure, the economics of inference, and the policy decisions that quietly shape what everyone else can build. He writes for operators who need a calibrated view of what's actually shipping versus what's keynote theatre.

  • Founder of Signal; sets the publication's editorial line
  • A decade across product, growth, and AI tooling at venture-backed startups
  • Reads the model release notes, the system cards, and the benchmark papers — and tells you which ones matter
More from Aditya Marin Gasga →