Key takeaways

  1. Prefix caching is character-exact. One dynamic character in the middle of your system prompt resets you to the full input rate for every token after it.
  2. The three most common cache-killers don't show up in any dashboard: per-request timestamps, per-user template variables, and silent tool-array reorders.
  3. A '90% hit rate' reported by your logging stack can easily mean 30% on actual API spend.
  4. The fix is almost always the same: move the dynamic content out of the system prompt and into the user message.
  5. A five-minute audit on your live system prompt usually saves 30-50% on a real workload's inference bill.

A founder I know spent thirty minutes last Friday explaining to me why his team’s API bill couldn’t possibly be right. Their logging dashboard showed a 91% prefix cache hit rate across their main agent loop. Anthropic’s invoice arrived; the number their CFO sent back assumed something closer to 40%. He was sure it was a billing bug. It wasn’t.

It almost never is.

The gap between the cache hit rate your stack reports and the cache hit rate you actually pay for is the most common AI infra leak I see at small companies. It’s invisible from the inside, costs real money, and the fix takes an afternoon. The reason it persists is structural: the things that kill your cache the most don’t show up in the metric your stack measures.

The number you report vs. the number you pay

A normal logging setup counts a cache hit when the API response says one. That’s correct for individual requests but wildly misleading at the aggregate. Prefix caching is character-exact: the provider hashes your input from position zero through some boundary, looks up that hash, and either serves the cached prefix at the discounted rate or doesn’t. The moment a single character in your prompt differs from what was cached, every token from that character to the end pays the full rate.

So when your dashboard reports “91% hit rate,” what it actually measured was: 91% of requests had some cached prefix. That’s a different number. The question that determines your bill is how much of each request paid the cached rate — and that’s a weighted average of token positions, not a percentage of requests.

A request that hits 95% of the cached prefix and then diverges at position 4,800 of 5,000 is reported as a “cache hit.” A request that diverges at position 200 of 5,000 is also reported as a hit. The first one paid the cached rate on 4,800 tokens; the second on 200. Same boolean, very different bills.

The math gets worse the longer your system prompts get. Opus 4.8 lists cached input at $0.50 / Mtok and full input at $5 / Mtok — a 10× gap. On a 5,000-token system prompt with a single dynamic character at position 200, that’s 4,800 tokens paying $5/Mtok instead of $0.50/Mtok. Per request, the difference is small. At 100,000 requests a month, it’s roughly $2,200 you didn’t budget for.

The first killer: timestamps you didn’t put there

The most common offender is a current-time string in the system prompt. Often nobody on the team put it there on purpose — it was in the template scaffolding from an earlier prototype, or a well-meaning framework auto-injects it as a “context aid,” or a long-ago datetime.now() survived a refactor inside a Python f-string and nobody noticed.

It looks like this in a real prompt:

You are a customer support agent for Acme Software.
Current time: 2026-05-31T14:23:55Z
Request ID: req_abc123
...

That second line resolves to a different string on every request. The cache key matches your prompt up through “You are a customer support agent for Acme Software.” — and then diverges. Every token from “Current time” forward pays the full input rate. If your real system prompt is 2,000 tokens, you just lost the cache on ~1,990 of them.

The fix is mechanical. Move the timestamp into the user message. If the model actually needs the current time for the task — and ask honestly, because in most workflows it doesn’t — pass it after the cacheable system instructions.

The second killer: template variables that look harmless

Worse than timestamps because they look like normal templating: {{user_id}} or {{tenant}} or {{session.email}} baked into the system prompt itself. The reasoning is usually “we want the model to know who it’s talking to.” The cost is that every user gets their own cache prefix. You’ll only see cache hits across repeat calls from the same user, not across users.

For a customer-facing app with 50,000 monthly active users, this is the difference between a single cached prefix shared by all of them and 50,000 separate prefixes that each need to warm up. Most warm-up traffic is below the cache TTL threshold, so most users functionally never hit a warmed cache. Your dashboard still says 50% hit rate (because power users hit on their own repeat calls), but the population-wide effective rate is far lower.

The pattern is the same fix: the per-user identifier goes into the user message, not the system prompt. The system prompt should be byte-identical for every user of the application.

If the model genuinely needs to personalize behavior based on the user — paid tier vs free tier, account age, language preference — that information also belongs in the user turn, structured as context the model can read. Don’t bake it into the cacheable instructions.

The third killer: tool arrays your framework reorders

This one is invisible without going looking. Many agent frameworks (and several mainstream SDKs) accept a tool list and pass it to the model API in whatever order Python iterated the dictionary, or whatever order JavaScript serialized the set. That order is mostly stable within a runtime, but not guaranteed across deploys, library upgrades, or — in some cases — across requests in the same process.

The model API serializes that tool list into the prompt as JSON. If the order differs by even one entry between two requests, the rendered prompt differs, the cache key differs, and your hit rate craters. The horror is that your application code is identical between the two requests; the bug is happening inside the framework you don’t control.

The diagnostic for this one is to dump your rendered prompt — the actual string the SDK sends to the API — and diff it between two consecutive requests. If they differ and you didn’t mean them to, you’ve found it.

The fix: sort your tool list deterministically before passing it. Most modern frameworks have started doing this by default; some haven’t. If you can’t tell which group your framework is in, sort defensively.

The 5-minute audit

The three killers above account for the large majority of “my logs say one thing, my bill says another” cases I see. There are minor variations — random number generators inside Jinja templates, UUIDs baked into agent metadata, dev-vs-prod environment strings — but the diagnostic pattern is the same: dump the actual rendered system prompt, scan for dynamic content, move it elsewhere.

Two ways to do the audit. The slow way is to dump your rendered prompt to a file and grep it manually for timestamp formats, UUID patterns, and template-variable syntax. That works but takes thirty minutes if you do it carefully.

The fast way is to paste your rendered prompt into the prompt-cache diagnostic. It scans for eleven specific patterns — every variation of the three killers above, plus less-common ones like unsorted JSON serialization and mixed tab/space indentation — and tells you exactly what to fix. It runs entirely in your browser, so nothing leaves the page. Score 95+ means your prompt should cache cleanly. Below that, the diagnostic surfaces each issue with the specific fix.

If the diagnostic flags problems and you’d rather start from a known-good system prompt than rewrite yours, the template library has six vetted prompts — coding agent, customer support, RAG assistant, structured-output classifier, content drafter, research analyst — each one structured for cache-friendliness. Copy, customize the bracketed bits, audit the result, ship.

The reported number on your dashboard is not the number on your bill.

What the savings actually look like

The founder I mentioned above ran his agent’s system prompt through the diagnostic on the same Friday afternoon. It scored 18 out of 100. Three findings: a timestamp injection from a long-deprecated logging helper, a per-user template variable from an early personalization experiment that nobody documented, and a tool-list order that varied between development and production because of a dependency version mismatch.

He fixed all three in about an hour. The following Monday’s API spend ran 47% lower than the previous Monday on identical traffic. His dashboard’s reported hit rate barely moved — it had always been high. The real one, the one that actually shows up on the invoice, was the one that needed fixing.

The honest version of the story is that the team had been paying for a 30% effective hit rate while believing they had 91% for several months. Nobody had done the audit because nothing was obviously broken. Their dashboards looked fine. The invoice was the only place the truth surfaced, and the invoice told them in aggregate, not in a way that pointed at the cause.

This is the worst kind of infrastructure bug — one that costs real money continuously and doesn’t trigger any alarm. The way you find it is by going looking. If you’re running production AI workloads with any meaningful volume, run the diagnostic on your live rendered system prompt today. Five minutes. The reported number on your dashboard is not the number on your bill, and the gap is usually three lines of templating code.

If the diagnostic flags issues that drop your effective rate — even by 20-30 percentage points — your downstream model choice probably wants a second look. The cost calculator shows what the savings would be on your specific token volume, and the picker can tell you whether the savings justify staying on Opus or whether the new effective price makes Sonnet the right default. Most workloads I’ve audited end up either keeping Opus (because the cache works as intended once cleaned up) or moving to Sonnet (because the cache is genuinely unfixable for their architecture and Sonnet’s headline rate already wins). Either way, the decision should be made on the real cache hit rate, not the reported one.

Five minutes. Run the audit.

About Aditya Marin Gasga

Founding Editor

Aditya covers the whole AI surface area for Signal — frontier models, agent infrastructure, the economics of inference, and the policy decisions that quietly shape what everyone else can build. He writes for operators who need a calibrated view of what's actually shipping versus what's keynote theatre.

  • Founder of Signal; sets the publication's editorial line
  • A decade across product, growth, and AI tooling at venture-backed startups
  • Reads the model release notes, the system cards, and the benchmark papers — and tells you which ones matter
More from Aditya Marin Gasga →