Key takeaways

  1. Prompt caching is a critical, often overlooked, lever for reducing Opus 4.8 inference costs and latency.
  2. Any deviation in the exact string of a prompt's cached prefix (typically the system prompt) will result in a cache miss.
  3. Avoid dynamic elements like timestamps, random IDs, or user-specific PII in your system prompts.
  4. Structure your system prompts as static, version-controlled templates, passing variable data in the user message.
  5. Consistent formatting (whitespace, JSON structure) across identical prompts is crucial for cache hits.
  6. Disciplined prompt construction is the difference between a cache that mostly misses and one that mostly hits — and Anthropic bills cache hits at one-tenth the price of fresh input tokens.

A team I talked to was paying full price on a system prompt they sent on every request: the same few thousand tokens of instructions, tools, and examples, billed at the base input rate thousands of times a day. They didn’t have a volume problem or a model problem. They had a trailing-timestamp problem: one dynamic line at the end of the system prompt changed the cached prefix on every call, so no request ever matched the one before it. Anthropic bills a cache hit at one-tenth the base input rate (prompt caching docs), so that one line was charging them roughly 10× for the part of the prompt that never actually changed.

That’s the whole game with prompt caching, and the whole trap. The core principle is simple: a cache hit requires an exact match on everything in the cached prefix. Even a single character difference (a trailing space, a different timestamp, a reordered JSON key) results in a cache miss for everything from that point on. Understanding this is the first step to unlocking substantial savings, particularly for applications with repetitive query patterns.

We’ve previously discussed how Opus 4.8’s pricing structure makes token efficiency critical (see our “Opus 4.8 Cost Per Token” piece). Prompt caching directly extends this by making re-computation cheap: a cached prefix is billed at one-tenth the price of the same tokens read fresh — not free, but close enough that re-sending identical context stops being a cost you think about.

What quietly kills your cache

A handful of ordinary prompt-engineering habits look harmless and quietly wreck your hit rate. These are the ones I see most:

1. Dynamic timestamps and random IDs

Including current_datetime or a request_id directly in your system prompt ensures every single request is unique, guaranteeing a cache miss. This is the most common offender.

Bad Example:

import uuid
from datetime import datetime

# This will always miss the cache
system_prompt = f"""
You are a helpful assistant. Current time: {datetime.now().isoformat()}. Request ID: {uuid.uuid4()}.
Your task is to summarize user input.
"""

If you need the current time or an ID for logging or a specific instruction, pass it as part of the user message, or ensure it’s outside the cached prefix.
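One way to apply that advice is sketched below, assuming the system prompt is a module-level constant and the timestamp and request ID travel in the user turn. The names SYSTEM_PROMPT and build_request are hypothetical, and the returned dict only mirrors the general shape of a Messages API request; it makes no network call.

```python
import uuid
from datetime import datetime, timezone

# Static prefix: no timestamps, no IDs, nothing that varies per request.
SYSTEM_PROMPT = (
    "You are a helpful assistant.\n"
    "Your task is to summarize user input."
)

def build_request(user_text: str) -> dict:
    """Return request pieces with all per-request values in the user turn."""
    now = datetime.now(timezone.utc).isoformat()
    request_id = str(uuid.uuid4())
    user_content = (
        f"Current time: {now}\n"
        f"Request ID: {request_id}\n\n"
        f"{user_text}"
    )
    return {
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": user_content}],
    }

first = build_request("Summarize: the meeting moved to Tuesday.")
second = build_request("Summarize: the meeting moved to Tuesday.")
print(first["system"] == second["system"])      # compares the static prefix
print(first["messages"] == second["messages"])  # compares the dynamic user turn
```

The prefix stays byte-identical across both calls while the user turn differs, which is the split the cache needs.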

2. Per-user data in the system prompt

Embedding data that changes per user or per session (e.g., user_id, session_token, user_preferences) into the system prompt makes it unique for each user interaction. This belongs in the user message or as part of a context block.

Bad Example:

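from types import SimpleNamespace

user = SimpleNamespace(id="u_123", language="fr")  # hypothetical stand-in for a real user record

# This makes the system prompt unique per user
system_prompt = f"""
You are a personalized assistant for user {user.id}. Their preferred language is {user.language}.
Respond in a helpful and concise manner.
"""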
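A hedged rewrite of the example above keeps the instructions generic and moves the per-user fields into the user turn. UserProfile, SYSTEM_PROMPT, and build_user_message are hypothetical names, and the <user_profile> tag is just one possible convention for delimiting the data.

```python
from dataclasses import dataclass

# Generic instructions: the same bytes for every user.
SYSTEM_PROMPT = (
    "You are a personalized assistant.\n"
    "The user's profile appears in a <user_profile> block in their message.\n"
    "Respond in the user's preferred language, helpfully and concisely."
)

@dataclass(frozen=True)
class UserProfile:
    id: str
    language: str

def build_user_message(user: UserProfile, text: str) -> str:
    """Put per-user fields in the user turn, after the shared prefix."""
    return (
        f"<user_profile>\n"
        f"id: {user.id}\n"
        f"language: {user.language}\n"
        f"</user_profile>\n\n"
        f"{text}"
    )

alice = UserProfile(id="u_1", language="fr")
bob = UserProfile(id="u_2", language="de")
print(build_user_message(alice, "What's on my calendar today?"))
```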

3. Inconsistent formatting and whitespace

Subtle differences in how you construct your prompt strings, such as varied newline characters, extra spaces, or inconsistent JSON key ordering, will cause cache misses. The cache sees {"a": 1, "b": 2} as different from {"b": 2, "a": 1} or { "a": 1, "b": 2 }.

Recommendation: Use a linter or a canonical serialization method (e.g., json.dumps with sort_keys=True and separators=(',', ':')) for any structured data you include in the system prompt.
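The recommendation can be sketched roughly as follows. The helper names (normalize_text, canonical_json, fingerprint) are illustrative, and the normalization rules are an assumption about what your prompts can tolerate: stripping trailing whitespace is only safe if that whitespace carries no meaning in your prompt.

```python
import hashlib
import json

def normalize_text(text: str) -> str:
    """Normalize line endings and strip trailing whitespace on each line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n") + "\n"

def canonical_json(data) -> str:
    """Serialize with sorted keys and no optional whitespace."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))

def fingerprint(text: str) -> str:
    """Hash the exact bytes that would be sent."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two logically identical dicts serialize to the same string.
a = canonical_json({"a": 1, "b": 2})
b = canonical_json({"b": 2, "a": 1})
print(a, a == b)

# The same instructions saved with different line endings and a stray space.
windows = "Rule one.  \r\nRule two.\r\n"
unix = "Rule one.\nRule two.\n"
print(fingerprint(normalize_text(windows)) == fingerprint(normalize_text(unix)))
```

Running the normalization once, at the point where the prompt is built, is usually simpler than trying to keep every editor and template consistent by hand.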

4. Irrelevant context bloat

Don’t include context that doesn’t directly influence the model’s behavior for the specific system prompt’s instruction. If a piece of context changes frequently but doesn’t alter the core task definition, it’s likely better placed in the user message.

What keeps the hit rate high

Flip each of those around and you get the patterns that work. The throughline: treat your system prompt like version-controlled code, not like a string you rebuild per request.

1. Static, deterministic system prompts

If you do one thing, do this. Your system prompt should be a fixed string for a given application context. It defines the model’s persona, core instructions, and constraints. It should ideally be checked into source control.

Good Example:

import json

# This system prompt is static and highly cacheable
STATIC_SYSTEM_PROMPT = """
You are an expert financial analyst. Your task is to provide concise, factual summaries of earnings reports.
Only use information explicitly provided in the user's input. Do not hallucinate or add external data.
Format your response as a JSON object with 'summary' and 'key_metrics'.
"""

# Dynamic data goes into the user message
user_input = {
    "report_title": "Q1 2026 Earnings for Acme Corp",
    "report_text": "... full earnings report content ..."
}

# Use a consistent serialization for user input
user_message_content = json.dumps(user_input, sort_keys=True, separators=(',', ':'))

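# The static prompt goes in the top-level `system` parameter, marked as a cache breakpoint
system_blocks = [
    {"type": "text", "text": STATIC_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
]

# Only the user message changes per request
messages = [
    {"role": "user", "content": user_message_content}
]

# Pass both to the API: client.messages.create(..., system=system_blocks, messages=messages)

Notice that STATIC_SYSTEM_PROMPT is not part of the messages list. Anthropic’s Messages API has no system role; the system prompt is a top-level system parameter, and a cache_control marker on it tells the API to cache the prefix up to that point. If that content is static, later requests can read it from the cache. If you put dynamic content in it, the prefix changes and every request misses. One caveat: the docs specify a minimum cacheable prompt length, so a prompt as short as this one is illustrative only; real prefixes need to be long enough to qualify.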

2. Structured inputs and outputs

When passing complex information, use structured formats like JSON or YAML. More importantly, use a canonical serialization method. This ensures that the exact string representation is consistent across requests, even if the underlying data structure is logically identical.

import json

def get_canonical_json(data):
    return json.dumps(data, sort_keys=True, separators=(',', ':'))

# ... inside your request logic ...

user_context = {
    "customer_id": "cust_123",
    "recent_orders": [
        {"order_id": "O1", "amount": 100},
        {"order_id": "O2", "amount": 150}
    ]
}

user_message_content = f"""
Here is the customer's context:
{get_canonical_json(user_context)}

Please analyze their recent orders and suggest a relevant upsell.
"""

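system_blocks = [
    {"type": "text", "text": STATIC_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
]
messages = [
    {"role": "user", "content": user_message_content}
]

This ensures that if user_context is identical for two requests, its string representation will also be identical. The static system prompt is the cached prefix; canonical serialization matters most for any structured data that sits inside or ahead of a cache breakpoint, where a reordered key would otherwise turn a hit into a miss.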

3. Dynamic context as a separate message

If you’re using retrieval-augmented generation (RAG) or passing large blocks of dynamic context, structure the request so the static instructions come first and the dynamic context follows in the user message.

The shared prefix can then be read from the cache. The dynamic context is still processed fresh on every request, but the tokens in the prefix are not.

import anthropic
import json

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

STATIC_SYSTEM_PROMPT = """
You are a question-answering bot for a documentation system.
Answer the user's question based *strictly* on the provided documentation snippets.
If the answer is not in the snippets, state that you don't have enough information.
"""

def query_docs(user_query):
    # Simulate RAG retrieval
    retrieved_docs = [
        {"id": "doc1", "content": "Installation requires Python 3.9+."},
        {"id": "doc2", "content": "Authentication uses OAuth2."}
    ]
    return retrieved_docs

user_question = "How do I install the system?"
retrieved_docs = query_docs(user_question)

docs_context = """
<documents>
"""
for doc in retrieved_docs:
    docs_context += f"<doc id=\"{doc['id']}\">{doc['content']}</doc>\n"
docs_context += "</documents>\n"

full_user_message = f"""
{docs_context}

User Question: {user_question}
"""

system_blocks = [
    # Static instructions: the cacheable prefix
    {"type": "text", "text": STATIC_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
]
messages = [
    {"role": "user", "content": full_user_message}  # Dynamic RAG context and question
]

# Example API call (substitute the model ID you actually deploy)
# response = client.messages.create(
#     model="claude-3-opus-20240229",
#     max_tokens=1024,
#     system=system_blocks,
#     messages=messages
# )
# print(response.content[0].text)

In this setup, any request that uses the exact same STATIC_SYSTEM_PROMPT can read that initial segment from the cache, even if the full_user_message (containing docs_context and user_question) is unique.

4. Version-control your prompts

Treat your system prompts like code. Store them in your repository, subject them to review, and version them. This ensures consistency and prevents accidental changes that could invalidate the cache. A simple prompt_v1.txt or prompt_v2.json approach is often sufficient.
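A minimal sketch of one way to enforce this, assuming the prompt lives in a file and a SHA-256 of its bytes is committed alongside it. The load_prompt helper and the pinning scheme are hypothetical, and the temporary directory only stands in for a repository.

```python
import hashlib
import tempfile
from pathlib import Path

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def load_prompt(path: Path, pinned_sha256: str) -> str:
    """Read a prompt file byte-for-byte and refuse to use it if it drifted."""
    raw = path.read_bytes()  # bytes, so no newline translation happens
    actual = fingerprint(raw)
    if actual != pinned_sha256:
        raise ValueError(
            f"{path.name} changed: expected {pinned_sha256[:12]}, got {actual[:12]}"
        )
    return raw.decode("utf-8")

# Demo in a temporary directory standing in for the repository.
with tempfile.TemporaryDirectory() as repo:
    prompt_path = Path(repo) / "prompt_v1.txt"
    prompt_path.write_bytes(b"You are an expert financial analyst.\n")
    pinned = fingerprint(prompt_path.read_bytes())  # normally committed with the file

    prompt = load_prompt(prompt_path, pinned)  # loads cleanly

    # Simulate an accidental edit: one stray space before the newline.
    prompt_path.write_bytes(b"You are an expert financial analyst. \n")
    try:
        load_prompt(prompt_path, pinned)
        drift_detected = False
    except ValueError:
        drift_detected = True

print(repr(prompt), drift_detected)
```

Failing loudly at load time turns a silent drop in hit rate into an error someone sees during review or deploy.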

How to actually measure it

You can’t fix what you don’t watch. Measuring your real hit rate comes down to three things:

  1. Logging: For every API call, log the exact system prompt content (or the initial user message content if that’s your static prefix) and a hash of that content.
  2. Counting: Over a given period (e.g., 24 hours), count how many unique hashes appear versus the total number of requests. The closer the unique count is to 1, the higher your potential hit rate for that prompt template.
  3. Monitoring: Integrate this into your observability stack. A dashboard showing unique_prompt_hashes / total_requests can quickly highlight regressions.
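The three steps above can be sketched as a small offline calculation over logged prefixes. The function names and the sample data are made up for illustration; in practice the input would come from your request logs.

```python
import hashlib
from collections import Counter

def prefix_hash(system_prompt: str) -> str:
    """Short, stable identifier for the exact prefix that was sent."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]

def uniqueness_ratio(logged_prefixes: list[str]) -> float:
    """unique_prompt_hashes / total_requests; lower means more reuse."""
    if not logged_prefixes:
        return 0.0
    hashes = Counter(prefix_hash(p) for p in logged_prefixes)
    return len(hashes) / len(logged_prefixes)

# One template reused 1000 times versus a template that leaks a request number.
stable = ["You are a summarizer."] * 1000
leaky = [f"You are a summarizer. Request {i}." for i in range(1000)]

print(uniqueness_ratio(stable))  # one hash shared by every request
print(uniqueness_ratio(leaky))   # every request has its own hash
```

This ratio is a proxy computed on your side. The usage object on each API response also reports cache_read_input_tokens and cache_creation_input_tokens, which show directly whether a given request read from the cache; tracking both catches cases where the prefix is stable but no hit is recorded for other reasons, such as an expired entry.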

What it’s worth

Go back to that team with the trailing timestamp. A 4,000-token prefix sent 50,000 times a day is 200M input tokens daily. Missing the cache, that’s billed at the full base rate; hitting it, the prefix costs one-tenth as much — a 90% cut on the tokens that never change, plus a faster time-to-first-token on every call. They moved the timestamp into the user message, changed nothing else, and the prefix went from a line item they argued about to one they stopped noticing. That’s the shape of the win: not clever, just disciplined — keep the bytes that repeat byte-identical, and let the cache do the rest.
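The arithmetic can be checked with a short sketch. The price below is a placeholder chosen to make the numbers round, not a quoted rate for any model, and the calculation deliberately ignores the premium charged when a cache entry is first written.

```python
def daily_prefix_cost(prefix_tokens: int, requests_per_day: int,
                      price_per_mtok: float, hit_rate: float,
                      cache_read_multiplier: float = 0.1) -> float:
    """Daily cost of the repeated prefix at a given cache hit rate.

    Simplification: ignores the premium charged when a cache entry is written.
    """
    total_mtok = prefix_tokens * requests_per_day / 1_000_000
    blended = hit_rate * cache_read_multiplier + (1 - hit_rate)
    return total_mtok * price_per_mtok * blended

PLACEHOLDER_PRICE = 10.0  # dollars per million input tokens; NOT a real rate

miss_all = daily_prefix_cost(4_000, 50_000, PLACEHOLDER_PRICE, hit_rate=0.0)
hit_all = daily_prefix_cost(4_000, 50_000, PLACEHOLDER_PRICE, hit_rate=1.0)
print(miss_all, hit_all, 1 - hit_all / miss_all)
```

Whatever the real base rate is, the ratio between the two cases stays the same, because the saving comes from the one-tenth multiplier rather than from the price itself.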

About Aditya Marin Gasga

Founding Editor

Aditya covers the whole AI surface area for Signal — frontier models, agent infrastructure, the economics of inference, and the policy decisions that quietly shape what everyone else can build. He writes for operators who need a calibrated view of what's actually shipping versus what's keynote theatre.

  • Founder of Signal; sets the publication's editorial line
  • A decade across product, growth, and AI tooling at venture-backed startups
  • Reads the model release notes, the system cards, and the benchmark papers — and tells you which ones matter