This site is not affiliated with, endorsed by, or connected to Anthropic, PBC. Claude and the Claude logo are trademarks of Anthropic. All pricing shown is sourced from public Anthropic documentation. Verify current pricing at claude.com/pricing.
Updated April 2026

Claude Prompt Caching: How It Works and How Much It Saves (2026)

Prompt caching is one of the most effective ways to cut Claude API costs. Here is exactly how it works, what it costs to set up, and worked examples showing real savings.

What Is Prompt Caching?

Prompt caching is an API feature that stores repeated context (like a large system prompt or shared document) in fast-retrieval storage. Instead of paying the full input token rate every time you include that content, cached reads cost 90% less. The trade-off: writing content to the cache costs 25% more than the standard input rate. After that, every read is dramatically cheaper.
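In the Messages API, a block is opted into caching by attaching a cache_control marker to it. Here is a minimal sketch of the request shape as a plain dict (the model identifier is an assumption; field names follow Anthropic's prompt caching documentation):

```python
# Sketch of a cacheable request body for the Messages API.
# The system prompt block is marked with cache_control so it can be
# stored and reused cheaply on subsequent requests.
request = {
    "model": "claude-sonnet-4-6",  # assumed model identifier
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "…large static system prompt (must exceed the minimum cacheable size)…",
            "cache_control": {"type": "ephemeral"},  # marks this block as cacheable
        }
    ],
    # the dynamic user turn below is billed normally and never cached
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```

The first request that sends this payload pays the cache write rate on the system block; identical system blocks sent within the TTL are billed at the cache read rate.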

Prompt Caching Pricing (Sonnet 4.6)

  • Standard input: $3.00 per 1M tokens
  • Cache write (25% premium): $3.75 per 1M tokens (first cache write)
  • Cache read (90% off): $0.30 per 1M tokens (every subsequent read)

Model                 Standard Input   Cache Write   Cache Read
Claude Opus 4.6       $5.00            $6.25         $0.50
Claude Sonnet 4.6     $3.00            $3.75         $0.30
Claude Haiku 3.5      $0.80            $1.00         $0.08

All prices are per 1M tokens.
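Since write and read rates are fixed multiples of the standard rate (1.25x and 0.10x respectively), the whole table can be derived from the standard prices. A small sketch (the model keys here are our own labels, not official API identifiers):

```python
# USD per 1M input tokens, standard rate (from the table above).
STANDARD_RATES = {
    "opus-4.6": 5.00,
    "sonnet-4.6": 3.00,
    "haiku-3.5": 0.80,
}

def cache_rates(model: str) -> dict:
    """Derive cache write (25% premium) and cache read (90% off) rates."""
    std = STANDARD_RATES[model]
    return {
        "standard": std,
        "cache_write": round(std * 1.25, 2),
        "cache_read": round(std * 0.10, 2),
    }
```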

Worked Savings Examples

Customer support app with 10,000-token system prompt

Without caching

1,000 API calls x 10,000 tokens x $3/1M = $30.00

With caching

1 cache write: 10,000 x $3.75/1M = $0.038
999 cache reads x 10,000 x $0.30/1M = $2.997
Total: $3.03

90% saving - from $30.00 to $3.03
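The arithmetic above is easy to reproduce (Sonnet 4.6 rates assumed: $3.00 standard, $3.75 write, $0.30 read, all per 1M tokens):

```python
TOKENS = 10_000   # system prompt size
CALLS = 1_000     # API calls in the billing window

cost_without = CALLS * TOKENS * 3.00 / 1e6      # every call at the full rate
cost_write = TOKENS * 3.75 / 1e6                # first call writes the cache
cost_reads = (CALLS - 1) * TOKENS * 0.30 / 1e6  # 999 calls read the cache
cost_with = cost_write + cost_reads             # ≈ $3.03 vs $30.00 without
```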

Code review app with 50,000-token codebase as context

Without caching

200 reviews x 50,000 tokens x $3/1M = $30.00

With caching

1 cache write: 50,000 x $3.75/1M = $0.19
199 cache reads x 50,000 x $0.30/1M = $2.99
Total: $3.17

89% saving - from $30.00 to $3.17
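Both examples follow the same formula, which generalizes to any prompt size and call count. A sketch (the function name and default Sonnet rates are our own choices):

```python
def caching_savings(tokens: int, calls: int,
                    std: float = 3.00, write: float = 3.75, read: float = 0.30):
    """Return (cost without caching, cost with caching) in USD.

    `tokens` is the size of the cached context; rates are per 1M tokens.
    Assumes the cache stays warm for all `calls - 1` reads.
    """
    cost_without = calls * tokens * std / 1e6
    cost_with = (tokens * write + (calls - 1) * tokens * read) / 1e6
    return cost_without, cost_with
```

Note that for Sonnet the write premium is only $0.75 per 1M tokens while each read saves $2.70 per 1M, so caching pays for itself after a single reuse within the TTL.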

Cache TTL: The 5-Minute Rule

Cache entries have a 5-minute Time To Live (TTL). Each request that uses the cache refreshes the TTL by another 5 minutes. This means:

  • Active apps: As long as requests come in at least once every 5 minutes, the cache stays warm indefinitely.
  • Bursty apps: If you have a gap of more than 5 minutes between requests, the cache expires and the next request pays cache write rate again.
  • Overnight jobs: the Batch API is a better fit than caching - there is no warm cache to maintain across long idle periods.
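The refresh-on-use behavior can be modeled with simple bookkeeping — a toy sketch (class and method names are ours; it makes no API calls):

```python
CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL, refreshed on every use

class CacheClock:
    """Predicts whether the next request will find a warm cache."""

    def __init__(self):
        self.last_hit = None  # timestamp of the last request that used the cache

    def will_be_warm(self, now: float) -> bool:
        return self.last_hit is not None and (now - self.last_hit) <= CACHE_TTL_SECONDS

    def record(self, now: float):
        self.last_hit = now  # each use refreshes the TTL

clock = CacheClock()
clock.record(0.0)
clock.will_be_warm(240.0)  # 4-minute gap: warm, billed at cache read rate
clock.will_be_warm(360.0)  # 6-minute gap: expired, next request pays write rate
```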

Prompt Caching vs Batch API: Which to Use?

Use prompt caching when...

  • You need real-time responses (<1 second)
  • You have a large static system prompt repeated every request
  • Your app has consistent traffic with short gaps between requests
  • You want to cache shared documents, codebases, or few-shot examples

Use Batch API when...

  • Results within 24 hours are acceptable
  • You are processing large volumes at the lowest possible cost (50% off)
  • You run overnight jobs, bulk analysis, or evaluation jobs
  • Your traffic is bursty with long idle periods (the cache would expire)
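These rules of thumb reduce to a tiny decision function — a toy encoding of the lists above (names are ours, not an Anthropic API):

```python
def choose_strategy(needs_realtime: bool, gap_minutes: float) -> str:
    """Pick a cost strategy from the guidance above."""
    if not needs_realtime:
        return "batch"           # 50% off when a 24-hour turnaround is acceptable
    if gap_minutes <= 5:
        return "prompt_caching"  # traffic keeps the 5-minute cache warm
    return "standard"            # cache would expire between requests
```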

Frequently Asked Questions

Does prompt caching work with all Claude models?
Prompt caching is available for Claude Sonnet 4.6, Claude Opus 4.6, and Claude Haiku 3.5 via the API. The pricing discount is consistent across models (90% off cached reads, 25% premium on cache writes). Caching is an API-only feature; it does not apply to subscription plans, which are not billed per token.
What can be cached in Claude prompt caching?
You can cache system prompts, large context documents, few-shot examples, and any other static or slowly-changing content that repeats across requests. Mark the content with the cache_control parameter in your API request. Images and tool definitions can also be cached. Dynamic content (the user's actual message) is not cached - only the static context around it.
Does the cache persist across different user sessions?
No. The cache is per-API-key and not shared across different users or sessions. Each cache entry has a 5-minute TTL that resets on every request that uses that cache. The cache does not persist if you go 5 minutes without any request hitting it. This means caching is most valuable for applications with continuous active usage rather than infrequent requests.
How do I know if caching is working?
Anthropic's API response includes usage metadata that shows how many tokens were served from cache (cache_read_input_tokens) versus charged at standard rates. Monitor this field in your API responses to verify caching is active and measure your actual savings rate.
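For example, the cache-hit rate can be computed from the usage fields (field names follow Anthropic's documentation; the response dict here is mocked):

```python
# Mocked `usage` block from a Messages API response.
usage = {
    "input_tokens": 42,                 # billed at the standard rate
    "cache_creation_input_tokens": 0,   # billed at the cache write rate
    "cache_read_input_tokens": 10_000,  # billed at the 90%-off cache read rate
}

total_input = (usage["input_tokens"]
               + usage["cache_creation_input_tokens"]
               + usage["cache_read_input_tokens"])
cached_fraction = usage["cache_read_input_tokens"] / total_input  # ≈ 0.996
```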
Is there a minimum size for cacheable content?
Yes. Content must reach a minimum block size to be eligible for caching - 1,024 tokens on most models (Haiku models require 2,048). This means very short system prompts (under about 750-800 words) may not benefit from caching. For most production applications with meaningful system prompts or shared context documents, this threshold is easily exceeded.

Related Pages