Prompt caching — Physea Wiki

Prompt caching saves the unchanging front of a prompt so later requests reuse it instead of paying to reprocess it. Cached input reads can cost a small fraction of the normal input price, often around a tenth.

Many applications send the same large block of text over and over: a long system prompt, a set of instructions, a reference document. Reprocessing that block on every request is wasted work, and you pay input rates for it each time. Prompt caching fixes that by storing the result of processing the unchanging front of a prompt, so later requests reuse it.

Caching works on a prefix match: the cached part has to be the exact same bytes from the very start of the prompt. Anthropic’s docs note that a cache hit “require[s] 100% identical prompt segments” up to the cached point.^[1] One change near the beginning — a timestamp, a reordered line — invalidates everything after it. So the design rule is to put stable content first and volatile content (the actual question, per-request IDs) last.

The savings are large. A cache read is billed at a small fraction of the base input price — Anthropic prices reads at 0.1 times the normal input rate, with a one-time write that costs 1.25 times the base price for a 5-minute cache or 2 times for a 1-hour cache.^[1] OpenAI’s caching is automatic, kicks in for prompts of 1,024 tokens or longer, and can cut input cost by up to 90%.^[2] Caches are short-lived by design — OpenAI keeps entries warm for a few minutes of inactivity up to about an hour^[2] — so caching pays off most when the same prefix is reused in quick succession.

Prompt caching docs

Anthropic prompt caching ↗
Developer-controlled caching with cache_control breakpoints, 5-minute and 1-hour TTLs.
OpenAI prompt caching ↗
Automatic prefix caching for prompts of 1,024 tokens or longer, no code changes.

References

Prompt caching — Anthropic
Prompt caching — OpenAI

How does prompt caching cut cost?

Prompt caching docs

References