How Claude Code Uses Prompt Caching During Compaction

#claude-code #compaction #prompt-cache #caching

Source: raw/docs/claude-code-compaction-prompt-cache.md

Summary

Every compaction event risks wiping the server-side KV prompt cache. Claude Code applies six engineering layers to minimize this cost.

Layer 1 — Cache-sharing fork (compact.ts / tengu_compact_cache_prefix): The summarization API call goes through a forked agent that carries the parent's CacheSafeParams (system prompt + tools + model + messages prefix + thinking config). The fork hits the parent's warm cache instead of paying cold cache_creation tokens.
skipCacheWrite: true shifts the cache_control marker to the second-to-last message, so the fork's own summarization turn is never written into the KVCC. Fallback: direct streaming call without cache sharing.

Layer 2 — Exactly one cache_control marker per request (claude.ts / addCacheBreakpoints()): Only one marker per API call (last or second-to-last user/assistant message). Two markers would protect second-to-last KV pages unnecessarily, preventing immediate freeing of local-attention pages.

Layer 3 — Cache TTL stability (claude.ts / should1hCacheTTL() / tengu_prompt_cache_1h_config): TTL (5 min vs 1 h) is latched at session start into bootstrap state and never re-read, even if GrowthBook updates mid-session. Mixing TTLs mid-session busts ~20K tokens. 1h granted to Anthropic internal users and Claude.ai subscribers not on overage.

Layer 4 — Cached Microcompaction (microCompact.ts / CACHED_MICROCOMPACT): Uses cache_edits deletion blocks to remove old tool results server-side without touching message content. Previously-sent blocks are pinned and re-sent on every subsequent turn (server requires this to apply edits on top of cached prefix). Time-based fallback mutates content directly (cache is cold anyway) and calls resetMicrocompactState() to discard stale pinned-edit state.

Layer 5 — Post-compaction baseline reset (promptCacheBreakDetection.ts / notifyCompaction()): After any compaction, prevCacheReadTokens = null resets the break-detection baseline. Without this, the first post-compact response's token drop would be misread as a cache break. notifyCacheDeletion() similarly suppresses expected drops from cached microcompact deletions.

Layer 6 — Cache Break Detection (promptCacheBreakDetection.ts): Two-phase per API call:

Pre-call recordPromptState(): snapshots system prompt hash, tool schema hash, cache_control hash, per-tool hashes, model, fast mode, beta headers, effort value, extra body params, global cache strategy.
Post-call checkResponseForCacheBreak(): compares cache_read_input_tokens to previous turn. Drops <5% or <2,000 tokens ignored as noise. Larger drops with no pending changes → server eviction/TTL. Larger drops with pending changes → explained break. Writes before/after diff to temp file for debugging.

Cache impact by compaction type

Compaction type	Cache impact	Mitigation
Cached microcompact	None	`cache_edits` API + pinned re-sent blocks
Time-based microcompact	Cold (long idle)	Direct content mutation + `resetMicrocompactState()`
Session memory compaction	Replaces message array	`notifyCompaction()` resets baseline
Full LLM compaction	Replaces message array	Fork reuses parent cached prefix + `notifyCompaction()`
Partial compaction (`from`)	Prefix preserved	No bust for kept portion
Partial compaction (`up_to`)	Cache invalidated	`notifyCompaction()` resets baseline