innFactory · pull · Jun 17, 2026 · Jun 17, 2026 · Jun 17, 2026
diff --git a/docs/prompt-cache-benchmark.md b/docs/prompt-cache-benchmark.md
@@ -0,0 +1,126 @@
+# Prompt-cache strategy benchmark
+
+This documents _why_ the default prompt-cache strategy is a **single tail
+breakpoint** anchored on the conversation tail (the Claude Code approach),
+rather than the legacy **"last two user messages"** markers — and how to
+reproduce the comparison live against a real provider.
+
+- Tail strategy: [`addTailCacheControl`](../src/messages/cache.ts) /
+  [`addBedrockTailCacheControl`](../src/messages/cache.ts)
+- Legacy strategy: `addCacheControl` / `addBedrockCacheControl` (still exported
+  for back-compat)
+- Benchmark: [`src/scripts/bench-prompt-cache.ts`](../src/scripts/bench-prompt-cache.ts)
+
+## Why the tail strategy wins
+
+A prompt-cache breakpoint caches everything _before_ it; the provider then
+reads back the longest matching cached prefix on the next call, automatically,
+regardless of where that call's own breakpoints sit.
+
+- **Legacy** pins markers on the **last two user messages**. In an agent tool
+  loop there is often only **one** user message for many turns, so the only
+  breakpoint sits at the top of the conversation. Every assistant/tool turn
+  appended afterwards falls _outside_ the cached prefix and is re-sent
+  **uncached (full price) on every subsequent call**. Cache write/fresh ≫ read.
+- **Tail** rides the true tail, so the transcript is written to cache once and
+  read back as history grows append-only. Freshly appended turns enter the
+  cached prefix on the next call instead of being reprocessed.
+
+This is the dominant agent shape (one request → many tool calls), which is
+exactly where the legacy approach degrades hardest.
+
+### Truncation and compaction
+
+Two harness behaviours mutate the transcript rather than append to it, so they
+deserve explicit treatment:
+
+- **Tool-output truncation** is applied **once, at tool-execution time**
+  ([`ToolNode`](../src/tools/ToolNode.ts) via
+  [`truncateToolResultContent`](../src/utils/truncation.ts)) with a cap derived
+  from the model's **fixed context window**, and the truncated string is what's
+  persisted. It is a pure, deterministic function (covered by
+  [`truncation.test.ts`](../src/utils/__tests__/truncation.test.ts)) and the cap
+  does not vary turn to turn, so a truncated result is a stable block in the
+  prefix — it never re-truncates differently and so never busts the cache.
+- **Compaction (summarization)** replaces the head with a durable summary
+  (`AgentContext.summaryText`, re-injected identically each turn). The
+  compaction event is a one-time cache miss for **any** strategy — the cached
+  prefix genuinely changed. Afterwards the summary is the new stable head and
+  the tail strategy re-establishes append-only caching over the continued
+  transcript. The benchmark's **post-compaction** scenario exercises exactly
+  this transition, and it is one of the largest wins (after compaction the
+  summary is the only user message, so legacy re-sends all continued tool work
+  uncached).
+
+## Metric
+
+For each model call the provider reports a token breakdown. Summed per scenario:
+
+- `read` — tokens served from cache (**higher is better**)
+- `write` — tokens written to cache (`cache_creation`)
+- `fresh` — uncached input processed at full price; this balloons when caching
+  fails to cover the transcript (**lower is better**)
+- `effective` — a cost proxy in input-token-equivalents using the published
+  multipliers: `read ×0.1 + write ×1.25 + fresh ×1.0` (**lower is better**)
+
+`fresh` is computed provider-agnostically as
+`(total_tokens − output_tokens) − cache_read − cache_creation`. This matters
+because the two providers report `input_tokens` differently: Anthropic folds the
+cached tokens _into_ `input_tokens`, while Bedrock reports `input_tokens` as the
+fresh delta only with the cache buckets separate. Deriving `fresh` from
+`total_tokens` is correct on both.
+
+## Representative results
+
+Live, `claude-sonnet-4-5`, `rounds=6` (exact counts vary run to run; the
+direction is stable). `effective` is the headline — lower is cheaper.
+
+### Anthropic
+
+| Scenario                                     | strategy |    read |  write |      fresh |          effective |
+| -------------------------------------------- | -------- | ------: | -----: | ---------: | -----------------: |
+| Agent tool loop (1 user turn, N tool rounds) | legacy   |  92,348 | 23,087 | **44,705** |             82,799 |
+|                                              | tail     | 129,823 | 30,284 |     **33** |  **50,870** (−39%) |
+| Multi-turn chat (frequent user messages)     | legacy   |  90,478 | 23,662 | **21,595** |             60,220 |
+|                                              | tail     | 118,765 | 25,004 |     **18** |  **43,150** (−28%) |
+| Realistic agent (user turns + tool rounds)   | legacy   | 498,344 | 39,514 | **50,635** |            149,862 |
+|                                              | tail     | 545,327 | 43,202 |     **90** | **108,625** (−28%) |
+| Post-compaction (summary head + tool loop)   | legacy   |  69,852 | 40,538 | **63,346** |            121,004 |
+|                                              | tail     | 123,118 | 47,576 |     **42** |  **71,824** (−41%) |
+
+### Bedrock (Converse)
+
+| Scenario                                     | strategy |    read |  write |      fresh |          effective |
+| -------------------------------------------- | -------- | ------: | -----: | ---------: | -----------------: |
+| Agent tool loop (1 user turn, N tool rounds) | legacy   | 122,940 | 24,588 | **21,633** |             64,662 |
+|                                              | tail     | 121,518 | 28,623 |     **33** |  **47,964** (−26%) |
+| Multi-turn chat (frequent user messages)     | legacy   | 119,560 | 25,163 |         18 |             43,428 |
+|                                              | tail     | 104,555 | 22,162 |         18 |  **38,176** (−12%) |
+| Realistic agent (user turns + tool rounds)   | legacy   | 495,826 | 38,003 | **27,538** |            124,624 |
+|                                              | tail     | 545,327 | 43,202 |     **90** | **108,625** (−13%) |
+| Post-compaction (summary head + tool loop)   | legacy   |  96,139 | 35,287 | **22,808** |             76,531 |
+|                                              | tail     | 123,118 | 47,576 |     **42** |   **71,824** (−6%) |
+
+The tail strategy is cheaper (lower `effective`) in **every** scenario on both
+providers (4/4 each). The clearest signal is `fresh`: the legacy approach
+reprocesses tens of thousands of full-price tokens in any tool-bearing
+conversation, which the tail strategy reduces to near zero. Even the legacy
+strong case (frequent user messages, no tools) is a tie-or-win for the tail
+strategy.
+
+## Reproduce
+
+Requires real credentials in `.env` (or point `BENCH_ENV_FILE` at one):
+`ANTHROPIC_API_KEY` for Anthropic, `BEDROCK_AWS_ACCESS_KEY_ID` /
+`BEDROCK_AWS_SECRET_ACCESS_KEY` (and a region) for Bedrock. It makes real, paid
+API calls and is **not** a unit test (CI never runs it).
+
+```bash
+npm run bench:cache                          # Anthropic (default)
+npm run bench:cache -- --provider bedrock     # Bedrock Converse
+npm run bench:cache -- --rounds 10 --model claude-sonnet-4-5
+```
+
+Each scenario runs the _same_ conversation under both strategies in separate
+cache namespaces (unique per run), then prints the per-strategy totals and the
+delta.
diff --git a/package.json b/package.json
@@ -139,6 +139,7 @@
     "tool": "node --trace-warnings -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/tools.ts --provider 'bedrock' --name 'Jo' --location 'New York, NY'",
     "search": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/search.ts --provider 'bedrock' --name 'Jo' --location 'New York, NY'",
     "tool_search": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/tool_search.ts",
+    "bench:cache": "node --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/bench-prompt-cache.ts",
     "subagent": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/multi-agent-subagent.ts",
     "subagent:events": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/subagent-event-driven-debug.ts",
     "subagent:tools": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/subagent-tools-debug.ts",

diff --git a/src/agents/AgentContext.ts b/src/agents/AgentContext.ts
@@ -16,7 +16,7 @@ import {
   Providers,
 } from '@/common';
 import {
-  addCacheControl,
+  addTailCacheControl,
   addCacheControlToStablePrefixMessages,
   cloneMessage,
 } from '@/messages/cache';
@@ -689,7 +689,7 @@ export class AgentContext {
         dynamicTail.length === 0 &&
         body.length >= 2
       ) {
-        body = addCacheControl(body);
+        body = addTailCacheControl(body);
       }
       return [...prefix, ...body];
     }).withConfig({ runName: 'prompt' });

diff --git a/src/agents/__tests__/AgentContext.test.ts b/src/agents/__tests__/AgentContext.test.ts
@@ -274,16 +274,11 @@ describe('AgentContext', () => {
         new HumanMessage('First'),
         new HumanMessage('Second'),
       ]);
-      const firstContent = result[1].content as TestSystemContentBlock[];
       const secondContent = result[2].content as TestSystemContentBlock[];
 
       expect(result).toHaveLength(3);
       expect(result[0].content).toBe('Dynamic only');
-      expect(firstContent[0]).toMatchObject({
-        type: 'text',
-        text: 'First',
-        cache_control: { type: 'ephemeral' },
-      });
+      expect(result[1].content).toBe('First');
       expect(secondContent[0]).toMatchObject({
         type: 'text',
         text: 'Second',
@@ -686,7 +681,7 @@ describe('AgentContext', () => {
       expect(result[8].content).toBe('Now answer without tools');
     });
 
-    it('adds OpenRouter body cache points when there is no dynamic tail', async () => {
+    it('adds a single OpenRouter body cache point on the tail when there is no dynamic tail', async () => {
       const ctx = createBasicContext({
         agentConfig: {
           provider: Providers.OPENROUTER,
@@ -702,9 +697,8 @@ describe('AgentContext', () => {
         new HumanMessage('First'),
         new HumanMessage('Second'),
       ]);
-      const firstContent = result[1].content as TestSystemContentBlock[];
       const secondContent = result[2].content as TestSystemContentBlock[];
-      expect(firstContent[0]).toHaveProperty('cache_control');
+      expect(result[1].content).toBe('First');
       expect(secondContent[0]).toHaveProperty('cache_control');
     });
 

diff --git a/src/graphs/Graph.ts b/src/graphs/Graph.ts
@@ -19,14 +19,14 @@ import {
   convertMessagesToContent,
   sanitizeOrphanToolBlocks,
   extractToolDiscoveries,
-  addBedrockCacheControl,
+  addBedrockTailCacheControl,
   formatArtifactPayload,
   enforceOriginalContentCap,
   formatContentStrings,
   isLegacyConvertible,
   createPruneMessages,
   syncBudgetDerivedFields,
-  addCacheControl,
+  addTailCacheControl,
   getMessageId,
   makeIsDeferred,
   partitionAndMarkAnthropicToolCache,
@@ -1733,35 +1733,6 @@ export class StandardGraph extends Graph<t.BaseGraphState, t.GraphNode> {
         }
       }
 
-      if (agentContext.provider === Providers.ANTHROPIC) {
-        const anthropicOptions = agentContext.clientOptions as
-          | t.AnthropicClientOptions
-          | undefined;
-        if (
-          anthropicOptions?.promptCache === true &&
-          !agentContext.systemRunnable
-        ) {
-          finalMessages = addCacheControl<BaseMessage>(finalMessages);
-        }
-      } else if (agentContext.provider === Providers.BEDROCK) {
-        const bedrockOptions = agentContext.clientOptions as
-          | t.BedrockAnthropicClientOptions
-          | undefined;
-        if (bedrockOptions?.promptCache === true) {
-          finalMessages = addBedrockCacheControl<BaseMessage>(finalMessages);
-        }
-      } else if (agentContext.provider === Providers.OPENROUTER) {
-        const openRouterOptions = agentContext.clientOptions as
-          | t.ProviderOptionsMap[Providers.OPENROUTER]
-          | undefined;
-        if (
-          openRouterOptions?.promptCache === true &&
-          !agentContext.systemRunnable
-        ) {
-          finalMessages = addCacheControl<BaseMessage>(finalMessages);
-        }
-      }
-
       if (
         isThinkingEnabled(agentContext.provider, agentContext.clientOptions)
       ) {
@@ -1783,13 +1754,53 @@ export class StandardGraph extends Graph<t.BaseGraphState, t.GraphNode> {
         );
       }
 
-      // Intentionally broad: runs when the pruner wasn't used OR any post-pruning
-      // transform (addCacheControl, ensureThinkingBlock, etc.) reassigned finalMessages.
-      // sanitizeOrphanToolBlocks fast-paths to a Set diff check when no orphans exist,
-      // so the cost is negligible and this acts as a safety net for Anthropic/Bedrock.
+      // Determine the prompt-cache strategy up front. Two distinct facts:
+      //
+      //   `providerPromptCacheEnabled` — prompt caching is on for this provider
+      //   at all. This drives orphan cleanup, because EVERY cached send must be
+      //   sanitized — including the system-runnable path, where AgentContext (not
+      //   this node) adds the body marker.
+      //
+      //   `willAddTailCache` — THIS node will add the marker itself. Anthropic /
+      //   OpenRouter defer to the system runnable when one owns the system-prompt
+      //   breakpoint, so they exclude that case; Bedrock always marks here.
+      const anthropicPromptCacheEnabled =
+        agentContext.provider === Providers.ANTHROPIC &&
+        (agentContext.clientOptions as t.AnthropicClientOptions | undefined)
+          ?.promptCache === true;
+      const openRouterPromptCacheEnabled =
+        agentContext.provider === Providers.OPENROUTER &&
+        (
+          agentContext.clientOptions as
+            | t.ProviderOptionsMap[Providers.OPENROUTER]
+            | undefined
+        )?.promptCache === true;
+      const bedrockPromptCacheEnabled =
+        agentContext.provider === Providers.BEDROCK &&
+        (
+          agentContext.clientOptions as
+            | t.BedrockAnthropicClientOptions
+            | undefined
+        )?.promptCache === true;
+      const providerPromptCacheEnabled =
+        anthropicPromptCacheEnabled ||
+        openRouterPromptCacheEnabled ||
+        bedrockPromptCacheEnabled;
+
+      // Intentionally broad: runs when the pruner wasn't used, when any
+      // post-pruning transform (ensureThinkingBlock, etc.) reassigned
+      // finalMessages, OR when this is a prompt-cached send. The last clause
+      // matters because the marker is now applied AFTER this gate (and, for the
+      // system-runnable path, in AgentContext entirely): without it, a cached
+      // send whose pruner returned the context unchanged would skip cleanup and
+      // could ship orphaned AI/tool pairs from persisted history.
+      // sanitizeOrphanToolBlocks fast-paths to a Set diff check when no orphans
+      // exist, so the cost is negligible.
       const needsOrphanSanitize =
         anthropicLike &&
-        (!agentContext.pruneMessages || finalMessages !== messagesToUse);
+        (!agentContext.pruneMessages ||
+          finalMessages !== messagesToUse ||
+          providerPromptCacheEnabled);
       if (needsOrphanSanitize) {
         const beforeSanitize = finalMessages.length;
         finalMessages = sanitizeOrphanToolBlocks(finalMessages);
@@ -1809,6 +1820,24 @@ export class StandardGraph extends Graph<t.BaseGraphState, t.GraphNode> {
         }
       }
 
+      // Place the single tail prompt-cache breakpoint LAST, after thinking
+      // normalization and orphan sanitization. ensureThinkingBlockInMessages can
+      // fold a trailing non-thinking AI→Tool chain into a `[Previous agent
+      // context]` HumanMessage whose builder copies text but not cache_control /
+      // cachePoint, and sanitizeOrphanToolBlocks can drop the anchored block — so
+      // marking earlier would let the only breakpoint vanish before the model
+      // call (zero message caching). Anchoring on the final message list keeps
+      // the marker on a block that actually ships. The system-runnable path
+      // adds its body marker in AgentContext, so this node skips it there.
+      if (
+        (anthropicPromptCacheEnabled || openRouterPromptCacheEnabled) &&
+        !agentContext.systemRunnable
+      ) {
+        finalMessages = addTailCacheControl<BaseMessage>(finalMessages);
+      } else if (bedrockPromptCacheEnabled) {
+        finalMessages = addBedrockTailCacheControl<BaseMessage>(finalMessages);
+      }
+
       if (
         agentContext.lastStreamCall != null &&
         agentContext.streamBuffer != null