126 changes: 126 additions & 0 deletions docs/prompt-cache-benchmark.md
@@ -0,0 +1,126 @@
# Prompt-cache strategy benchmark

This documents _why_ the default prompt-cache strategy is a **single tail
breakpoint** anchored on the conversation tail (the Claude Code approach),
rather than the legacy **"last two user messages"** markers — and how to
reproduce the comparison live against a real provider.

- Tail strategy: [`addTailCacheControl`](../src/messages/cache.ts) /
[`addBedrockTailCacheControl`](../src/messages/cache.ts)
- Legacy strategy: `addCacheControl` / `addBedrockCacheControl` (still exported
for back-compat)
- Benchmark: [`src/scripts/bench-prompt-cache.ts`](../src/scripts/bench-prompt-cache.ts)

## Why the tail strategy wins

A prompt-cache breakpoint caches the prefix up to and including the block it
marks; on the next call the provider automatically reads back the longest
cached prefix that still matches, even though that call's own breakpoint has
moved further down the transcript.
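
For illustration, on Anthropic-style messages a breakpoint is a
`cache_control: { type: 'ephemeral' }` field on a content block, which is the
shape the tests in this change assert on. The sketch below puts a single marker
on the last block of the last message. `markTail`, `Message`, and `TextBlock`
are hypothetical names used only here; this is not the library's
`addTailCacheControl`, whose implementation may differ.

```typescript
// Hypothetical, simplified message shapes (not the library's types).
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}
interface Message {
  role: 'user' | 'assistant';
  content: string | TextBlock[];
}

// Returns a copy of `messages` with one breakpoint on the final block of the
// final message. The input array and its messages are left untouched.
function markTail(messages: Message[]): Message[] {
  if (messages.length === 0) return messages;
  const out: Message[] = messages.map((m) => ({ ...m }));
  const last = out[out.length - 1];
  // A plain string is promoted to a single text block so it can carry a marker.
  const blocks: TextBlock[] =
    typeof last.content === 'string'
      ? [{ type: 'text', text: last.content }]
      : last.content.map((b) => ({ ...b }));
  if (blocks.length === 0) return out;
  blocks[blocks.length - 1] = {
    ...blocks[blocks.length - 1],
    cache_control: { type: 'ephemeral' },
  };
  last.content = blocks;
  return out;
}

const marked = markTail([
  { role: 'user', content: 'First' },
  { role: 'user', content: 'Second' },
]);
console.log(JSON.stringify(marked[1].content));
```

Only the second message is converted to a block array and marked; the first
stays a plain string.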

- **Legacy** pins markers on the **last two user messages**. In an agent tool
loop there is often only **one** user message for many turns, so the only
breakpoint sits at the top of the conversation. Every assistant/tool turn
appended afterwards falls _outside_ the cached prefix and is re-sent
**uncached (full price) on every subsequent call**. Cache writes and fresh
input both cost far more than cache reads.
- **Tail** rides the true tail, so the transcript is written to cache once and
read back as history grows append-only. Freshly appended turns enter the
cached prefix on the next call instead of being reprocessed.

This is the dominant agent shape (one request → many tool calls), which is
exactly where the legacy approach degrades hardest.
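
A toy model makes the difference concrete. It assumes an append-only
transcript, a cache that remembers every prefix that ended at a breakpoint, and
no TTL, minimum-size, or lookback limits. The token sizes and every name here
(`simulate`, `legacyBreakpoints`, `tailBreakpoints`) are invented for
illustration; this is not the benchmark script.

```typescript
type Role = 'user' | 'assistant' | 'tool';
interface Msg { role: Role; tokens: number }
interface Totals { read: number; write: number; fresh: number }

// Legacy: mark the last two user messages.
function legacyBreakpoints(msgs: Msg[]): number[] {
  const idx: number[] = [];
  for (let i = msgs.length - 1; i >= 0 && idx.length < 2; i--) {
    if (msgs[i].role === 'user') idx.push(i);
  }
  return idx;
}

// Tail: mark only the final message.
function tailBreakpoints(msgs: Msg[]): number[] {
  return msgs.length > 0 ? [msgs.length - 1] : [];
}

// One user request followed by `rounds` tool rounds; one model call per step.
function simulate(rounds: number, pick: (msgs: Msg[]) => number[]): Totals {
  const msgs: Msg[] = [{ role: 'user', tokens: 100 }];
  const cachedPrefixes: number[] = []; // prefix lengths (in messages) already written
  const totals: Totals = { read: 0, write: 0, fresh: 0 };
  const sum = (from: number, to: number): number =>
    msgs.slice(from, to).reduce((n, m) => n + m.tokens, 0);

  for (let call = 0; call <= rounds; call++) {
    const marks = pick(msgs).map((i) => i + 1); // prefix length ending at each mark
    const deepest = Math.max(0, ...marks);
    // Longest previously written prefix that fits inside the deepest mark.
    let hit = 0;
    for (const len of cachedPrefixes) {
      if (len <= deepest && len > hit) hit = len;
    }
    totals.read += sum(0, hit);
    totals.write += sum(hit, deepest);
    totals.fresh += sum(deepest, msgs.length);
    cachedPrefixes.push(...marks);
    // The model replies with a tool call and the tool result is appended.
    msgs.push({ role: 'assistant', tokens: 50 }, { role: 'tool', tokens: 200 });
  }
  return totals;
}

console.log(simulate(3, legacyBreakpoints)); // { read: 300, write: 100, fresh: 1500 }
console.log(simulate(3, tailBreakpoints)); // { read: 1050, write: 850, fresh: 0 }
```

In this model the legacy breakpoint never leaves the first message, so `fresh`
grows with every round, while the tail breakpoint turns each round's new turns
into a one-time write followed by reads.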

### Truncation and compaction

Two harness behaviours mutate the transcript rather than append to it, so they
deserve explicit treatment:

- **Tool-output truncation** is applied **once, at tool-execution time**
([`ToolNode`](../src/tools/ToolNode.ts) via
[`truncateToolResultContent`](../src/utils/truncation.ts)) with a cap derived
from the model's **fixed context window**, and the truncated string is what's
persisted. It is a pure, deterministic function (covered by
[`truncation.test.ts`](../src/utils/__tests__/truncation.test.ts)) and the cap
does not vary turn to turn, so a truncated result is a stable block in the
prefix — it never re-truncates differently and so never busts the cache.
- **Compaction (summarization)** replaces the head with a durable summary
(`AgentContext.summaryText`, re-injected identically each turn). The
compaction event is a one-time cache miss for **any** strategy — the cached
prefix genuinely changed. Afterwards the summary is the new stable head and
the tail strategy re-establishes append-only caching over the continued
transcript. The benchmark's **post-compaction** scenario exercises exactly
this transition. On Anthropic it is the largest measured win (−41%
`effective`); on Bedrock the gain is smaller (−6%). After compaction the
summary is the only user message, so legacy re-sends all continued tool work
uncached.
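
The stability argument for truncation can be sketched with a small pure
function. `truncateToCap` and `NOTICE` are hypothetical stand-ins written for
this example; the real `truncateToolResultContent` may use a different
signature, cap unit, and notice format.

```typescript
// Hypothetical marker appended to truncated output.
const NOTICE = '\n[truncated]';

// Pure function of (content, maxChars): no clock, randomness, or shared state,
// so the same tool output always produces the same persisted string.
function truncateToCap(content: string, maxChars: number): string {
  if (content.length <= maxChars) return content;
  const keep = Math.max(0, maxChars - NOTICE.length);
  return content.slice(0, keep) + NOTICE;
}

const raw = 'x'.repeat(1000);
const once = truncateToCap(raw, 100);
console.log(once.length); // 100
console.log(truncateToCap(once, 100) === once); // true
```

Because the result depends only on its inputs and the cap is fixed, the
persisted block is byte-identical on every later call, which is what keeps the
cached prefix valid.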

## Metric

For each model call the provider reports a token breakdown. Summed per scenario:

- `read` — tokens served from cache (**higher is better**)
- `write` — tokens written to cache (`cache_creation`)
- `fresh` — uncached input processed at full price; this balloons when caching
fails to cover the transcript (**lower is better**)
- `effective` — a cost proxy in input-token-equivalents using the published
multipliers: `read ×0.1 + write ×1.25 + fresh ×1.0` (**lower is better**)

`fresh` is computed provider-agnostically as
`(total_tokens − output_tokens) − cache_read − cache_creation`. This matters
because the two providers report `input_tokens` differently: Anthropic folds the
cached tokens _into_ `input_tokens`, while Bedrock reports `input_tokens` as the
fresh delta only with the cache buckets separate. Deriving `fresh` from
`total_tokens` is correct on both.
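
The two formulas can be sketched as follows. The `Usage` and `Breakdown` shapes
and the function names are assumptions made for this example (field names
follow the formula above), not the benchmark's actual types; the multipliers
are the ones quoted in the list.

```typescript
// Hypothetical normalized usage report for one model call.
interface Usage {
  total_tokens: number;
  output_tokens: number;
  cache_read: number;
  cache_creation: number;
}
interface Breakdown { read: number; write: number; fresh: number }

// fresh = (total_tokens - output_tokens) - cache_read - cache_creation
function toBreakdown(u: Usage): Breakdown {
  return {
    read: u.cache_read,
    write: u.cache_creation,
    fresh: u.total_tokens - u.output_tokens - u.cache_read - u.cache_creation,
  };
}

// Cost proxy in input-token-equivalents.
function effective(b: Breakdown): number {
  return b.read * 0.1 + b.write * 1.25 + b.fresh * 1.0;
}

// Signed percentage change of `candidate` relative to `baseline`.
function percentDelta(baseline: number, candidate: number): number {
  return Math.round((candidate / baseline - 1) * 100);
}

// Anthropic "Agent tool loop" rows from the Representative results table.
const legacy = effective({ read: 92348, write: 23087, fresh: 44705 });
const tail = effective({ read: 129823, write: 30284, fresh: 33 });
console.log(Math.round(legacy), Math.round(tail), percentDelta(legacy, tail));
// prints 82799 50870 -39
```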

## Representative results

Live, `claude-sonnet-4-5`, `rounds=6` (exact counts vary run to run; the
direction is stable). `effective` is the headline — lower is cheaper.

### Anthropic

| Scenario | strategy | read | write | fresh | effective |
| -------------------------------------------- | -------- | ------: | -----: | ---------: | -----------------: |
| Agent tool loop (1 user turn, N tool rounds) | legacy | 92,348 | 23,087 | **44,705** | 82,799 |
| | tail | 129,823 | 30,284 | **33** | **50,870** (−39%) |
| Multi-turn chat (frequent user messages) | legacy | 90,478 | 23,662 | **21,595** | 60,220 |
| | tail | 118,765 | 25,004 | **18** | **43,150** (−28%) |
| Realistic agent (user turns + tool rounds) | legacy | 498,344 | 39,514 | **50,635** | 149,862 |
| | tail | 545,327 | 43,202 | **90** | **108,625** (−28%) |
| Post-compaction (summary head + tool loop) | legacy | 69,852 | 40,538 | **63,346** | 121,004 |
| | tail | 123,118 | 47,576 | **42** | **71,824** (−41%) |

### Bedrock (Converse)

| Scenario | strategy | read | write | fresh | effective |
| -------------------------------------------- | -------- | ------: | -----: | ---------: | -----------------: |
| Agent tool loop (1 user turn, N tool rounds) | legacy | 122,940 | 24,588 | **21,633** | 64,662 |
| | tail | 121,518 | 28,623 | **33** | **47,964** (−26%) |
| Multi-turn chat (frequent user messages) | legacy | 119,560 | 25,163 | 18 | 43,428 |
| | tail | 104,555 | 22,162 | 18 | **38,176** (−12%) |
| Realistic agent (user turns + tool rounds) | legacy | 495,826 | 38,003 | **27,538** | 124,624 |
| | tail | 545,327 | 43,202 | **90** | **108,625** (−13%) |
| Post-compaction (summary head + tool loop) | legacy | 96,139 | 35,287 | **22,808** | 76,531 |
| | tail | 123,118 | 47,576 | **42** | **71,824** (−6%) |

The tail strategy is cheaper (lower `effective`) in **every** scenario on both
providers (4/4 each). The clearest signal is `fresh`: the legacy approach
reprocesses tens of thousands of full-price tokens in any tool-bearing
conversation, which the tail strategy reduces to near zero. Even legacy's
strongest case (frequent user messages, no tools) still favours the tail
strategy: `fresh` ties on Bedrock (18 vs 18), yet `effective` drops by 12%
there and by 28% on Anthropic.

## Reproduce

Requires real credentials in `.env` (or point `BENCH_ENV_FILE` at one):
`ANTHROPIC_API_KEY` for Anthropic, `BEDROCK_AWS_ACCESS_KEY_ID` /
`BEDROCK_AWS_SECRET_ACCESS_KEY` (and a region) for Bedrock. It makes real, paid
API calls and is **not** a unit test (CI never runs it).

```bash
npm run bench:cache # Anthropic (default)
npm run bench:cache -- --provider bedrock # Bedrock Converse
npm run bench:cache -- --rounds 10 --model claude-sonnet-4-5
```

Each scenario runs the _same_ conversation under both strategies in separate
cache namespaces (unique per run), then prints the per-strategy totals and the
delta.
1 change: 1 addition & 0 deletions package.json
@@ -139,6 +139,7 @@
     "tool": "node --trace-warnings -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/tools.ts --provider 'bedrock' --name 'Jo' --location 'New York, NY'",
     "search": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/search.ts --provider 'bedrock' --name 'Jo' --location 'New York, NY'",
     "tool_search": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/tool_search.ts",
+    "bench:cache": "node --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/bench-prompt-cache.ts",
     "subagent": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/multi-agent-subagent.ts",
     "subagent:events": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/subagent-event-driven-debug.ts",
     "subagent:tools": "node -r dotenv/config --loader ./tsconfig-paths-bootstrap.mjs --experimental-specifier-resolution=node ./src/scripts/subagent-tools-debug.ts",
4 changes: 2 additions & 2 deletions src/agents/AgentContext.ts
@@ -16,7 +16,7 @@ import {
   Providers,
 } from '@/common';
 import {
-  addCacheControl,
+  addTailCacheControl,
   addCacheControlToStablePrefixMessages,
   cloneMessage,
 } from '@/messages/cache';
@@ -689,7 +689,7 @@ export class AgentContext {
         dynamicTail.length === 0 &&
         body.length >= 2
       ) {
-        body = addCacheControl(body);
+        body = addTailCacheControl(body);
       }
       return [...prefix, ...body];
     }).withConfig({ runName: 'prompt' });
12 changes: 3 additions & 9 deletions src/agents/__tests__/AgentContext.test.ts
@@ -274,16 +274,11 @@ describe('AgentContext', () => {
         new HumanMessage('First'),
         new HumanMessage('Second'),
       ]);
-      const firstContent = result[1].content as TestSystemContentBlock[];
       const secondContent = result[2].content as TestSystemContentBlock[];
 
       expect(result).toHaveLength(3);
       expect(result[0].content).toBe('Dynamic only');
-      expect(firstContent[0]).toMatchObject({
-        type: 'text',
-        text: 'First',
-        cache_control: { type: 'ephemeral' },
-      });
+      expect(result[1].content).toBe('First');
       expect(secondContent[0]).toMatchObject({
         type: 'text',
         text: 'Second',
@@ -686,7 +681,7 @@ describe('AgentContext', () => {
       expect(result[8].content).toBe('Now answer without tools');
     });
 
-    it('adds OpenRouter body cache points when there is no dynamic tail', async () => {
+    it('adds a single OpenRouter body cache point on the tail when there is no dynamic tail', async () => {
       const ctx = createBasicContext({
         agentConfig: {
           provider: Providers.OPENROUTER,
@@ -702,9 +697,8 @@ describe('AgentContext', () => {
         new HumanMessage('First'),
         new HumanMessage('Second'),
       ]);
-      const firstContent = result[1].content as TestSystemContentBlock[];
       const secondContent = result[2].content as TestSystemContentBlock[];
-      expect(firstContent[0]).toHaveProperty('cache_control');
+      expect(result[1].content).toBe('First');
       expect(secondContent[0]).toHaveProperty('cache_control');
     });

101 changes: 65 additions & 36 deletions src/graphs/Graph.ts
@@ -19,14 +19,14 @@ import {
   convertMessagesToContent,
   sanitizeOrphanToolBlocks,
   extractToolDiscoveries,
-  addBedrockCacheControl,
+  addBedrockTailCacheControl,
   formatArtifactPayload,
   enforceOriginalContentCap,
   formatContentStrings,
   isLegacyConvertible,
   createPruneMessages,
   syncBudgetDerivedFields,
-  addCacheControl,
+  addTailCacheControl,
   getMessageId,
   makeIsDeferred,
   partitionAndMarkAnthropicToolCache,
@@ -1733,35 +1733,6 @@ export class StandardGraph extends Graph<t.BaseGraphState, t.GraphNode> {
       }
     }
 
-    if (agentContext.provider === Providers.ANTHROPIC) {
-      const anthropicOptions = agentContext.clientOptions as
-        | t.AnthropicClientOptions
-        | undefined;
-      if (
-        anthropicOptions?.promptCache === true &&
-        !agentContext.systemRunnable
-      ) {
-        finalMessages = addCacheControl<BaseMessage>(finalMessages);
-      }
-    } else if (agentContext.provider === Providers.BEDROCK) {
-      const bedrockOptions = agentContext.clientOptions as
-        | t.BedrockAnthropicClientOptions
-        | undefined;
-      if (bedrockOptions?.promptCache === true) {
-        finalMessages = addBedrockCacheControl<BaseMessage>(finalMessages);
-      }
-    } else if (agentContext.provider === Providers.OPENROUTER) {
-      const openRouterOptions = agentContext.clientOptions as
-        | t.ProviderOptionsMap[Providers.OPENROUTER]
-        | undefined;
-      if (
-        openRouterOptions?.promptCache === true &&
-        !agentContext.systemRunnable
-      ) {
-        finalMessages = addCacheControl<BaseMessage>(finalMessages);
-      }
-    }
-
     if (
       isThinkingEnabled(agentContext.provider, agentContext.clientOptions)
     ) {
@@ -1783,13 +1754,53 @@ export class StandardGraph extends Graph<t.BaseGraphState, t.GraphNode> {
       );
     }
 
-    // Intentionally broad: runs when the pruner wasn't used OR any post-pruning
-    // transform (addCacheControl, ensureThinkingBlock, etc.) reassigned finalMessages.
-    // sanitizeOrphanToolBlocks fast-paths to a Set diff check when no orphans exist,
-    // so the cost is negligible and this acts as a safety net for Anthropic/Bedrock.
+    // Determine the prompt-cache strategy up front. Two distinct facts:
+    //
+    // `providerPromptCacheEnabled` — prompt caching is on for this provider
+    // at all. This drives orphan cleanup, because EVERY cached send must be
+    // sanitized — including the system-runnable path, where AgentContext (not
+    // this node) adds the body marker.
+    //
+    // `willAddTailCache` — THIS node will add the marker itself. Anthropic /
+    // OpenRouter defer to the system runnable when one owns the system-prompt
+    // breakpoint, so they exclude that case; Bedrock always marks here.
+    const anthropicPromptCacheEnabled =
+      agentContext.provider === Providers.ANTHROPIC &&
+      (agentContext.clientOptions as t.AnthropicClientOptions | undefined)
+        ?.promptCache === true;
+    const openRouterPromptCacheEnabled =
+      agentContext.provider === Providers.OPENROUTER &&
+      (
+        agentContext.clientOptions as
+          | t.ProviderOptionsMap[Providers.OPENROUTER]
+          | undefined
+      )?.promptCache === true;
+    const bedrockPromptCacheEnabled =
+      agentContext.provider === Providers.BEDROCK &&
+      (
+        agentContext.clientOptions as
+          | t.BedrockAnthropicClientOptions
+          | undefined
+      )?.promptCache === true;
+    const providerPromptCacheEnabled =
+      anthropicPromptCacheEnabled ||
+      openRouterPromptCacheEnabled ||
+      bedrockPromptCacheEnabled;
+
+    // Intentionally broad: runs when the pruner wasn't used, when any
+    // post-pruning transform (ensureThinkingBlock, etc.) reassigned
+    // finalMessages, OR when this is a prompt-cached send. The last clause
+    // matters because the marker is now applied AFTER this gate (and, for the
+    // system-runnable path, in AgentContext entirely): without it, a cached
+    // send whose pruner returned the context unchanged would skip cleanup and
+    // could ship orphaned AI/tool pairs from persisted history.
+    // sanitizeOrphanToolBlocks fast-paths to a Set diff check when no orphans
+    // exist, so the cost is negligible.
     const needsOrphanSanitize =
       anthropicLike &&
-      (!agentContext.pruneMessages || finalMessages !== messagesToUse);
+      (!agentContext.pruneMessages ||
+        finalMessages !== messagesToUse ||
+        providerPromptCacheEnabled);
     if (needsOrphanSanitize) {
       const beforeSanitize = finalMessages.length;
       finalMessages = sanitizeOrphanToolBlocks(finalMessages);
@@ -1809,6 +1820,24 @@ export class StandardGraph extends Graph<t.BaseGraphState, t.GraphNode> {
       }
     }
 
+    // Place the single tail prompt-cache breakpoint LAST, after thinking
+    // normalization and orphan sanitization. ensureThinkingBlockInMessages can
+    // fold a trailing non-thinking AI→Tool chain into a `[Previous agent
+    // context]` HumanMessage whose builder copies text but not cache_control /
+    // cachePoint, and sanitizeOrphanToolBlocks can drop the anchored block — so
+    // marking earlier would let the only breakpoint vanish before the model
+    // call (zero message caching). Anchoring on the final message list keeps
+    // the marker on a block that actually ships. The system-runnable path
+    // adds its body marker in AgentContext, so this node skips it there.
+    if (
+      (anthropicPromptCacheEnabled || openRouterPromptCacheEnabled) &&
+      !agentContext.systemRunnable
+    ) {
+      finalMessages = addTailCacheControl<BaseMessage>(finalMessages);
+    } else if (bedrockPromptCacheEnabled) {
+      finalMessages = addBedrockTailCacheControl<BaseMessage>(finalMessages);
+    }
+
     if (
       agentContext.lastStreamCall != null &&
       agentContext.streamBuffer != null