why
CUA function-response handling has had two recent bugs (#2046, fixed by #2159, and #2035, with a proposed fix in #2036). Both went undetected because no small, deterministic bench task exercises the CUA agent loop. Today mode: "cua" is used only inside the WebVoyager/OnlineMind2Web benchmark suites, which are too heavyweight to run as a quick CI signal and too non-deterministic to attribute failures to a specific provider.
So the gap is: there's no fast, fixture-backed regression check that just answers "does the Google/Anthropic CUA function-response loop work end-to-end?"
proposal
Add one new bench task under packages/evals/tasks/bench/agent/:
- File: packages/evals/tasks/bench/agent/cua_amazon_checkout.ts
- Fixture site: https://browserbase.github.io/stagehand-eval-sites/sites/amazon/ (already used by act/amazon_add_to_cart.ts, so the page is known-stable).
- Flow: agent.execute({ instruction: "Add the product to the cart and proceed to checkout", maxSteps: 10 }).
- Pass criterion: final URL is .../amazon/sign-in.html (same check the existing act task uses).
- Intended invocation: evals run agent/cua_amazon_checkout --agent-mode cua --model google/gemini-2.5-computer-use-preview-10-2025 (and analogous Anthropic / OpenAI CUA models).
This mirrors the shape of existing tasks such as agent/google_flights.ts and adds no new framework surface and no new dependencies.
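To make the proposed shape concrete, here is a rough sketch of the task logic. It is not the final file: PageLike, AgentLike and TaskResult are hypothetical stand-ins for whatever the evals harness actually passes in, the function names are placeholders, and the suffix-based URL check is my assumption about how the "same check" would be expressed. The real task would use the harness types and export conventions already used under packages/evals/tasks/bench/agent/.

```typescript
// Hypothetical minimal stand-ins for the real Stagehand page/agent objects.
// The actual task would receive these from the evals harness.
interface PageLike {
  goto(url: string): Promise<unknown>;
  url(): string;
}

interface AgentLike {
  execute(options: { instruction: string; maxSteps: number }): Promise<unknown>;
}

// Hypothetical result shape; the real harness may expect different fields.
interface TaskResult {
  _success: boolean;
  finalUrl: string;
  error?: string;
}

// Fixture site from the proposal.
const FIXTURE_URL =
  "https://browserbase.github.io/stagehand-eval-sites/sites/amazon/";

// Pass criterion from the proposal: the final URL is .../amazon/sign-in.html.
// Assumption: a trailing query string or fragment is tolerated.
function isCheckoutReached(finalUrl: string): boolean {
  return /\/amazon\/sign-in\.html(?:[?#].*)?$/.test(finalUrl);
}

async function cuaAmazonCheckout(
  page: PageLike,
  agent: AgentLike,
): Promise<TaskResult> {
  try {
    await page.goto(FIXTURE_URL);
    // Same instruction and step budget as listed in the proposal.
    await agent.execute({
      instruction: "Add the product to the cart and proceed to checkout",
      maxSteps: 10,
    });
    const finalUrl = page.url();
    return { _success: isCheckoutReached(finalUrl), finalUrl };
  } catch (err) {
    // A thrown error (e.g. a broken function-response loop) is a failure,
    // not a crash of the whole eval run.
    return { _success: false, finalUrl: page.url(), error: String(err) };
  }
}
```

Keeping the pass check in a small pure function means it can be unit-tested without a browser, and a provider-side exception in the function-response loop shows up as a plain failed task.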
what this catches
- --agent-mode cua end-to-end (currently only validated transitively via benchmark suites).
- Relative to webvoyager: much faster to run than the full benchmark.
ask
@miguelg719, would you approve a PR of this shape, or is there a different fixture / pass-criterion / file layout you'd prefer? I'll write it as soon as you approve the shape. (For context, I also have #2159 open with a direction question for #2035. Both are CUA function-response fixes, so the same task would cover regressions for both.)