evals(cua): add a deterministic CUA agent regression task #2188

## why

CUA function-response handling has had two recent bugs (#2046 fixed by #2159, #2035 with proposed fix in #2036). Both went undetected because the CUA agent loop is not exercised by any small, deterministic bench task. Today `mode: "cua"` is used only inside the WebVoyager/OnlineMind2Web benchmark suites, which are too heavyweight to run as a quick CI signal and too non-deterministic to attribute failures to a specific provider.

So the gap is: there's no fast, fixture-backed regression check that just answers "does the Google/Anthropic CUA function-response loop work end-to-end?"

## proposal

Add **one** new bench task under `packages/evals/tasks/bench/agent/`:

- **File**: `packages/evals/tasks/bench/agent/cua_amazon_checkout.ts`
- **Fixture site**: `https://browserbase.github.io/stagehand-eval-sites/sites/amazon/` — already used by `act/amazon_add_to_cart.ts`, so the page is known-stable.
- **Flow**: `agent.execute({ instruction: "Add the product to the cart and proceed to checkout", maxSteps: 10 })`.
- **Pass criterion**: final URL is `.../amazon/sign-in.html` (same check the existing act task uses).
- **Intended invocation**: `evals run agent/cua_amazon_checkout --agent-mode cua --model google/gemini-2.5-computer-use-preview-10-2025` (and analogous Anthropic / OpenAI CUA models).

This mirrors the shape of existing agent tasks such as `agent/google_flights.ts` and adds no new framework surface or dependencies.
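To make the proposed shape concrete, here is a rough sketch of the task body and its pass criterion. It is not the real harness API: `AgentLike`, `PageLike`, `TaskResult`, `reachedSignIn`, and `cuaAmazonCheckout` are hypothetical names for illustration, and the actual file would use whatever types and export convention the existing `agent/` tasks use. The URL check matches on the `/amazon/sign-in.html` suffix only, since I'm assuming the existing act task does something equivalent.

```typescript
// Hypothetical minimal interfaces; the real eval harness types may differ.
interface AgentLike {
  execute(opts: { instruction: string; maxSteps: number }): Promise<unknown>;
}

interface PageLike {
  goto(url: string): Promise<unknown>;
  url(): string;
}

interface TaskResult {
  _success: boolean;
  finalUrl?: string;
  error?: string;
}

const FIXTURE_URL =
  "https://browserbase.github.io/stagehand-eval-sites/sites/amazon/";
const EXPECTED_PATH_SUFFIX = "/amazon/sign-in.html";

// Pass criterion: the final URL, ignoring any query string or fragment,
// ends with the sign-in page path.
function reachedSignIn(finalUrl: string): boolean {
  const [withoutQueryOrHash = ""] = finalUrl.split(/[?#]/, 1);
  return withoutQueryOrHash.endsWith(EXPECTED_PATH_SUFFIX);
}

// Task body: open the fixture, let the agent drive, then check the URL.
// Any thrown error (e.g. a provider rejecting a function response) is
// reported as a failure instead of crashing the run.
async function cuaAmazonCheckout(
  agent: AgentLike,
  page: PageLike,
): Promise<TaskResult> {
  try {
    await page.goto(FIXTURE_URL);
    await agent.execute({
      instruction: "Add the product to the cart and proceed to checkout",
      maxSteps: 10,
    });
    const finalUrl = page.url();
    return { _success: reachedSignIn(finalUrl), finalUrl };
  } catch (err) {
    return {
      _success: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Keeping the URL check in its own small function means the pass criterion can be unit-tested without a browser, and the `try`/`catch` turns a provider-side encoding error into a plain task failure that can be attributed to the model under test.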

## what this catches

- Provider-specific image/function-response encoding regressions like #2046 and #2035 (which break only outside the PNG happy-path, so they're invisible to the act-tier tests).
- CUA-mode plumbing failures in `--agent-mode cua` end-to-end (currently only validated transitively via benchmark suites).
- Provider-level differences, without touching `webvoyager`, so it is much faster to run than the full benchmark.

## ask

@miguelg719, would you approve a PR of this shape, or is there a different fixture / pass-criterion / file layout you'd prefer? I'll write it as soon as you approve the shape. (For context, I also have #2036 open with a direction question for #2035. Both #2046 and #2035 are CUA function-response bugs, so the same task would cover regressions for both.)
