evals(cua): add a deterministic CUA agent regression task #2188

@yawbtng

why

CUA function-response handling has had two recent bugs (#2046, fixed by #2159, and #2035, with a proposed fix in #2036). Both went undetected because no small, deterministic bench task exercises the CUA agent loop. Today, mode: "cua" is used only inside the WebVoyager/OnlineMind2Web benchmark suites, which are too heavyweight to run as a quick CI signal and too non-deterministic to attribute failures to a specific provider.

So the gap is: there's no fast, fixture-backed regression check that just answers "does the Google/Anthropic CUA function-response loop work end-to-end?"

proposal

Add one new bench task under packages/evals/tasks/bench/agent/:

  • File: packages/evals/tasks/bench/agent/cua_amazon_checkout.ts
  • Fixture site: https://browserbase.github.io/stagehand-eval-sites/sites/amazon/ — already used by act/amazon_add_to_cart.ts, so the page is known-stable.
  • Flow: agent.execute({ instruction: "Add the product to the cart and proceed to checkout", maxSteps: 10 }).
  • Pass criterion: final URL is .../amazon/sign-in.html (same check the existing act task uses).
  • Intended invocation: evals run agent/cua_amazon_checkout --agent-mode cua --model google/gemini-2.5-computer-use-preview-10-2025 (and analogous Anthropic / OpenAI CUA models).

This mirrors the shape of existing tasks such as agent/google_flights.ts, and it adds no new framework surface and no new dependencies.
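For concreteness, here is a rough sketch of the task logic described above. It is not the actual Stagehand or evals API: `AgentLike`, `PageLike`, `reachedSignIn`, and `runCuaAmazonCheckout` are placeholder names, the `{ _success, finalUrl }` return shape is an assumption, and matching on the URL suffix (rather than an exact string comparison) is my guess at how the check might be written. The real file would follow whatever agent/google_flights.ts does.

```typescript
// Hypothetical minimal interfaces for illustration only;
// the real Stagehand agent/page types may differ.
interface AgentLike {
  execute(opts: { instruction: string; maxSteps: number }): Promise<unknown>;
}
interface PageLike {
  goto(url: string): Promise<unknown>;
  url(): string;
}

const FIXTURE_URL =
  "https://browserbase.github.io/stagehand-eval-sites/sites/amazon/";
// Assumption: the pass criterion is expressed as a path suffix.
const EXPECTED_SUFFIX = "/amazon/sign-in.html";

// Pass criterion: after dropping any query string or fragment,
// the final URL ends with the fixture's sign-in page.
function reachedSignIn(finalUrl: string): boolean {
  const withoutQueryOrHash = finalUrl.split(/[?#]/)[0];
  return withoutQueryOrHash.endsWith(EXPECTED_SUFFIX);
}

// Sketch of the task body: load the fixture, let the agent run,
// then judge success purely from the final URL.
async function runCuaAmazonCheckout(
  page: PageLike,
  agent: AgentLike,
): Promise<{ _success: boolean; finalUrl: string }> {
  await page.goto(FIXTURE_URL);
  await agent.execute({
    instruction: "Add the product to the cart and proceed to checkout",
    maxSteps: 10,
  });
  const finalUrl = page.url();
  return { _success: reachedSignIn(finalUrl), finalUrl };
}
```

Keeping the pass check as a pure function of the final URL means it can be unit-tested without a browser or a model call, and a failing run can be attributed to the agent loop rather than to the grader.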

what this catches

ask

@miguelg719 — would you approve a PR of this shape, or is there a different fixture / pass-criterion / file layout you'd prefer? I'll write it as soon as you approve the shape. (For context, I also have #2159 open with a direction question for #2035 — both are CUA function-response fixes so the same task would cover regressions for both.)
