feat: schema-validated router labels via standard-schema #3748

Open
B4nan wants to merge 5 commits into v4 from feat/typed-router-schema-validation

B4nan (Member) commented Jun 16, 2026

Makes the router's request label drive the type of request.userData end-to-end — from declaring routes, through the request handler, all the way to the methods that enqueue new requests. Targets v4.

The compile-time-only subset (typed route map for the router) was split out and already merged to master in #3747. This PR is the full v4 version: it adds runtime schema validation and propagates the route map to the crawler/context request methods.

1. Typed route map (compile-time)

Declare a label → userData map and pass it as the router's second type argument. Handlers get request.userData typed per label, and unknown labels are a compile error. Fully backwards compatible — the default route map is open (Record<string, …>), so existing untyped usage and the legacy flat-userData generic keep working.

interface Routes {
    PRODUCT: { sku: string; price: number };
    CATEGORY: { categoryId: string };
}

const router = createCheerioRouter<CheerioCrawlingContext, Routes>();

router.addHandler('PRODUCT', async ({ request }) => {
    request.userData.sku; // string
    request.userData.price; // number
});

router.addHandler('TYPO', async () => {}); // ❌ not a known label

The default handler (addDefaultHandler) is a fallback for any request, so its userData stays loosely typed.
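
To illustrate how a label → userData map can constrain `addHandler` at compile time, here is a minimal, dependency-free sketch. `SketchRouter`, `SketchRequest`, and `dispatch` are hypothetical names invented for this example and are not Crawlee's implementation; handlers are kept synchronous only to keep the sketch short.

```typescript
// Hypothetical stand-in for a request; only the fields this sketch needs.
interface SketchRequest<UserData> {
    url: string;
    label?: string;
    userData: UserData;
}

type Handler<UserData> = (ctx: { request: SketchRequest<UserData> }) => void;

// Open default map: any string label, loosely typed userData.
type OpenRoutes = Record<string, Record<string, unknown>>;

class SketchRouter<Routes = OpenRoutes> {
    private readonly handlers = new Map<string, Handler<any>>();
    private defaultHandler?: Handler<Record<string, unknown>>;

    // K is limited to the declared labels, so an unknown label does not compile.
    addHandler<K extends keyof Routes & string>(label: K, handler: Handler<Routes[K]>): void {
        this.handlers.set(label, handler);
    }

    // The fallback can receive any request, so its userData stays loose.
    addDefaultHandler(handler: Handler<Record<string, unknown>>): void {
        this.defaultHandler = handler;
    }

    dispatch(request: SketchRequest<Record<string, unknown>>): void {
        const byLabel = request.label === undefined ? undefined : this.handlers.get(request.label);
        const handler = byLabel ?? this.defaultHandler;
        if (!handler) throw new Error(`No handler for label ${request.label}`);
        handler({ request });
    }
}

interface Routes {
    PRODUCT: { sku: string; price: number };
    CATEGORY: { categoryId: string };
}

const router = new SketchRouter<Routes>();
const seen: string[] = [];

router.addHandler('PRODUCT', ({ request }) => {
    // sku is typed as string and price as number here.
    seen.push(`${request.userData.sku}:${request.userData.price.toFixed(2)}`);
});
router.addDefaultHandler(({ request }) => {
    seen.push(`default:${request.url}`);
});
// router.addHandler('TYPO', () => {}); // would not compile: 'TYPO' is not a key of Routes

router.dispatch({ url: 'https://example.com/p/1', label: 'PRODUCT', userData: { sku: 'A1', price: 9.5 } });
router.dispatch({ url: 'https://example.com/about', userData: {} });
console.log(seen); // ['A1:9.50', 'default:https://example.com/about']
```

Leaving `Routes` unconstrained with an open default is what lets `new SketchRouter()` accept any string label in this model, mirroring the backwards-compatibility claim above.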

2. Schema-validated labels (types + runtime validation)

Passing a per-label Standard Schema (Zod, Valibot, ArkType, …) both infers the userData types and validates them at runtime before the handler runs, replacing request.userData with the parsed value. Invalid userData throws a new non-retryable RequestValidationError (the data won't change between attempts) carrying the schema issues.

import { z } from 'zod';

const router = createCheerioRouter({
    PRODUCT: z.object({ sku: z.string(), price: z.coerce.number() }),
    CATEGORY: z.object({ categoryId: z.string() }),
});

router.addHandler('PRODUCT', async ({ request }) => {
    request.userData.price; // number — inferred from the schema, coerced/validated at runtime
});
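
For readers unfamiliar with Standard Schema, the sketch below shows what "validate before the handler runs and replace userData with the parsed value" can look like against the `~standard` interface. The interface is inlined so the example has no dependencies. `parseUserDataSync`, `SketchValidationError`, and the hand-written `productSchema` are hypothetical: the real `RequestValidationError` and the router's internal wiring are not shown in this PR text, and this sketch handles synchronous schemas only.

```typescript
// Minimal inline copy of the Standard Schema v1 shape (validation parts only).
interface StandardIssue {
    readonly message: string;
    readonly path?: ReadonlyArray<PropertyKey | { readonly key: PropertyKey }>;
}

type StandardResult<Output> =
    | { readonly value: Output; readonly issues?: undefined }
    | { readonly issues: ReadonlyArray<StandardIssue> };

interface StandardSchema<Output> {
    readonly '~standard': {
        readonly version: 1;
        readonly vendor: string;
        readonly validate: (value: unknown) => StandardResult<Output> | Promise<StandardResult<Output>>;
    };
}

// Hypothetical stand-in for the PR's RequestValidationError.
class SketchValidationError extends Error {
    readonly label: string;
    readonly issues: ReadonlyArray<StandardIssue>;

    constructor(label: string, issues: ReadonlyArray<StandardIssue>) {
        super(`Invalid userData for label "${label}": ${issues.map((issue) => issue.message).join('; ')}`);
        this.name = 'SketchValidationError';
        this.label = label;
        this.issues = issues;
    }
}

// Returns the parsed value, or throws with the schema issues attached.
function parseUserDataSync<Output>(label: string, schema: StandardSchema<Output>, userData: unknown): Output {
    const result = schema['~standard'].validate(userData);
    if (result instanceof Promise) {
        throw new TypeError('This sketch only supports synchronous schemas');
    }
    if (result.issues) {
        throw new SketchValidationError(label, result.issues);
    }
    return result.value;
}

type Product = { sku: string; price: number };

// Hand-written schema that coerces price to a number, loosely similar to z.coerce.number().
const productSchema: StandardSchema<Product> = {
    '~standard': {
        version: 1,
        vendor: 'sketch',
        validate: (value) => {
            const input = (typeof value === 'object' && value !== null ? value : {}) as Record<string, unknown>;
            const issues: StandardIssue[] = [];
            if (typeof input.sku !== 'string') issues.push({ message: 'sku must be a string', path: ['sku'] });
            const price = Number(input.price);
            if (input.price === undefined || Number.isNaN(price)) {
                issues.push({ message: 'price must be numeric', path: ['price'] });
            }
            if (issues.length > 0) return { issues };
            return { value: { sku: input.sku as string, price } };
        },
    },
};

const parsed = parseUserDataSync('PRODUCT', productSchema, { sku: 'A1', price: '20' });
console.log(typeof parsed.price); // 'number'
```

Because the handler would receive the parsed value rather than the raw input, coercions such as the string-to-number conversion above become visible to handler code.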

3. Propagation to crawler & context request methods

When a typed router is used, the route map now also types the request inputs: providing a declared label requires the matching userData shape (and rejects unknown labels); unlabeled requests keep loose userData (they hit the default handler).

  • Handler context: ctx.addRequests / ctx.enqueueLinks are typed from the route map. Driven by the router itself, so it works for every crawler type.
  • Crawler instance: Routes is inferred from the requestHandler option and types crawler.addRequests / crawler.run. Threaded through all crawler classes (Basic/Http/Cheerio/JSDOM/LinkeDOM and the browser family: Browser/Playwright/Puppeteer/Stagehand/AdaptivePlaywright).

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.addRequests([{ url, label: 'PRODUCT', userData: { sku: 's', price: 1 } }]); // ✅
await crawler.addRequests([{ url, label: 'PRODUCT', userData: { sku: 1 } }]);             // ❌ wrong userData
await crawler.addRequests([{ url, label: 'NOPE' }]);                                      // ❌ unknown label
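
One way to model a label-discriminated request input is a mapped type that is indexed back into a union, as sketched below. `Labeled`, `Unlabeled`, `RequestInput`, and `pushRequests` are hypothetical names; the PR's actual `LabeledSource` / `TypedRequestsLike` definitions are not shown here and may differ.

```typescript
interface Routes {
    PRODUCT: { sku: string; price: number };
    CATEGORY: { categoryId: string };
}

// One variant per declared label, each requiring that label's userData shape.
type Labeled<R> = {
    [K in keyof R & string]: { url: string; label: K; userData: R[K] };
}[keyof R & string];

// Requests without a label keep loose userData (they would reach the default handler).
type Unlabeled = { url: string; label?: undefined; userData?: Record<string, unknown> };

type RequestInput<R> = Labeled<R> | Unlabeled;

// Hypothetical helper standing in for crawler.addRequests / ctx.addRequests.
function pushRequests<R>(queue: RequestInput<R>[], requests: RequestInput<R>[]): number {
    queue.push(...requests);
    return queue.length;
}

const queue: RequestInput<Routes>[] = [];

pushRequests<Routes>(queue, [
    { url: 'https://example.com/p/1', label: 'PRODUCT', userData: { sku: 's', price: 1 } },
    { url: 'https://example.com/' }, // unlabeled: loose userData
]);

// Both of the following are rejected by the compiler:
// pushRequests<Routes>(queue, [{ url: 'https://example.com/p/2', label: 'PRODUCT', userData: { sku: 1 } }]);
// pushRequests<Routes>(queue, [{ url: 'https://example.com/x', label: 'NOPE' }]);

console.log(queue.length); // 2
```

Since `label` is a literal in every labeled variant, TypeScript uses it as the discriminant and checks `userData` against the matching route only.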

How

  • New RouteSchemas / RoutesFromSchemas (via StandardSchemaV1.InferOutput); Router.create and the createXRouter factories gain overloads for the route map, the legacy flat-userData form, and the schema map (disambiguated by value type). Adds the tiny, types-only @standard-schema/spec dep to @crawlee/core (zod is already a core dep on v4 and implements the interface).
  • LabeledSource / TypedRequestsLike build a label-discriminated request-input type; a signature-preserving transform retypes enqueueLinks's label/userData without changing argument optionality or return type (which differ per crawler).
  • A Routes generic is inferred from requestHandler and threaded through the crawler-class hierarchy.

All of it is backwards compatible via the open route-map default.
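
As a rough illustration of deriving a route map from a schema map, the sketch below infers each label's userData type from the schema's type-only `types.output` carrier, which is also how `StandardSchemaV1.InferOutput` is defined in the spec. The `Schema` interface is a simplified, synchronous-only inline copy, and `fromGuard`, `parseFor`, and this `RoutesFromSchemas` definition are assumptions made for the example rather than the PR's code.

```typescript
// Simplified, synchronous-only copy of the Standard Schema shape.
interface Schema<Output = unknown> {
    readonly '~standard': {
        readonly version: 1;
        readonly vendor: string;
        readonly validate: (value: unknown) => { value: Output } | { issues: ReadonlyArray<{ message: string }> };
        // Type-only carrier, never set at runtime; InferOutput reads it.
        readonly types?: { readonly input: unknown; readonly output: Output };
    };
}

type InferOutput<S extends Schema> = NonNullable<S['~standard']['types']>['output'];

// Label -> schema map becomes label -> userData map.
type RoutesFromSchemas<S extends Record<string, Schema>> = { [K in keyof S]: InferOutput<S[K]> };

// Hypothetical helper: builds a schema from a type guard.
function fromGuard<T>(guard: (value: unknown) => value is T, message: string): Schema<T> {
    return {
        '~standard': {
            version: 1,
            vendor: 'sketch',
            validate: (value) => (guard(value) ? { value } : { issues: [{ message }] }),
        },
    };
}

const hasString = (value: unknown, key: string): boolean =>
    typeof value === 'object' && value !== null && typeof (value as Record<string, unknown>)[key] === 'string';

const schemas = {
    PRODUCT: fromGuard(
        (v: unknown): v is { sku: string; name: string } => hasString(v, 'sku') && hasString(v, 'name'),
        'expected { sku, name }',
    ),
    CATEGORY: fromGuard((v: unknown): v is { categoryId: string } => hasString(v, 'categoryId'), 'expected { categoryId }'),
};

// Resolves to { PRODUCT: { sku: string; name: string }; CATEGORY: { categoryId: string } }.
type Routes = RoutesFromSchemas<typeof schemas>;

// Validates data for one label and returns it with the inferred type.
function parseFor<S extends Record<string, Schema>, K extends keyof S & string>(
    all: S,
    label: K,
    data: unknown,
): RoutesFromSchemas<S>[K] {
    const result = all[label]['~standard'].validate(data);
    if ('issues' in result) throw new Error(result.issues.map((issue) => issue.message).join('; '));
    return result.value as RoutesFromSchemas<S>[K];
}

const product: Routes['PRODUCT'] = parseFor(schemas, 'PRODUCT', { sku: 'A1', name: 'Lamp' });
console.log(product.name.toUpperCase()); // 'LAMP'
```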

Relates to #3082. The remaining parts of that issue (typed pushData / key-value store) are a separate axis and are not in scope here.

🤖 Generated with Claude Code

Introduce per-label typing of `request.userData` for the router, in two
layers:

- A `label -> userData` map can be passed as the router's `Routes` type
  argument, typing `request.userData` per label and rejecting unknown
  labels at compile time. Backwards compatible (default is an open map).
- A per-label Standard Schema map (Zod, Valibot, ArkType, …) passed to
  `Router.create`/`createXRouter` both infers the `userData` types and
  validates them at runtime before the handler runs, replacing
  `request.userData` with the parsed value. Invalid requests throw a new
  non-retryable `RequestValidationError`.

Adds the types-only `@standard-schema/spec` dependency to `@crawlee/core`.

Relates to #3082

B4nan added the adhoc label (Ad-hoc unplanned task added during the sprint.) Jun 16, 2026

B4nan added 4 commits June 18, 2026 19:06

- default handler keeps `request.userData` loosely typed (it is a fallback
  for any request, including labels not in the route map)
- split factory/`Router.create` into explicit overloads (route map vs legacy
  flat userData) for backwards compatibility, keeping the schema overload
- drop the exported `RouteMap` alias (referenced as prose in docs instead)

When a typed router is used, the route map now also types the request inputs:

- handler context: `ctx.addRequests` and `ctx.enqueueLinks` require the
  `userData` shape matching the request's `label` (and reject unknown labels);
  this is driven by the router, so it works for every crawler type.
- crawler instance: `Routes` is inferred from the `requestHandler` option and
  used to type `crawler.addRequests`/`crawler.run` for the HTTP-based crawlers
  (Basic/Http/Cheerio/JSDOM/LinkeDOM).

Unlabeled requests keep loose `userData` (they hit the default handler). All
typing is backwards compatible via the open-map default.

Note: crawler-instance `addRequests` typing is not yet wired for the browser
crawlers (Playwright/Puppeteer/Stagehand) — their `requestHandler` is redefined
in BrowserCrawlerOptions which breaks generic inference through the hierarchy;
their handler-context methods are still fully typed via the router.

Threads the `Routes` generic through the browser crawler hierarchy
(`BrowserCrawler` + Playwright/Puppeteer/Stagehand/AdaptivePlaywright) so
`crawler.addRequests`/`run` are route-typed for them as well, completing the
propagation started for the HTTP-based crawlers.

The missing piece was that `PlaywrightCrawlerOptions`/`StagehandCrawlerOptions`
redefined `requestHandler` without the route-aware `RouterHandler` overload,
shadowing the one on `BrowserCrawlerOptions` and preventing `Routes` inference.
`Routes` is also placed before the internal `__Browser*` generics on
`BrowserCrawler(Options)` so subclasses can forward it positionally.

The route-aware requestHandler union paired `RouterHandler<ExtendedContext>`
with `StagehandRequestHandler` (which wraps the context in `LoadedContext`).
The two union members had different call-signature parameter types, so TS
could not contextually type an inline `requestHandler({ page, request, ... })`
and reported the params as implicit `any` (breaking the Stagehand guide
examples). Use `RequestHandler<ExtendedContext>` for the plain member so both
signatures are identical; `StagehandCrawlingContext` already carries a
`LoadedRequest`, so nothing is lost.
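
The contextual-typing behaviour described in this commit can be reproduced in isolation. In the sketch below, `Ctx`, `Loaded`, and `Aligned` are hypothetical simplifications of the real context and handler types; the mismatched variant is left as a comment because it fails to compile under `noImplicitAny`.

```typescript
type Ctx = { url: string };
type Loaded<T> = T & { loadedUrl: string };

// Mismatched variant, for comparison: the two members take different parameter types,
// so an inline callback gets no contextual signature and its parameter is implicitly `any`.
//
//   type Mismatched = ((ctx: Ctx) => void) | ((ctx: Loaded<Ctx>) => void);
//   const broken: Mismatched = ({ url }) => {};

// Aligned variant: parameter types are identical and only the return types differ,
// so TypeScript can contextually type the inline callback's parameter.
type Aligned = ((ctx: Loaded<Ctx>) => void) | ((ctx: Loaded<Ctx>) => Promise<void>);

const visited: string[] = [];

const handler: Aligned = ({ url, loadedUrl }) => {
    // url and loadedUrl are both inferred as string here.
    visited.push(`${url} -> ${loadedUrl}`);
};

handler({ url: 'https://example.com', loadedUrl: 'https://example.com/home' });
console.log(visited); // ['https://example.com -> https://example.com/home']
```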