JS screenshot suite: ~25% flaky hard-stall at ButtonTheme (worker↔host Document-wrapper degradation) #5145

@shai-almog

Summary

After the host-ref canvas-leak fix (#5143), the JavaScript screenshot suite reaches 94/94 matched in ~75% of CI runs, but the remaining ~25% of runs hard-stall at ButtonThemeScreenshotTest (the first DualAppearanceBaseTest theme test, around suite index 85) and time out (exit 5, SUITE:FINISHED never emitted).

Symptom

The worker goes completely silent immediately after CN1SS:INFO:suite starting test=ButtonThemeScreenshotTest — no further console output, not even other scheduler tasks. Because the cooperative scheduler is single-threaded, total silence means a synchronous loop with no yield (a parked host-call would let other tasks keep logging). The in-worker per-test watchdog (awaitTestCompletion) therefore can't fire, so the whole run times out regardless of CN1SS_ALLOWED_MISSING.
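The reasoning above (silence implies a synchronous loop, not a parked call) can be sketched with a toy cooperative scheduler. This is only an illustration, not the ParparVM scheduler: `runScheduler`, `heartbeat`, `parkedTest` and `wedgedTest` are hypothetical names, and the spin is bounded here so the example terminates.

```javascript
// Toy cooperative scheduler: each task is a generator, and `yield` is the
// only point where control returns to the scheduler.
function runScheduler(taskFns) {
  const log = [];
  const queue = taskFns.map((fn) => fn(log));
  while (queue.length > 0) {
    const task = queue.shift();
    if (!task.next().done) queue.push(task); // not finished: requeue
  }
  return log;
}

// Stand-in for "other scheduler tasks" that keep logging.
function* heartbeat(log) {
  for (let i = 0; i < 3; i++) {
    log.push("heartbeat " + i);
    yield;
  }
}

// A test parked on a host call: it yields while waiting, so others run.
function* parkedTest(log) {
  log.push("test: start");
  yield;
  yield;
  log.push("test: done");
}

// A test stuck in a synchronous loop: no yield between start and done.
// (Bounded here; in the real stall the loop never exits.)
function* wedgedTest(log) {
  log.push("test: start");
  for (let i = 0; i < 1e6; i++) { /* spin without yielding */ }
  log.push("test: done");
}

const parkedLog = runScheduler([parkedTest, heartbeat]);
const wedgedLog = runScheduler([wedgedTest, heartbeat]);

console.log(parkedLog.join(" | "));
// test: start | heartbeat 0 | heartbeat 1 | test: done | heartbeat 2
console.log(wedgedLog.join(" | "));
// test: start | test: done | heartbeat 0 | heartbeat 1 | heartbeat 2
```

In the parked case the heartbeat interleaves with the test; in the wedged case nothing else logs until the loop exits. A watchdog scheduled as just another task is in the same position as `heartbeat`, which is consistent with awaitTestCompletion never firing.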

Root cause (diagnosis)

The stall is inside DualAppearanceBaseTest.installModernThemeIfRequested() → Resources.open(in) while parsing the modern .res, where the stream in comes from the JS port's getBundledAssetAsDataURL asset-read bridge. Deep in the suite, the worker↔host bridge intermittently returns a degraded receiver. This is the chartDocStaleness family, where getDocument() / createElement returns null (or a coerced Number), which re-wraps the Document and wipes its cached __class. A degraded asset stream then wedges Resources.open in a parse loop.

This is the same degradation behind the historical chart-tail issues (partially addressed in 08b1248, 5dce6a2) and is orthogonal to the canvas leak fixed in #5143.
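To illustrate how a degraded stream can wedge a parse loop, here is a minimal sketch. It assumes (this is not confirmed by the diagnosis) that the degraded stream returns a value that is neither a byte nor the EOF marker. `makeReader`, `makeDegradedReader` and `readAll` are hypothetical names, not Codename One or ParparVM APIs.

```javascript
// Healthy reader: returns one byte per call, then -1 at end of stream.
function makeReader(bytes) {
  let pos = 0;
  return {
    read() {
      return pos < bytes.length ? bytes[pos++] : -1;
    },
  };
}

// Assumed failure mode: read() returns neither a byte nor EOF (-1).
function makeDegradedReader() {
  return {
    read() {
      return undefined;
    },
  };
}

// Read loop with a no-progress guard. Without the stall counter, a degraded
// reader would keep this loop spinning synchronously forever.
function readAll(reader, maxStalls = 1000) {
  const out = [];
  let stalls = 0;
  for (;;) {
    const b = reader.read();
    if (b === -1) return out; // clean end of stream
    if (Number.isInteger(b) && b >= 0 && b <= 255) {
      out.push(b);
      stalls = 0; // progress made, reset the guard
      continue;
    }
    stalls++;
    if (stalls >= maxStalls) {
      throw new Error("asset stream made no progress after " + stalls + " reads");
    }
  }
}
```

A guard of this kind would not fix the degradation, but it would turn a silent wedge into an exception that the per-test machinery could report.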

Attempts that BACKFIRED (don't repeat)

  • Pre-warming the modern .res at suite start (read+cache once, before pressure): made it worse — the suite-start Resources.open did host-bridge round-trips that wiped the Document class early, turning the one late ButtonTheme wedge into early animation_grid_failed=NullPointerException (hung at index 11/21/25). Reverted.
  • Trimming the per-test settle (1500→700ms): perturbed animation-grid capture timing → same early grid NPE. Reverted.

Real lever

Fix the worker↔host Document/canvas wrapper degradation itself (in browser_bridge.js / parparvm_runtime.js: the wrapJsObject class-wipe on null re-resolve, and/or the host bridge returning a Number/degenerate value for getDocument/getContext under load). This has resisted multiple prior partial fixes and needs a dedicated investigation.
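As a starting point for that investigation, one possible shape of a guard is sketched below. It is not the actual wrapJsObject code: `isDegradedReceiver` and `rewrapPreservingClass` are hypothetical helpers, and whether holding on to the previous wrapper is safe in the real runtime is an open question.

```javascript
// Treat null/undefined and coerced Numbers as degraded host results
// (the two shapes described in the diagnosis).
function isDegradedReceiver(value) {
  return value === null || value === undefined || typeof value === "number";
}

// Re-resolve a host object without losing the cached class tag.
// `resolveFromHost` stands in for a bridge call such as getDocument().
function rewrapPreservingClass(cachedWrapper, resolveFromHost) {
  const fresh = resolveFromHost();
  if (isDegradedReceiver(fresh)) {
    // Degraded result: keep the previous wrapper (and its __class)
    // instead of re-wrapping around a null or a Number.
    return cachedWrapper;
  }
  if (cachedWrapper && cachedWrapper.__class !== undefined && fresh.__class === undefined) {
    fresh.__class = cachedWrapper.__class; // carry the cached class forward
  }
  return fresh;
}

// Usage: a wrapper with a cached class survives a null re-resolve.
const cachedDoc = { __class: "Document", id: 1 };
const afterNull = rewrapPreservingClass(cachedDoc, () => null);
const afterFresh = rewrapPreservingClass(cachedDoc, () => ({ id: 2 }));
console.log(afterNull === cachedDoc); // true
console.log(afterFresh.__class);      // Document
```

Logging every hit of the degraded branch (receiver type, call name, suite index) might also help establish whether the degradation correlates with load, which the current evidence only suggests.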

Follow-up to #5143.
