ci: reduce unit test flakiness and shard re-run cost #3844
Walkthrough

This PR updates CI infrastructure and test configuration across the codebase. GitHub Actions unit test workflows now disable matrix fail-fast where set, use shallow repository clones (fetch-depth: 1) for unitTests and merge-reports jobs, and pre-pull updated Docker images (redis:7.2, testcontainers/ryuk:0.14.0, plus minio in webapp). Three vitest config files add a CI-only retry setting (2 retries when running under CI, otherwise 0).
A unit-test shard recently failed on a timing race rather than a real regression - a run-engine waitpoint test sleeps 1250ms waiting on a 1000ms timeout that's processed by a ~1000ms worker poll, so on a CPU-starved shard the margin evaporates and the whole matrix goes red. Because `fail-fast` defaults on, that one flake cancels the sibling shards, and the only recovery is re-running the entire matrix "just to be sure" - which is itself slow.

This is the low-risk first pass at that pain:

- `fail-fast: false` on the webapp and internal shard matrices, so one flaky shard no longer cancels its siblings. "Re-run failed jobs" now re-runs just the failed shard instead of the whole matrix.
- `retry: process.env.CI ? 2 : 0` on the timing-sensitive packages (`run-engine`, `redis-worker`, `schedule-engine`). Flakes self-heal in CI; local runs stay at `retry: 0` so flakes still surface in dev. A stopgap until the timing tests are made deterministic.
- `fetch-depth: 1` on the unit-test checkouts - they don't use git history, so the full clone was wasted setup time across ~20 jobs.
- Update the pre-pulled images (`redis:7-alpine` -> `redis:7.2`, `ryuk:0.11.0` -> `ryuk:0.14.0`) and add `minio/minio:latest` to the webapp pre-pull. Otherwise those images pull unauthenticated at test time and risk Docker Hub rate-limit flakes (worst on fork PRs, where the authenticated pre-pull is skipped entirely).

Deeper follow-ups - bigger runners, turbo remote cache, runtime-weighted sharding, and the real root-cause fix (container reuse / template-DB isolation + deterministic timing tests) - are tracked under TRI-10484.
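
For reference, the CI-only retry rule from the list above can be sketched as a small pure function. This is an illustration, not the code in the PR: the name `ciRetryCount` and the explicit `env` parameter are assumptions, chosen so the rule can be checked without touching `process.env`.

```typescript
// Hypothetical helper (not part of the PR): returns the retry count for a test run.
// `env` is passed in explicitly so the rule can be exercised in isolation.
export function ciRetryCount(env: Record<string, string | undefined>): number {
  // Any non-empty CI value enables 2 retries (GitHub Actions sets CI=true).
  // Unset or empty means a local run, which keeps 0 retries so flakes stay visible.
  return env.CI ? 2 : 0;
}

// In a vitest config this value would be wired in roughly as:
//   export default defineConfig({ test: { retry: ciRetryCount(process.env) } });
```

One caveat of the truthiness check: any non-empty string counts as CI, so `CI=false` would still enable retries. That matches the behaviour of the `process.env.CI ? 2 : 0` expression quoted above.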