Skip to content

feat(metrics): emit workflow execution and per-block metrics to CloudWatch#4931

Open
TheodoreSpeaks wants to merge 4 commits into
stagingfrom
feat/workflow-metrics-cloudwatch
Open

feat(metrics): emit workflow execution and per-block metrics to CloudWatch#4931
TheodoreSpeaks wants to merge 4 commits into
stagingfrom
feat/workflow-metrics-cloudwatch

Conversation

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator

Summary

  • generalize lib/monitoring/metrics.ts into a namespace-aware CloudWatch emitter (per-namespace PutMetricData batching, same buffer/flush/exit-drain machinery); hostedKeyMetrics API unchanged
  • new Sim/Workflow namespace: ExecutionStarted, ExecutionCompleted (Trigger/Status dims), ExecutionDuration, ExecutionPaused, BlockExecuted (BlockType/Operation/Success dims), BlockDuration
  • emit lifecycle metrics from LoggingSession (all completion paths incl. cost-only fallback and markAsFailed), guarded against double-counting; pause is tracked separately from terminal completions
  • emit per-block metrics from BlockExecutor at blockLog finalization (success + error paths); Operation dim is regex-guarded against dynamic values exploding cardinality
  • cleanup-stale-executions cron now counts crashed-worker executions as failed completions (only place those can be recorded)
  • add GRAFANA_DEPLOYMENT_ENVIRONMENT to the Environment dimension fallback chain so staging workers stop tagging metrics as production

Type of Change

  • New feature

Testing

Unit tests for the emitter (namespace grouping, buffer drain, dimension shapes, failure drop) and LoggingSession metric emission (start/success/failed/cancelled/paused, dedup across fallback/markAsFailed). bun run lint:check, tsc --noEmit, and check:api-validation:strict pass.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 9, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 9, 2026 11:45pm

Request Review

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

@cursor

cursor Bot commented Jun 9, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Touches execution logging completion paths and the stale-execution cron; incorrect dedup or race handling could skew error-rate metrics, but changes are guarded and mostly additive telemetry.

Overview
Extends CloudWatch application metrics with a Sim/Workflow namespace and wires emission through execution lifecycle and block runs.

lib/monitoring/metrics.ts is generalized to buffer and flush per namespace (Sim/HostedKey unchanged API; flushMetrics replaces the hosted-key-only flush). New workflowMetrics: ExecutionStarted, ExecutionCompleted (+ optional ExecutionDuration), ExecutionPaused, and per-block BlockExecuted / BlockDuration with low-cardinality dimensions. GRAFANA_DEPLOYMENT_ENVIRONMENT is added to the Environment dimension fallback so staging workers are not labeled as production.

LoggingSession records start (not on resume), terminal outcomes (success/failed/cancelled), pause separately, and cost-only fallback paths, with guards so completions are not double-counted; markExecutionAsFailed emits failed only when transitioning from running/pending.

BlockExecutor emits block metrics on success and error; the Operation dimension is regex-limited to avoid cardinality blowups.

The stale-execution cleanup cron only force-fails rows still running (avoids racing a finishing worker) and records ExecutionCompleted failed for worker crashes that never hit LoggingSession.

Reviewed by Cursor Bugbot for commit 83bfcbd. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/lib/logs/execution/logging-session.ts Outdated
Comment thread apps/sim/app/api/cron/cleanup-stale-executions/route.ts
Comment thread apps/sim/lib/logs/execution/logging-session.ts
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR generalises lib/monitoring/metrics.ts into a namespace-aware CloudWatch emitter and wires workflow execution and per-block metrics throughout the executor, LoggingSession, and the stale-execution cleanup cron.

  • New Sim/Workflow namespace adds ExecutionStarted, ExecutionCompleted (with Trigger/Status dimensions), ExecutionDuration, ExecutionPaused, BlockExecuted (with BlockType/Operation/Success), and BlockDuration metrics; cardinality is guarded by an operation regex and the existing low-cardinality dimension convention.
  • LoggingSession emits metrics across all terminal paths (success, error, cancellation, pause, cost-only fallback, markAsFailed) with a completionMetricEmitted boolean to prevent double-counting; the cleanup cron fills the gap for crashed workers that never reach a LoggingSession completion path.
  • hostedKeyMetrics API is unchanged; the refactor only adds the namespace parameter to the internal enqueue function and renames flushHostedKeyMetricsflushMetrics.

Confidence Score: 5/5

Safe to merge; all metric calls are fire-and-forget with errors caught and dropped, so no request-path impact.

The new metric paths are fully additive to existing logic and cannot break execution correctness. The completionMetricEmitted guard and the cron conditional-update pattern correctly prevent double-counting. Test coverage for the emitter and all LoggingSession emission paths is solid. The only finding is an omitted optional field in a telemetry call.

apps/sim/app/api/cron/cleanup-stale-executions/route.ts — the already-computed totalDurationMs is not forwarded to recordExecutionCompleted.

Important Files Changed

Filename Overview
apps/sim/lib/monitoring/metrics.ts Cleanly generalised to a namespace-aware emitter; buffer, flush, and exit-drain logic unchanged; GRAFANA_DEPLOYMENT_ENVIRONMENT added to the env chain.
apps/sim/lib/logs/execution/logging-session.ts Metric emission integrated across all completion paths; completionMetricEmitted guard prevents double-counting; success path passes raw duration = totalDurationMs
apps/sim/executor/execution/block-executor.ts recordBlockMetric correctly gated behind the blockLog check on both success and error paths; operation cardinality guard via regex looks correct.
apps/sim/app/api/cron/cleanup-stale-executions/route.ts Conditional-update pattern correctly prevents double-counting with concurrent worker completions; recordExecutionCompleted emitted correctly but totalDurationMs is not forwarded, so ExecutionDuration data is never emitted for this failure class.
apps/sim/lib/monitoring/metrics.test.ts New test file covering namespace grouping, buffer drain, dimension shapes, failure drop, and missing-duration skipping; good coverage of the emitter contract.
apps/sim/lib/logs/execution/logging-session.test.ts Comprehensive metric tests added: start/resume discrimination, all completion paths, pause→markAsFailed sequencing, dedup guard, and primary-write-failure fallback.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Workflow execution starts] --> B[LoggingSession.start / safeStart]
    B --> C[recordExecutionStarted Sim/Workflow]
    C --> D{Execution path}
    D --> E[BlockExecutor.execute]
    E --> F{Block outcome}
    F -->|success| G[recordBlockExecuted success=true + BlockDuration]
    F -->|error| H[recordBlockExecuted success=false + BlockDuration]
    D --> I{Session completion}
    I -->|complete| J[recordExecutionCompleted status=success + ExecutionDuration]
    I -->|completeWithError| K[recordExecutionCompleted status=failed + ExecutionDuration]
    I -->|completeWithCancellation| L[recordExecutionCompleted status=cancelled + ExecutionDuration]
    I -->|completeWithPause| M[recordExecutionPaused]
    I -->|cost-only fallback| N[recordExecutionCompleted status=failed/cancelled/success]
    I -->|markAsFailed| O[recordExecutionCompleted status=failed no durationMs]
    M -->|later markAsFailed| O
    P[cleanup-stale-executions cron] -->|crashed workers only| Q[recordExecutionCompleted status=failed no durationMs]
    J & K & L & N & O & Q --> R[(CloudWatch Sim/Workflow)]
    G & H --> R
Loading

Reviews (3): Last reviewed commit: "improvement(metrics): one-shot guard on ..." | Re-trigger Greptile

Comment thread apps/sim/lib/logs/execution/logging-session.ts
Comment thread apps/sim/lib/logs/execution/logging-session.ts
Comment on lines 58 to 63
const ENVIRONMENT =
process.env.OTEL_DEPLOYMENT_ENVIRONMENT ||
process.env.DEPLOYMENT_ENVIRONMENT ||
process.env.GRAFANA_DEPLOYMENT_ENVIRONMENT ||
process.env.NODE_ENV ||
'development'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 GRAFANA_DEPLOYMENT_ENVIRONMENT ordering may not fix staging

GRAFANA_DEPLOYMENT_ENVIRONMENT is checked only after DEPLOYMENT_ENVIRONMENT. If staging workers already have DEPLOYMENT_ENVIRONMENT set to any value — even 'production' (common when a shared env template is used) — the Grafana variable is silently shadowed and the fix has no effect. If staging workers consistently have GRAFANA_DEPLOYMENT_ENVIRONMENT but not DEPLOYMENT_ENVIRONMENT, the current order is fine; but if the opposite is possible, GRAFANA_DEPLOYMENT_ENVIRONMENT should move earlier in the chain, or the comment should document the expected env layout explicitly.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the premise does not hold: the infra CDK stacks set neither OTEL_DEPLOYMENT_ENVIRONMENT nor DEPLOYMENT_ENVIRONMENT anywhere (grepped application-stack and constructs), and CloudWatch currently shows only Environment=production series — which is exactly the bug. The ordering is intentional: the OTEL_/DEPLOYMENT_ vars stay as explicit overrides, with the Grafana label (already set per-env for trigger.dev workers) as the next-best signal. The authoritative fix — DEPLOYMENT_ENVIRONMENT: config.environment on the ECS env maps — lands in the infra repo separately.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR generalizes the existing metrics.ts CloudWatch emitter to support multiple namespaces and wires up new Sim/Workflow metrics for execution lifecycle and per-block execution tracking. The hostedKeyMetrics API is unchanged; new workflowMetrics helpers (recordExecutionStarted, recordExecutionCompleted, recordExecutionPaused, recordBlockExecuted) are added with appropriate cardinality guards.

  • metrics.ts: The buffer is now keyed by namespace (BufferedDatum), flushMetrics() groups by namespace before calling PutMetricData, and GRAFANA_DEPLOYMENT_ENVIRONMENT is added to the ENVIRONMENT fallback chain so trigger.dev staging workers no longer collapse to production.
  • logging-session.ts: A completionMetricEmitted flag guards all terminal paths (complete, completeWithError, completeWithCancellation, cost-only fallback, markAsFailed) against double-counting; pause is tracked separately via recordExecutionPaused.
  • block-executor.ts: recordBlockMetric is called on both success and error paths inside the blockLog guard; the Operation dimension is regex-validated to prevent cardinality explosion.
  • cleanup-stale-executions/route.ts: Adds trigger to the stale-execution query and calls recordExecutionCompleted(failed) for crashed workers that never reach a LoggingSession completion path.

Confidence Score: 4/5

Safe to merge; all terminal completion paths are guarded against double-counting and the CloudWatch flush machinery is unmodified. The two findings are confined to metric accuracy, not execution correctness.

The dedup flag (completionMetricEmitted), the regex cardinality guard on Operation, and the trigger type safety (notNull DB column) are all correct. The safeStart double-emit risk is latent rather than present — nothing currently triggers it — but the pattern is fragile enough to note. The cost-only fallback's || 0 duration passes a zero ExecutionDuration data point whenever totalDurationMs is absent, which will skew CloudWatch percentile math for that path.

The two metric-emission sites in logging-session.ts — around safeStart (line 796) and the cost-only fallback duration (line 1080) — are worth a second look before the staging rollout produces misleading percentile data.

Important Files Changed

Filename Overview
apps/sim/lib/monitoring/metrics.ts Core emitter generalized to support namespace-aware buffering; new workflowMetrics API added with correct cardinality guards and GRAFANA_DEPLOYMENT_ENVIRONMENT env-var fallback.
apps/sim/lib/logs/execution/logging-session.ts Metric emission wired across all terminal paths; completionMetricEmitted flag prevents double-counting; a subtle gap exists in the safeStart fallback path (see comment).
apps/sim/executor/execution/block-executor.ts recordBlockMetric added to both success and error paths, guarded behind blockLog existence check and operation regex filter; duration is always computed before the guard.
apps/sim/app/api/cron/cleanup-stale-executions/route.ts trigger column added to select query (notNull in schema); recordExecutionCompleted(failed) correctly placed after the DB update succeeds, handling crashed-worker completions.
apps/sim/lib/monitoring/metrics.test.ts New test file covering namespace grouping, buffer drain, dimension shapes, and failure-drop behavior for the generalized emitter.
apps/sim/lib/logs/execution/logging-session.test.ts Comprehensive new test suite for metric emission: start/resume gate, all completion paths, pause/failed dedup, cost-only fallback idempotency, and cancelled-skip behavior.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant LoggingSession
    participant BlockExecutor
    participant workflowMetrics
    participant Buffer
    participant CloudWatch

    Caller->>LoggingSession: start() / safeStart()
    LoggingSession->>workflowMetrics: "recordExecutionStarted({trigger})"
    workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionStarted)

    loop Per block
        BlockExecutor->>BlockExecutor: execute block
        alt Success
            BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:true, durationMs})"
        else Error
            BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:false, durationMs})"
        end
        workflowMetrics->>Buffer: enqueue(Sim/Workflow, BlockExecuted + BlockDuration)
    end

    alt Normal completion
        LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status, durationMs})"
    else Pause
        LoggingSession->>workflowMetrics: "recordExecutionPaused({trigger})"
    else markAsFailed
        LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed}) [guarded by completionMetricEmitted]"
    else Crashed worker (cron)
        Note over LoggingSession,workflowMetrics: cleanup-stale-executions route
        LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed})"
    end
    workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionCompleted + ExecutionDuration?)

    Note over Buffer,CloudWatch: Every 5s / threshold / SIGTERM
    Buffer->>CloudWatch: PutMetricData per namespace (Sim/Workflow, Sim/HostedKey)
Loading

Reviews (2): Last reviewed commit: "feat(metrics): emit workflow execution a..." | Re-trigger Greptile

Comment thread apps/sim/lib/logs/execution/logging-session.ts Outdated
Comment thread apps/sim/lib/logs/execution/logging-session.ts
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

Comment thread apps/sim/lib/logs/execution/logging-session.ts Outdated
Comment thread apps/sim/lib/logs/execution/logging-session.ts Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 83bfcbd. Configure here.

Comment thread apps/sim/lib/logs/execution/logging-session.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant