Skip to content

fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756

Open
ci-operator wants to merge 1 commit into
tektoncd:mainfrom
ci-operator:distributed-tracing
Open

fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756
ci-operator wants to merge 1 commit into
tektoncd:mainfrom
ci-operator:distributed-tracing

Conversation

@ci-operator

Copy link
Copy Markdown
Contributor

📝 Description of the Change

PR #2605 ran PaC's tracing through Knative's config-observability flat
tracing-sampling-rate. At fractional rates each service in the chain rolls
independently — PaC can drop a trace while Tekton keeps it, leaving execution
spans whose parent_spanID points at nothing.

This PR switches to the OTel SDK directly. OTEL_TRACES_SAMPLER opens the
parentbased_* family, which honors the root span's sample decision in the
W3C traceparent flag bit so the whole chain is kept or dropped together.

Controller and watcher call tracing.New() at startup; tracing is opt-in
(both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set).
The tracing-* keys are dropped from the _example ConfigMap block since
they no longer apply. Resource service.name is pipelines-as-code.
Propagator is W3C TraceContext only; Baggage is intentionally not honored
per Konflux-CI ADR 0061.

🔗 Linked GitHub Issue

Fixes #

🧪 Testing Strategy

  • Unit tests
  • Integration tests
  • End-to-end tests
  • Manual testing
  • Not Applicable

🤖 AI Assistance

AI assistance can be used for various tasks, such as code generation,
documentation, or testing.

Please indicate whether you have used AI assistance
for this PR and provide details if applicable.

  • I have not used any AI assistance for this PR.
  • I have used AI assistance for this PR.

Important

Slop will be simply rejected, if you are using AI assistance you need to make sure you
understand the code generated and that it meets the project's standards. you
need at least know how to run the code and deploy it (if needed). See
startpaac to make it easy
to deploy and test your code changes.

If the majority of the code in this PR was generated by an AI, please add a Co-authored-by trailer to your commit message.
For example:

Co-authored-by: Claude noreply@anthropic.com

✅ Submitter Checklist

  • 📝 My commit messages are clear, informative, and follow the project's How to write a git commit message guide. The Gitlint linter ensures in CI it's properly validated
  • ✨ I have ensured my commit message prefix (e.g., fix:, feat:) matches the "Type of Change" I selected above.
  • ♽ I have run make test and make lint locally to check for and fix any
    issues. For an efficient workflow, I have considered installing
    pre-commit and running pre-commit install to
    automate these checks.
  • 📖 I have added or updated documentation for any user-facing changes.
  • 🧪 I have added sufficient unit tests for my code changes.
  • 🎁 I have added end-to-end tests where feasible. See README for more details.
  • 🔎 I have addressed any CI test flakiness or provided a clear reason to bypass it.
  • If adding a provider feature, I have filled in the following and updated the provider documentation:
    • GitHub App
    • GitHub Webhook
    • Gitea/Forgejo
    • GitLab
    • Bitbucket Cloud
    • Bitbucket Data Center

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions Pipelines-as-Code tracing configuration from a custom ConfigMap to standard OpenTelemetry environment variables (such as OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER). It introduces a new tracing package to initialize and manage the tracer provider lifecycle, updating both the controller and adapter to initialize and shut down the provider correctly. A critical issue was identified in pkg/tracing/provider.go where passing the application's cancellation context to the OTLP exporter prevents final spans from being flushed during shutdown; using context.Background() is recommended instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/tracing/provider.go Outdated
return noopProvider()
}

exporter, err := newExporter(ctx, logger)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Passing the application's cancellation context (ctx) to the OTLP exporter creation means that when the application context is cancelled to initiate shutdown, the exporter's connection/client is also cancelled immediately. This prevents the final flush of spans during tp.Shutdown(shutdownCtx) from succeeding, leading to lost spans. Using context.Background() ensures the exporter remains functional during the shutdown flush.

Suggested change
exporter, err := newExporter(ctx, logger)
exporter, err := newExporter(context.Background(), logger)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newExporter uses context.Background() so the exporter's connection stays open through tp.Shutdown(shutdownCtx) and queued spans flush. The ctx parameter on tracing.New is dropped since nothing else used it.

@ci-operator ci-operator force-pushed the distributed-tracing branch from ff868b1 to 50ad9c1 Compare June 3, 2026 19:12
Comment on lines -55 to -68
# tracing-protocol specifies the trace export protocol.
# Supported values: "grpc", "http/protobuf", "none".
# Default is "none" (tracing disabled).
# tracing-protocol: "none"

# tracing-endpoint specifies the OTLP collector endpoint.
# Required when tracing-protocol is "grpc" or "http/protobuf".
# The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set.
# tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"

# tracing-sampling-rate controls the fraction of traces sampled.
# 0.0 = none, 1.0 = all. Default is 0 (none).
# tracing-sampling-rate: "1.0"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the configmap keys are removed, I see no change in controller. Does that mean the configmap based tracing continue to work even with these changes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Knative tracing wrapper that consumed these _example keys is replaced; the new tracer in pkg/tracing/provider.go reads the OTel-standard env vars directly. The _example block was just documentation for keys nothing reads anymore.

This is not the same ConfigMap as pipelines-as-code (the main one), which holds the tracing-label-action|application|component operator label-name mappings.

On the "no change in controller" observation - we verified end-to-end that with neither OTEL_EXPORTER_OTLP_ENDPOINT nor OTEL_TRACES_SAMPLER set, PaC falls back to a noop tracer and emits no spans. If you're seeing tracing behavior unchanged from the pre-PR state, could you share how to reproduce so we can dig in?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changes in controller, I intended to ask if the knative base (eventing/adapter) that pac uses, reads these observability configmap keys for any arbitrary configuration/operations. Ref.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - the Knative eventing-adapter still consumes config-observability for non-tracing observability (metrics/profiling/etc. via evadapter.NewObservabilityConfiguratorFromConfigMap() at the line you linked, logging is read from config-logging separately). What this PR changes is specifically the tracing portion: PaC's old Knative tracing wrapper (added in bd9f468) read tracing-protocol / tracing-endpoint / tracing-sampling-rate from this ConfigMap; the new pkg/tracing/provider.go reads the OTel-standard env vars directly and that wrapper is gone, so those three keys went with the _example block. The Knative-base reads for non-tracing observability are unchanged.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ci-operator as it looks breaking change and @theakshaypant is considerate about, can't we introduce env based configuration as fallback to configmap so that we're keeping both, considering other users may want to use configmap?

Comment thread docs/content/docs/operations/tracing.md Outdated
Both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` must be set to opt in to tracing. If either is unset PaC falls back to a noop tracer that emits no spans and incurs no exporter cost. Changes to any of these env vars take effect on the next pod restart.

### Example
PaC honors inbound `traceparent` headers on incoming webhook requests via the W3C TraceContext propagator. OTel Baggage is intentionally not honored; per [ADR 0061](https://github.com/konflux-ci/architecture/blob/main/ADR/0061-distributed-tracing.md), this delivery-tracing chain does not use Baggage for cross-service attribute propagation because all required attributes are locally available at every emission point.

@theakshaypant theakshaypant Jun 4, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not link a downstream project's design doc in PaC's documentation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is github.com/konflux-ci considered downstream?? I think NO it's another open source repository.

@theakshaypant theakshaypant Jun 4, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downstream would be anything that builds on or utilizes PaC and if a change here is affecting konflux-ci, then it does qualify as a downstream project.
Even so, point still stands. We should not link a design doc from another project.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please remove konflux references, we are not a dep on this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the konflux link entirely. Rewrote the Baggage paragraph to stand on its own (every emission point already has the attributes it needs from the local PipelineRun and webhook event, so no cross-service propagation channel is needed). Also corrected a stale tracing-endpoint reference further down the file - the OTel SDK reads OTEL_EXPORTER_OTLP_ENDPOINT automatically, so the docs just needed to point at the standard env var name.

Comment thread docs/content/docs/operations/tracing.md Outdated
Comment thread pkg/tracing/provider.go Outdated
case protocolGRPC:
return otlptracegrpc.New(ctx, otlptracegrpc.WithEndpointURL(endpoint))
default:
logger.Errorw("unsupported OTLP protocol; falling back to grpc", "protocol", protocolFromEnv())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] redundant protocolFromEnv call

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Hoisted to a local.

@zakisk zakisk force-pushed the distributed-tracing branch from 50ad9c1 to 5b41561 Compare June 4, 2026 05:34
@zakisk

zakisk commented Jun 4, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@codecov-commenter

codecov-commenter commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.29703% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.78%. Comparing base (cc818de) to head (6820cf1).

Files with missing lines Patch % Lines
pkg/tracing/provider.go 80.68% 13 Missing and 4 partials ⚠️
pkg/reconciler/controller.go 0.00% 7 Missing ⚠️
pkg/adapter/adapter.go 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2756      +/-   ##
==========================================
+ Coverage   59.73%   59.78%   +0.05%     
==========================================
  Files         210      211       +1     
  Lines       21117    21218     +101     
==========================================
+ Hits        12615    12686      +71     
- Misses       7707     7733      +26     
- Partials      795      799       +4     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chmouel

chmouel commented Jun 4, 2026

Copy link
Copy Markdown
Member

This feelds vibe coded, have you tested this properly? have you properly reviewed it?

@ci-operator ci-operator force-pushed the distributed-tracing branch from 5b41561 to 311babd Compare June 4, 2026 17:23
@ci-operator

Copy link
Copy Markdown
Contributor Author

Sorry for the churn here. A number of findings surfaced while working through the integration-service and release-service side of this work, and the OTel SDK swap in this PR turned out to be necessary to avoid a fractional-sampling fragmentation bug (Knative's tracing wrapper made per-service sampling decisions independently of the root span). The unrelated cleanups landed in the same commit, but the overall behavior is functionally identical to what shipped in #2605 (#2605): same spans, same attribute schema, same opt-in env vars, just plumbed through the OTel SDK directly so parentbased_* samplers behave coherently across the delivery chain. Let me know if this needs greater alignment with your code conventions.

The PR was validated end-to-end across more than a dozen scenarios covering each fix and each error class it mitigates:

  • Chain-coherency under fractional sampling (per-service-independent sample decisions fragmenting chains - the failure mode the PR exists to fix).
  • Dual opt-in: noop fallback when either ENDPOINT or SAMPLER is unset.
  • All sampler families plus the ratio-arg defaulting path.
  • Transport selection between gRPC and HTTP/protobuf.
  • W3C TraceContext only; Baggage explicitly suppressed.
  • Resource attribute resolution (no unknown_service fallbacks).
  • Concurrent webhook handling.
  • BSP flush behavior under graceful and abrupt controller termination.

Comment thread docs/content/docs/operations/tracing.md Outdated
Changes to `tracing-protocol`, `tracing-endpoint`, and `tracing-sampling-rate` require restarting the controller and watcher pods. The trace exporter is created once at startup from the ConfigMap values at that time. Set `tracing-protocol` to `none` or remove the tracing keys to disable tracing.

The controller and watcher locate this ConfigMap by name via the `CONFIG_OBSERVABILITY_NAME` environment variable set in their deployment manifests. Operator-based installations may manage this differently; consult the operator documentation for details.
The `parentbased_*` sampler family honors the parent span's sample decision carried in the W3C `traceparent` flag bit. When PaC, Tekton, integration-service, and release-service all use parent-based samplers, the root span's sampling decision propagates through the whole delivery chain: every service either keeps its spans or drops them based on what the root chose. Flat-rate samplers (`traceidratio` without parent-based) cause each service to roll independently, which at fractional sampling fragments the chain into orphaned spans whose `parent_spanID` references a span that was dropped. `parentbased_always_on` keeps everything; `parentbased_traceidratio` with a numeric argument samples a coherent fraction.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning/docs] Remaining Konflux-specific service references

The earlier reviewer feedback asked to remove Konflux references. The ADR link was removed, but "integration-service" and "release-service" on this line are Konflux-CI-specific service names that leaked through. PaC is upstream and shouldn't assume a specific deployment topology in its docs.

Consider generalizing:

Suggested change
The `parentbased_*` sampler family honors the parent span's sample decision carried in the W3C `traceparent` flag bit. When PaC, Tekton, integration-service, and release-service all use parent-based samplers, the root span's sampling decision propagates through the whole delivery chain: every service either keeps its spans or drops them based on what the root chose. Flat-rate samplers (`traceidratio` without parent-based) cause each service to roll independently, which at fractional sampling fragments the chain into orphaned spans whose `parent_spanID` references a span that was dropped. `parentbased_always_on` keeps everything; `parentbased_traceidratio` with a numeric argument samples a coherent fraction.
The `parentbased_*` sampler family honors the parent span's sample decision carried in the W3C `traceparent` flag bit. When every service in the delivery chain uses parent-based samplers, the root span's sampling decision propagates end to end: each service either keeps its spans or drops them based on what the root chose. Flat-rate samplers (`traceidratio` without parent-based) cause each service to roll independently, which at fractional sampling fragments the chain into orphaned spans whose `parent_spanID` references a span that was dropped. `parentbased_always_on` keeps everything; `parentbased_traceidratio` with a numeric argument samples a coherent fraction.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I missed those alongside the ADR link. Reworded the sentence to be topology-neutral.

Comment thread pkg/tracing/provider.go Outdated
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.TraceContext{})

logger.Infow("tracing initialized", "endpoint", os.Getenv(EnvOTLPEndpoint), "protocol", protocolFromEnv())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit/code-quality] Redundant protocolFromEnv() — acknowledged but not yet pushed

This was flagged by a prior reviewer, and you replied "Fixed. Hoisted to a local." — but the fix hasn't been pushed yet. protocolFromEnv() is called here (line 70) and again inside newExporter() (line 91). No functional impact (env vars don't change mid-process), just redundant work.

To hoist: add proto := protocolFromEnv() before the newExporter call, pass it into newExporter as a parameter, and use the local proto in this log line.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was on wrong branch and never actually pushed it. Pulled proto := protocolFromEnv() up before the log line and pass it into newExporter so it only runs once.

@theakshaypant theakshaypant left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[AI-assisted review]
Solid work — well-structured, correct, and good test coverage. Two minor items inline.

What this does: Replaces Knative's config-observability ConfigMap-based tracing (which uses a flat tracing-sampling-rate causing per-service independent sampling) with direct OTel SDK initialization. This enables parentbased_* samplers that honor the root span's W3C traceparent flag bit, keeping the entire trace chain coherent.

What's good:

  • Clean self-contained tracing package with a clear New() / Shutdown() lifecycle
  • Correct use of context.Background() for the exporter — the gRPC/HTTP connection survives context cancellation so the BatchSpanProcessor can flush during shutdown
  • Dual opt-in guard (both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set) with clear log messages on the noop fallback path
  • Thorough test suite: all 6 sampler families, noop fallback paths, protocol precedence, shutdown idempotency. The saveGlobalProvider helper properly restores global OTel state
  • Uses gotest.tools/v3/assert and table-driven tests per project conventions
  • samplerFromEnv logs a helpful warning when a ratio sampler is selected without OTEL_TRACES_SAMPLER_ARG (defaults to 0%)

Findings (1 warning, 1 nit):

  • docs/content/docs/operations/tracing.md:23warning/docs: "integration-service" and "release-service" are Konflux-specific service names that survived the earlier Konflux-reference cleanup. Suggest generalizing to "every service in the delivery chain."
  • pkg/tracing/provider.go:70nit/code-quality: Redundant protocolFromEnv() call was acknowledged by the author as fixed ("Hoisted to a local.") but the fix hasn't been pushed yet.

Prior review feedback status:

  • ✅ Gemini's context.Background() for exporter creation — fixed
  • ✅ Konflux ADR link removed, Baggage paragraph rewritten
  • ✅ ConfigMap clarification — Knative base still reads non-tracing keys
  • ⚠️ Redundant protocolFromEnv() — acknowledged but not yet pushed

@zakisk

zakisk commented Jun 11, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@theakshaypant

Copy link
Copy Markdown
Member

/ok-to-test

@theakshaypant

Copy link
Copy Markdown
Member

/ok-to-test

Comment thread pkg/reconciler/controller.go Outdated
log := logging.FromContext(ctx)

tp := tracing.New(log)
go func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run make lint and make test locally to avoid CI failures.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppressed gosec false positive on OTel shutdown goroutine where Background is needed; lint and test pass now.

Knative's config-observability ConfigMap only exposes a flat
tracing-sampling-rate, so at fractional rates each service in the chain
rolls independently — PaC can drop a trace while Tekton keeps it, leaving
execution spans whose parent_spanID points at nothing. Switching to the
OTel SDK opens up OTEL_TRACES_SAMPLER's parentbased_* family, which honors
the root span's sample decision in the W3C traceparent flag so the whole
chain is kept or dropped together.

Controller and watcher call tracing.New() at startup. Tracing is opt-in:
both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set,
otherwise PaC falls back to noop (matching the prior tracing-sampling-rate
"0" default). Resource service.name is pipelines-as-code. Propagator is
W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI
ADR 0061. otlptracegrpc and otlptracehttp promoted from indirect to direct
dependencies.

Assisted-by: Claude Code
Signed-off-by: Josiah England <jengland@redhat.com>
@ci-operator ci-operator force-pushed the distributed-tracing branch from 6820cf1 to 7fe6ead Compare June 17, 2026 14:24
@ci-operator

Copy link
Copy Markdown
Contributor Author

Pushed gosec suppression on OTel shutdown goroutine and a small cleanup.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the configmap based tracing would no longer work, can we add a callout for any existing users.
Something like this.

Image

<-ctx.Done()
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = tp.Shutdown(shutdownCtx)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we log the shutdown error?

Comment thread pkg/tracing/provider.go
}

proto := protocolFromEnv()
exporter, err := newExporter(context.Background(), logger, proto)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per some AI review, does timeout handling needed here??

@zakisk

zakisk commented Jun 18, 2026

Copy link
Copy Markdown
Member

@ci-operator the whole point to emit traces via OTEL Sdk and drop the knative configmap is to use parentbased_* field and disable traces emitted from knative??

@zakisk

zakisk commented Jun 18, 2026

Copy link
Copy Markdown
Member

@ci-operator the whole point to emit traces via OTEL Sdk and drop the knative configmap is to use parentbased_* field and disable traces emitted from knative??

If yes, so then AFAIK, hierarchy of services is PaC controller -> Tekon Controller -> PaC watcher (please correct if I am wrong...). and I think tektoncd/pipeline still uses knative based tracing and using tracing-sampling-rate for sampling so it wouldn't be using parentbased_* sampling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants