Skip to content

OCPBUGS-85486: Add client retry to critical monitor test setup to avoid transient et…#31249

Merged
openshift-merge-bot[bot] merged 4 commits into
openshift:mainfrom
xueqzhan:monitor-setup-retry
Jun 5, 2026
Merged

OCPBUGS-85486: Add client retry to critical monitor test setup to avoid transient et…#31249
openshift-merge-bot[bot] merged 4 commits into
openshift:mainfrom
xueqzhan:monitor-setup-retry

Conversation

@xueqzhan

@xueqzhan xueqzhan commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

…cd errors

Summary by CodeRabbit

  • New Features

    • Added automated retry with exponential backoff that detects transient API failures and retries setup operations only for those cases.
    • Retries now return the last transient error on overall timeout to surface the root cause.
  • Tests

    • Wrapped multiple test setup steps in the new retry logic and made "already exists" conditions non-fatal by reusing existing resources.
    • Added short, strategic delays around load-balancer and deployment steps to improve stability.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@xueqzhan: This pull request references Jira Issue OCPBUGS-85486, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…cd errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Walkthrough

Adds a transient Kubernetes API error classifier and an exponential-backoff retry helper, then applies retries to test setup resource creation and cluster queries in multiple disruption tests and the kube-apiserver disruption test.

Changes

API Resilience Through Exponential Backoff Retries

Layer / File(s) Summary
Transient error classification and exponential backoff utility
pkg/monitortestlibrary/utility/retry.go
IsTransientAPIError detects transient Kubernetes API errors (timeouts, specific HTTP 502/504 StatusErrors, etcd timeout substring). RetryWithExponentialBackoff retries a callback only for transient errors with exponential backoff, logs warnings, and returns the last transient error on wait timeout.
kube-apiserver imports
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
Adds monitortestlibrary/utility import and OpenShift configv1 types used by StartCollection and helpers.
kube-apiserver RBAC & deployment creates
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
Wraps ClusterRole/ClusterRoleBinding/RoleBinding/Deployment/PodDisruptionBudget Create calls in RetryWithExponentialBackoff, treats AlreadyExists as success, and records or discovers CRB names when present.
kube-apiserver StartCollection & Cleanup
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
Retries Infrastructure and Node listing calls in StartCollection; adds 2s sleeps around localhost deployment creation; Cleanup only deletes RBAC objects when their recorded names are non-empty.
Pod network disruption test retry integration
pkg/monitortests/network/disruptionpodnetwork/monitortest.go
PrepareCollection now retries namespace, role binding, node listing, initial poller deployments, pod-/host-network target deployments and services, and later poller deployments; treats AlreadyExists as success and GET-falls back for the pod-network target Service to assign pna.targetService.
Service load balancer disruption test retry integration
pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
Wraps Infrastructure get, Nodes list, Namespace create, and TCP LoadBalancer Service create in exponential-backoff retries; on Service AlreadyExists falls back to Get; inserts brief time.Sleep(2 * time.Second) pauses after provisioning steps.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • deads2k
  • p0lyn0mial
🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Uses unconditional time.Sleep() calls that ignore context cancellation: disruptionserviceloadbalancer (4x) and disruptioninclusterapiserver (2x). Replace time.Sleep() with context-aware waits using select with ctx.Done() to allow prompt test cancellation.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding client retry logic to monitor test setup to handle transient errors, which is directly reflected in all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies Monitor Test framework (not Ginkgo) setup code without introducing any test title declarations with dynamic content; existing test names are static constants.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. Only existing monitor test setup functions were modified. Check not applicable to this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds utility functions and modifies monitor test setup code, not new Ginkgo e2e tests. No It()/Describe()/Context()/When() declarations found.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds retry logic only; no new deployment manifests or topology-unfriendly constraints introduced. Replica counts adapt to topology (1 on SNO, 2 on others).
Ote Binary Stdout Contract ✅ Passed No stdout writes in process-level code. klog.Warningf writes to stderr by default; no fmt.Print/Println found in changes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e test blocks are added; changes are to existing monitor test setup code and new utility functions.
No-Weak-Crypto ✅ Passed No weak cryptography patterns, custom crypto implementations, or non-constant-time secret comparisons detected. PR adds retry/resilience logic to monitor tests only.
Container-Privileges ✅ Passed PR adds only retry logic to test code; no new or modified manifests with privileged containers were added.
No-Sensitive-Data-In-Logs ✅ Passed The only new logging added logs standard Kubernetes API errors (timeouts, status codes, error reasons) which do not expose passwords, tokens, API keys, PII, or sensitive data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from deads2k and p0lyn0mial June 1, 2026 20:01
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@xueqzhan: This pull request references Jira Issue OCPBUGS-85486, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

…cd errors

Summary by CodeRabbit

  • Tests
  • Enhanced resilience of network disruption tests by introducing exponential backoff retry logic for transient API failures in cluster operations.
  • Added strategic delays to service load balancer test workflows to improve reliability under transient conditions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/monitortestlibrary/utility/retry.go (1)

14-30: ⚡ Quick win

Harden IsTransientAPIError against nil and wrapped errors.

Two robustness gaps in this exported helper:

  • Line 26 calls err.Error() unconditionally. If a future caller passes a nil error, this panics (the apierrors.* helpers above return false for nil, so execution falls through to the strings.Contains).
  • The *apierrors.StatusError type assertion on Line 20 does not unwrap. Errors wrapped via fmt.Errorf("...: %w", err) (e.g. from jig helpers) will skip the 502/504 classification. Prefer errors.As.
♻️ Proposed hardening
 func IsTransientAPIError(err error) bool {
+	if err == nil {
+		return false
+	}
 	if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) ||
 		apierrors.IsTooManyRequests(err) || apierrors.IsServiceUnavailable(err) ||
 		apierrors.IsInternalError(err) {
 		return true
 	}
-	if statusErr, ok := err.(*apierrors.StatusError); ok {
+	var statusErr *apierrors.StatusError
+	if errors.As(err, &statusErr) {
 		code := statusErr.Status().Code
 		if code == http.StatusGatewayTimeout || code == http.StatusBadGateway {
 			return true
 		}
 	}

Add "errors" to the import block.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/utility/retry.go` around lines 14 - 30,
IsTransientAPIError currently can panic on a nil err and fails to detect wrapped
*apierrors.StatusError; fix by importing "errors", add an early guard "if err ==
nil { return false }", replace the direct type assertion with errors.As to
unwrap into a variable of type *apierrors.StatusError (e.g. var statusErr
*apierrors.StatusError; if errors.As(err, &statusErr) { ... }) and keep the
final strings.Contains check safe by relying on the nil guard (so you can still
call err.Error() after the nil check).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortestlibrary/utility/retry.go`:
- Around line 40-50: The retry loop in the function using
wait.ExponentialBackoffWithContext currently swallows transient errors
(IsTransientAPIError) by returning (false, nil), which causes
wait.ErrWaitTimeout to be returned and loses the underlying last transient
error; remove any use of wait.Interrupted and instead track the lastErr variable
inside the closure (set lastErr = err whenever IsTransientAPIError(err) or any
non-nil err occurs), then after ExponentialBackoffWithContext returns, if the
returned error is wait.ErrWaitTimeout return lastErr (so callers see the final
transient error), otherwise return the original err (allowing ctx.Err to
propagate); reference the existing closure around
wait.ExponentialBackoffWithContext and the IsTransientAPIError check to locate
where to add lastErr handling.

In `@pkg/monitortests/network/disruptionpodnetwork/monitortest.go`:
- Around line 113-120: The retry wrapper is calling non-idempotent Create
(kubeClient.CoreV1().Namespaces().Create) so a transient timeout followed by
AlreadyExists will abort retries and/or leave actualNamespace nil; change the
retried closure used by RetryWithExponentialBackoff to handle
metav1.IsAlreadyExists by performing a Get fallback and assigning the fetched
object to actualNamespace (so actualNamespace is never nil), or factor this
pattern into a small helper (e.g., CreateOrGetNamespace/CreateOrGet) and use it
for other Create call sites (role binding, poller/target deployments/services,
jig.CreateTCPService) so Create + IsAlreadyExists => Get is used inside the
retry loop to make the operation idempotent.

---

Nitpick comments:
In `@pkg/monitortestlibrary/utility/retry.go`:
- Around line 14-30: IsTransientAPIError currently can panic on a nil err and
fails to detect wrapped *apierrors.StatusError; fix by importing "errors", add
an early guard "if err == nil { return false }", replace the direct type
assertion with errors.As to unwrap into a variable of type
*apierrors.StatusError (e.g. var statusErr *apierrors.StatusError; if
errors.As(err, &statusErr) { ... }) and keep the final strings.Contains check
safe by relying on the nil guard (so you can still call err.Error() after the
nil check).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e9a90ba5-8703-46a5-a5f2-5ca877127370

📥 Commits

Reviewing files that changed from the base of the PR and between 76ed5a8 and 176a6bf.

📒 Files selected for processing (3)
  • pkg/monitortestlibrary/utility/retry.go
  • pkg/monitortests/network/disruptionpodnetwork/monitortest.go
  • pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

Comment thread pkg/monitortestlibrary/utility/retry.go Outdated
Comment thread pkg/monitortests/network/disruptionpodnetwork/monitortest.go
@xueqzhan

xueqzhan commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn-upgrade

@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b6d8fdf0-5df9-11f1-8fdc-ed229b6091f4-0

@xueqzhan

xueqzhan commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn

@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf9bba90-5df9-11f1-8e88-fb87807b8c06-0

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 1, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@xueqzhan

xueqzhan commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest required

@xueqzhan

xueqzhan commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

@openshift-trt

openshift-trt Bot commented Jun 2, 2026

Copy link
Copy Markdown

Job Failure Risk Analysis for sha: 2d262a4

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-csi IncompleteTests
Tests for this run (3) are below the historical average (1932): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@mkowalski

Copy link
Copy Markdown
Contributor

Thanks for working on this @xueqzhan — this is essentially the same fix I opened in #31253 but your approach is more thorough (handling IsAlreadyExists, retrying reads, proper context propagation).

One thing I noticed: pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go has the same problem but isn't covered by this PR. It creates all its resources in StartCollection (not PrepareCollection, which is a no-op), so it may have been easy to miss.

There are 13 unprotected API calls in its StartCollection call chain:

Create calls:

  1. Namespaces().Create()createNamespace() (line 340)
  2. ServiceAccounts().Create()createServiceAccount() (line 327)
  3. ClusterRoleBindings().Create()createRBACPrivileged() (line 264)
  4. ClusterRoles().Create()createMonitorClusterRole() (line 276)
  5. Roles().Create()createMonitorRole() (line 289)
  6. ClusterRoleBindings().Create()createMonitorCRB() (line 301)
  7. RoleBindings().Create()createMonitorRB() (line 314)
  8. Deployments().Create()createDeploymentAndWaitToRollout() (line 170, called 3x for service-network, localhost, internal-lb)
  9. PodDisruptionBudgets().Create()createInternalLBDeployment() (line 213)
  10. PodDisruptionBudgets().Create()createServiceNetworkDeployment() (line 236)

List/Get calls:
11. Infrastructures().Get()StartCollection (line 474)
12. Nodes().List()StartCollection (line 504, all nodes for replica count)
13. Nodes().List()StartCollection (line 512, control plane nodes)

I checked all other monitor tests in the repo — disruptioninclusterapiserver is the only remaining one with significant unprotected resource creation. The rest either don't create resources, use higher-level helpers, or only do reads during CollectData when the cluster is stable.

Happy to add these changes to your PR or open a follow-up — let me know what you prefer.


🤖 This comment was posted by OpenClaw on behalf of @mkowalski.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go (1)

618-618: 💤 Low value

Consider documenting the rationale for these fixed sleeps.

Since createDeploymentAndWaitToRollout already watches for deployment availability before returning, these 2-second sleeps suggest an undocumented race condition. Fixed delays are fragile under varying load. Consider adding a brief comment explaining what timing issue these address, or replacing with proper synchronization if there's a specific condition to wait for.

Also applies to: 623-623

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
at line 618, There are two fixed 2s sleeps after
createDeploymentAndWaitToRollout that are undocumented and fragile; either
replace them with explicit synchronization (e.g., poll/watch the specific
condition you actually need such as Pod readiness, Service endpoints, or
deployment rollout status) or add a short comment explaining the precise race
you're working around and why a fixed 2s delay is required; locate the sleeps
(time.Sleep(2 * time.Second)) near createDeploymentAndWaitToRollout in
monitortest.go and either implement the proper wait (polling the
Deployment/Pod/Endpoint conditions or using a watcher) or document the timing
rationale inline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Line 358: The assignment rbacMonitorCRBName = rbacMonitorRBObj.Name wrongly
overwrites the ClusterRoleBinding name with the RoleBinding name; remove this
line so rbacMonitorCRBName retains the correct value set when creating the
ClusterRoleBinding (see symbols createMonitorCRB, createMonitorRB,
rbacMonitorCRBName, rbacMonitorRBObj and Cleanup) and rely on namespace deletion
to remove the RoleBinding instead of assigning its name to the CRB variable.

---

Nitpick comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Line 618: There are two fixed 2s sleeps after createDeploymentAndWaitToRollout
that are undocumented and fragile; either replace them with explicit
synchronization (e.g., poll/watch the specific condition you actually need such
as Pod readiness, Service endpoints, or deployment rollout status) or add a
short comment explaining the precise race you're working around and why a fixed
2s delay is required; locate the sleeps (time.Sleep(2 * time.Second)) near
createDeploymentAndWaitToRollout in monitortest.go and either implement the
proper wait (polling the Deployment/Pod/Endpoint conditions or using a watcher)
or document the timing rationale inline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 454fdbc3-70ac-4220-a8f2-ee254b92b486

📥 Commits

Reviewing files that changed from the base of the PR and between 2d262a4 and 28713ce.

📒 Files selected for processing (1)
  • pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go

Comment thread pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go Outdated
@mkowalski

Copy link
Copy Markdown
Contributor

/test verify-deps

 	github.com/docker/docker/api/types/system: github.com/docker/docker@v28.5.2+incompatible: read "https://proxy.golang.org/github.com/docker/docker/@v/v28.5.2+incompatible.zip": stream error: stream ID 1285; INTERNAL_ERROR; received from peer 

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go (2)

650-655: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the two StartCollection waits respect ctx cancellation.

StartCollection takes ctx but performs two unconditional time.Sleep(2 * time.Second) calls (before createLocalhostDeployment and before createInternalLBDeployment), so cancellation/timeout can be delayed by up to ~4s.

♻️ Proposed fix
-	time.Sleep(2 * time.Second)
+	select {
+	case <-time.After(2 * time.Second):
+	case <-ctx.Done():
+		return ctx.Err()
+	}
 	err = i.createLocalhostDeployment(ctx)
 	if err != nil {
 		return fmt.Errorf("error creating localhost: %v", err)
 	}
-	time.Sleep(2 * time.Second)
+	select {
+	case <-time.After(2 * time.Second):
+	case <-ctx.Done():
+		return ctx.Err()
+	}
 	err = i.createInternalLBDeployment(ctx, apiIntHost, apiIntPort)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
around lines 650 - 655, The two unconditional time.Sleep(2 * time.Second) calls
in StartCollection block should be replaced with context-aware waits so ctx
cancellation/timeout is respected; instead of sleeping directly before calling
createLocalhostDeployment and before createInternalLBDeployment, replace each
sleep with a select that waits on time.After(2*time.Second) and on ctx.Done(),
returning ctx.Err() if cancelled. Update the StartCollection function where the
sleeps occur (surrounding createLocalhostDeployment and
createInternalLBDeployment) to perform this select-based wait (or call a small
helper waitWithContext(ctx, 2*time.Second)) so we exit promptly on ctx
cancellation.

421-426: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fix createNamespace ambiguous-success recovery when retrying Namespaces().Create

  • pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/namespace.yaml uses metadata.generateName (not a stable name), so a retry after a post-persist/timeout can create another Namespace with a different generated name.
  • createNamespace in pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go retries client.Create(...) and returns the next error verbatim, with no AlreadyExists/Get reconciliation (or label-based recovery for generateName), so setup can still fail and/or leave behind extra labeled namespaces.
  • Reconcile explicitly after a failed create: make the operation idempotent (stable name) or, for generateName, reconcile by a unique per-test label selector (then Get the matching Namespace) instead of blindly retrying raw Create.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
around lines 421 - 426, createNamespace currently retries client.Create(...) and
can return ambiguous failures because the manifest uses metadata.generateName;
instead make the create operation idempotent by reconciling on failure: in the
RetryWithExponentialBackoff closure (in function createNamespace in
monitortest.go) if client.Create returns an error, do not just return it—query
the API for a matching Namespace using a unique per-test label selector
(add/ensure a deterministic label on the Namespace creation path so you can
List/Get by that label) and if you find exactly one matching Namespace set
actualNamespace to that object and return nil; only return the original error if
no matching Namespace is found or multiple ambiguous matches exist. Ensure this
logic references client.Namespaces().Create and client.Namespaces().List/Get and
uses the same test-unique label you apply in the namespace manifest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Around line 281-301: The createRBACPrivileged/createMonitorCRB and
createNamespace flows are non-idempotent because they POST manifests with
metadata.generateName and on retry may create duplicates; change these routines
to use deterministic metadata.name (set a fixed name or derive one from
i.testRunID) before Create so retries target the same resource and Cleanup can
reliably delete it; alternatively switch Create to a server-side apply/patch
(e.g., use Patch/Apply) or, after a transient create error, list by a unique
label you add to the manifest (use a dedicated label key/value) instead of
prefix-matching in findClusterRoleBindingByPrefix so you can unambiguously
reconcile the single resource. Also replace the unconditional time.Sleep(2 *
time.Second) in StartCollection with a context-aware wait (select on time.After
vs ctx.Done) so the operation cancels promptly on context timeout/cancellation.
Ensure references: createRBACPrivileged, createMonitorCRB, createNamespace,
findClusterRoleBindingByPrefix, StartCollection.

---

Outside diff comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Around line 650-655: The two unconditional time.Sleep(2 * time.Second) calls
in StartCollection block should be replaced with context-aware waits so ctx
cancellation/timeout is respected; instead of sleeping directly before calling
createLocalhostDeployment and before createInternalLBDeployment, replace each
sleep with a select that waits on time.After(2*time.Second) and on ctx.Done(),
returning ctx.Err() if cancelled. Update the StartCollection function where the
sleeps occur (surrounding createLocalhostDeployment and
createInternalLBDeployment) to perform this select-based wait (or call a small
helper waitWithContext(ctx, 2*time.Second)) so we exit promptly on ctx
cancellation.
- Around line 421-426: createNamespace currently retries client.Create(...) and
can return ambiguous failures because the manifest uses metadata.generateName;
instead make the create operation idempotent by reconciling on failure: in the
RetryWithExponentialBackoff closure (in function createNamespace in
monitortest.go) if client.Create returns an error, do not just return it—query
the API for a matching Namespace using a unique per-test label selector
(add/ensure a deterministic label on the Namespace creation path so you can
List/Get by that label) and if you find exactly one matching Namespace set
actualNamespace to that object and return nil; only return the original error if
no matching Namespace is found or multiple ambiguous matches exist. Ensure this
logic references client.Namespaces().Create and client.Namespaces().List/Get and
uses the same test-unique label you apply in the namespace manifest.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 27c0d11f-eb00-4b9a-a37a-644fdb38c97d

📥 Commits

Reviewing files that changed from the base of the PR and between 28713ce and 71fc96b.

📒 Files selected for processing (3)
  • pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
  • pkg/monitortests/network/disruptionpodnetwork/monitortest.go
  • pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
💤 Files with no reviewable changes (2)
  • pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
  • pkg/monitortests/network/disruptionpodnetwork/monitortest.go

Comment on lines 281 to +301
client := i.kubeClient.RbacV1().ClusterRoleBindings()
obj, err := client.Create(ctx, rbacPrivilegedObj, metav1.CreateOptions{})
if err != nil && !apierrors.IsAlreadyExists(err) {
var created *rbacv1.ClusterRoleBinding
if err := utility.RetryWithExponentialBackoff(ctx, func() error {
var createErr error
created, createErr = client.Create(ctx, rbacPrivilegedObj, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(createErr) {
return nil
}
return createErr
}); err != nil {
return fmt.Errorf("error creating privileged SCC CRB: %v", err)
}
rbacPrivilegedCRBName = obj.Name
if created != nil {
rbacPrivilegedCRBName = created.Name
} else {
name, err := findClusterRoleBindingByPrefix(ctx, client, rbacPrivilegedObj.GenerateName)
if err != nil {
return fmt.Errorf("error finding existing privileged SCC CRB: %v", err)
}
rbacPrivilegedCRBName = name
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Embedded CRB manifests =="
sed -n '1,80p' pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/crb-privileged.yaml
echo
sed -n '1,80p' pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/crb-monitor.yaml
echo
echo "== Retried CRB create sites =="
rg -n -C2 'ClusterRoleBindings\(\)\.Create|RetryWithExponentialBackoff|findClusterRoleBindingByPrefix' \
  pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go

Repository: openshift/origin

Length of output: 5137


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"

echo "== findClusterRoleBindingByPrefix implementation =="
sed -n '395,450p' "$FILE" | cat -n

echo
echo "== CRB privileged create/recovery block =="
sed -n '260,330p' "$FILE" | cat -n

echo
echo "== CRB monitor create/recovery block =="
sed -n '330,410p' "$FILE" | cat -n

echo
echo "== createNamespace retry block =="
rg -n "func \\(.*\\) createNamespace|createNamespace\\(" "$FILE"
# Print near the match(s)
rg -n -C 3 "createNamespace\\(" "$FILE" | head -n 80
# Best-effort: show likely region around the namespace create
sed -n '410,520p' "$FILE" | cat -n

echo
echo "== time.Sleep occurrences (context) =="
rg -n "time\\.Sleep\\(" "$FILE" -n
rg -n -C 2 "time\\.Sleep\\(" "$FILE"

Repository: openshift/origin

Length of output: 15875


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find cleanup / usage of stored CRB names
rg -n "rbacPrivilegedCRBName|rbacMonitorCRBName" pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go

echo "----"

# 2) Inspect retry implementation to see what it retries on
FILE="pkg/monitortestlibrary/utility/retry.go"
if [ -f "$FILE" ]; then
  sed -n '1,220p' "$FILE" | cat -n
else
  echo "Missing expected retry file: $FILE"
fi

Repository: openshift/origin

Length of output: 2514


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"

echo "== namespaceYaml definition =="
rg -n "namespaceYaml" "$FILE"

echo
echo "== Show surrounding YAML embedded/constructed for namespaceYaml =="
# Print up to ~80 lines around first declaration/assignment
line=$(rg -n "namespaceYaml" "$FILE" | head -n 1 | cut -d: -f1)
start=$((line-40))
end=$((line+140))
sed -n "${start},${end}p" "$FILE" | cat -n

Repository: openshift/origin

Length of output: 7997


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"
EMBED_DIR="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests"

echo "== namespace.yaml =="
sed -n '1,120p' "$EMBED_DIR/namespace.yaml" | cat -n

echo
echo "== call site around createNamespace(ctx) and assigned namespaceName =="
# Find where createNamespace return used
rg -n "createNamespace\(ctx\)" "$FILE"

line=$(rg -n "namespace, err := i\.createNamespace\(ctx\)" "$FILE" | head -n 1 | cut -d: -f1)
start=$((line-40)); end=$((line+80))
sed -n "${start},${end}p" "$FILE" | cat -n

echo
echo "== cleanup / deletion of namespaces - where i.namespaceName is used =="
rg -n "deleteNamespace\\(|removeExistingMonitorNamespaces\\(|namespaceName" "$FILE" | head -n 80

echo
echo "== StartCollection time.Sleep context (around 640-670) =="
sed -n '620,690p' "$FILE" | cat -n

Repository: openshift/origin

Length of output: 11181


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"

echo "== Cleanup function around namespace/CRB deletion =="
# show around first occurrence of func (i *InvariantInClusterDisruption) Cleanup
line=$(rg -n "func \\(i \\*InvariantInClusterDisruption\\) Cleanup" "$FILE" | head -n 1 | cut -d: -f1)
start=$((line-40)); end=$((line+120))
sed -n "${start},${end}p" "$FILE" | cat -n

Repository: openshift/origin

Length of output: 4352


Fix non-idempotent RBAC/namespace creation on transient retries (generatedName + prefix lookup)

  • createRBACPrivileged and createMonitorCRB retry ClusterRoleBindings().Create using CRB manifests with metadata.generateName, then recover via findClusterRoleBindingByPrefix (list all CRBs and return the first strings.HasPrefix). If the API creates the CRB but the client sees a retryable timeout/5xx, subsequent retries can create additional CRBs with different generated names; Cleanup deletes only rbacPrivilegedCRBName / rbacMonitorCRBName, leaving orphans.
  • createNamespace retries Namespaces().Create using metadata.generateName (manifests/namespace.yaml), with no deterministic name reconciliation; Cleanup deletes only i.namespaceName, so transient-create ambiguity can leave orphaned monitoring namespaces.
  • StartCollection uses unconditional time.Sleep(2 * time.Second); switch to a context-aware wait so cancellation/timeouts stop the test promptly.

Also applies to: 343-363, 401-411

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
around lines 281 - 301, The createRBACPrivileged/createMonitorCRB and
createNamespace flows are non-idempotent because they POST manifests with
metadata.generateName and on retry may create duplicates; change these routines
to use deterministic metadata.name (set a fixed name or derive one from
i.testRunID) before Create so retries target the same resource and Cleanup can
reliably delete it; alternatively switch Create to a server-side apply/patch
(e.g., use Patch/Apply) or, after a transient create error, list by a unique
label you add to the manifest (use a dedicated label key/value) instead of
prefix-matching in findClusterRoleBindingByPrefix so you can unambiguously
reconcile the single resource. Also replace the unconditional time.Sleep(2 *
time.Second) in StartCollection with a context-aware wait (select on time.After
vs ctx.Done) so the operation cancels promptly on context timeout/cancellation.
Ensure references: createRBACPrivileged, createMonitorCRB, createNamespace,
findClusterRoleBindingByPrefix, StartCollection.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@mkowalski

Copy link
Copy Markdown
Contributor

/lgtm

@mkowalski

Copy link
Copy Markdown
Contributor

/verified by CI

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2026
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 5, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mkowalski: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski, xueqzhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@xueqzhan: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit b1497b7 into openshift:main Jun 5, 2026
21 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@xueqzhan: Jira Issue Verification Checks: Jira Issue OCPBUGS-85486
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-85486 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

…cd errors

Summary by CodeRabbit

  • New Features

  • Added automated retry with exponential backoff that detects transient API failures and retries setup operations only for those cases.

  • Retries now return the last transient error on overall timeout to surface the root cause.

  • Tests

  • Wrapped multiple test setup steps in the new retry logic and made "already exists" conditions non-fatal by reusing existing resources.

  • Added short, strategic delays around load-balancer and deployment steps to improve stability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot

Copy link
Copy Markdown
Contributor

Fix included in release 5.0.0-0.nightly-2026-06-06-100407

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants