OCPBUGS-85486: Add client retry to critical monitor test setup to avoid transient et… by xueqzhan · Pull Request #31249 · openshift/origin

xueqzhan · 2026-06-01T20:00:16Z

…cd errors

Summary by CodeRabbit

New Features
- Added automated retry with exponential backoff that detects transient API failures and retries setup operations only for those cases.
- Retries now return the last transient error on overall timeout to surface the root cause.
Tests
- Wrapped multiple test setup steps in the new retry logic and made "already exists" conditions non-fatal by reusing existing resources.
- Added short, strategic delays around load-balancer and deployment steps to improve stability.

…cd errors

openshift-merge-bot · 2026-06-01T20:00:19Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-06-01T20:00:23Z

@xueqzhan: This pull request references Jira Issue OCPBUGS-85486, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…cd errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-01T20:00:39Z

Walkthrough

Adds a transient Kubernetes API error classifier and an exponential-backoff retry helper, then applies retries to test setup resource creation and cluster queries in multiple disruption tests and the kube-apiserver disruption test.

Changes

API Resilience Through Exponential Backoff Retries

Layer / File(s)	Summary
Transient error classification and exponential backoff utility `pkg/monitortestlibrary/utility/retry.go`	`IsTransientAPIError` detects transient Kubernetes API errors (timeouts, specific HTTP 502/504 StatusErrors, etcd timeout substring). `RetryWithExponentialBackoff` retries a callback only for transient errors with exponential backoff, logs warnings, and returns the last transient error on wait timeout.
kube-apiserver imports `pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`	Adds `monitortestlibrary/utility` import and OpenShift `configv1` types used by StartCollection and helpers.
kube-apiserver RBAC & deployment creates `pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`	Wraps ClusterRole/ClusterRoleBinding/RoleBinding/Deployment/PodDisruptionBudget Create calls in `RetryWithExponentialBackoff`, treats `AlreadyExists` as success, and records or discovers CRB names when present.
kube-apiserver StartCollection & Cleanup `pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`	Retries Infrastructure and Node listing calls in StartCollection; adds 2s sleeps around localhost deployment creation; Cleanup only deletes RBAC objects when their recorded names are non-empty.
Pod network disruption test retry integration `pkg/monitortests/network/disruptionpodnetwork/monitortest.go`	PrepareCollection now retries namespace, role binding, node listing, initial poller deployments, pod-/host-network target deployments and services, and later poller deployments; treats `AlreadyExists` as success and GET-falls back for the pod-network target Service to assign `pna.targetService`.
Service load balancer disruption test retry integration `pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go`	Wraps Infrastructure get, Nodes list, Namespace create, and TCP LoadBalancer Service create in exponential-backoff retries; on Service `AlreadyExists` falls back to `Get`; inserts brief `time.Sleep(2 * time.Second)` pauses after provisioning steps.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

deads2k
p0lyn0mial

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	Uses unconditional time.Sleep() calls that ignore context cancellation: disruptionserviceloadbalancer (4x) and disruptioninclusterapiserver (2x).	Replace time.Sleep() with context-aware waits using select with ctx.Done() to allow prompt test cancellation.

✅ Passed checks (13 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding client retry logic to monitor test setup to handle transient errors, which is directly reflected in all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR modifies Monitor Test framework (not Ginkgo) setup code without introducing any test title declarations with dynamic content; existing test names are static constants.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. Only existing monitor test setup functions were modified. Check not applicable to this PR.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR adds utility functions and modifies monitor test setup code, not new Ginkgo e2e tests. No It()/Describe()/Context()/When() declarations found.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds retry logic only; no new deployment manifests or topology-unfriendly constraints introduced. Replica counts adapt to topology (1 on SNO, 2 on others).
Ote Binary Stdout Contract	✅ Passed	No stdout writes in process-level code. klog.Warningf writes to stderr by default; no fmt.Print/Println found in changes.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e test blocks are added; changes are to existing monitor test setup code and new utility functions.
No-Weak-Crypto	✅ Passed	No weak cryptography patterns, custom crypto implementations, or non-constant-time secret comparisons detected. PR adds retry/resilience logic to monitor tests only.
Container-Privileges	✅ Passed	PR adds only retry logic to test code; no new or modified manifests with privileged containers were added.
No-Sensitive-Data-In-Logs	✅ Passed	The only new logging added logs standard Kubernetes API errors (timeouts, status codes, error reasons) which do not expose passwords, tokens, API keys, PII, or sensitive data.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci-robot · 2026-06-01T20:01:38Z

@xueqzhan: This pull request references Jira Issue OCPBUGS-85486, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

…cd errors

Summary by CodeRabbit

Tests

Enhanced resilience of network disruption tests by introducing exponential backoff retry logic for transient API failures in cluster operations.

Added strategic delays to service load balancer test workflows to improve reliability under transient conditions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

pkg/monitortestlibrary/utility/retry.go (1)
14-30: ⚡ Quick win

Harden IsTransientAPIError against nil and wrapped errors.

Two robustness gaps in this exported helper:

Line 26 calls err.Error() unconditionally. If a future caller passes a nil error, this panics (the apierrors.* helpers above return false for nil, so execution falls through to the strings.Contains).

The *apierrors.StatusError type assertion on Line 20 does not unwrap. Errors wrapped via fmt.Errorf("...: %w", err) (e.g. from jig helpers) will skip the 502/504 classification. Prefer errors.As.
♻️ Proposed hardening
 func IsTransientAPIError(err error) bool {
+	if err == nil {
+		return false
+	}
 	if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) ||
 		apierrors.IsTooManyRequests(err) || apierrors.IsServiceUnavailable(err) ||
 		apierrors.IsInternalError(err) {
 		return true
 	}
-	if statusErr, ok := err.(*apierrors.StatusError); ok {
+	var statusErr *apierrors.StatusError
+	if errors.As(err, &statusErr) {
 		code := statusErr.Status().Code
 		if code == http.StatusGatewayTimeout || code == http.StatusBadGateway {
 			return true
 		}
 	}
Add "errors" to the import block.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/utility/retry.go` around lines 14 - 30,
IsTransientAPIError currently can panic on a nil err and fails to detect wrapped
*apierrors.StatusError; fix by importing "errors", add an early guard "if err ==
nil { return false }", replace the direct type assertion with errors.As to
unwrap into a variable of type *apierrors.StatusError (e.g. var statusErr
*apierrors.StatusError; if errors.As(err, &statusErr) { ... }) and keep the
final strings.Contains check safe by relying on the nil guard (so you can still
call err.Error() after the nil check).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortestlibrary/utility/retry.go`:
- Around line 40-50: The retry loop in the function using
wait.ExponentialBackoffWithContext currently swallows transient errors
(IsTransientAPIError) by returning (false, nil), which causes
wait.ErrWaitTimeout to be returned and loses the underlying last transient
error; remove any use of wait.Interrupted and instead track the lastErr variable
inside the closure (set lastErr = err whenever IsTransientAPIError(err) or any
non-nil err occurs), then after ExponentialBackoffWithContext returns, if the
returned error is wait.ErrWaitTimeout return lastErr (so callers see the final
transient error), otherwise return the original err (allowing ctx.Err to
propagate); reference the existing closure around
wait.ExponentialBackoffWithContext and the IsTransientAPIError check to locate
where to add lastErr handling.

In `@pkg/monitortests/network/disruptionpodnetwork/monitortest.go`:
- Around line 113-120: The retry wrapper is calling non-idempotent Create
(kubeClient.CoreV1().Namespaces().Create) so a transient timeout followed by
AlreadyExists will abort retries and/or leave actualNamespace nil; change the
retried closure used by RetryWithExponentialBackoff to handle
metav1.IsAlreadyExists by performing a Get fallback and assigning the fetched
object to actualNamespace (so actualNamespace is never nil), or factor this
pattern into a small helper (e.g., CreateOrGetNamespace/CreateOrGet) and use it
for other Create call sites (role binding, poller/target deployments/services,
jig.CreateTCPService) so Create + IsAlreadyExists => Get is used inside the
retry loop to make the operation idempotent.

---

Nitpick comments:
In `@pkg/monitortestlibrary/utility/retry.go`:
- Around line 14-30: IsTransientAPIError currently can panic on a nil err and
fails to detect wrapped *apierrors.StatusError; fix by importing "errors", add
an early guard "if err == nil { return false }", replace the direct type
assertion with errors.As to unwrap into a variable of type
*apierrors.StatusError (e.g. var statusErr *apierrors.StatusError; if
errors.As(err, &statusErr) { ... }) and keep the final strings.Contains check
safe by relying on the nil guard (so you can still call err.Error() after the
nil check).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e9a90ba5-8703-46a5-a5f2-5ca877127370

📥 Commits

Reviewing files that changed from the base of the PR and between 76ed5a8 and 176a6bf.

📒 Files selected for processing (3)

pkg/monitortestlibrary/utility/retry.go
pkg/monitortests/network/disruptionpodnetwork/monitortest.go
pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

xueqzhan · 2026-06-01T20:37:31Z

/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn-upgrade

openshift-ci · 2026-06-01T20:37:41Z

@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b6d8fdf0-5df9-11f1-8fdc-ed229b6091f4-0

xueqzhan · 2026-06-01T20:37:46Z

/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn

openshift-ci · 2026-06-01T20:38:02Z

@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf9bba90-5df9-11f1-8e88-fb87807b8c06-0

openshift-merge-bot · 2026-06-01T20:59:38Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

xueqzhan · 2026-06-02T17:34:10Z

/retest required

xueqzhan · 2026-06-02T17:34:46Z

/retest-required

openshift-trt · 2026-06-02T22:18:38Z

Job Failure Risk Analysis for sha: 2d262a4

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-csi	IncompleteTests Tests for this run (3) are below the historical average (1932): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

mkowalski · 2026-06-03T13:00:29Z

Thanks for working on this @xueqzhan — this is essentially the same fix I opened in #31253 but your approach is more thorough (handling IsAlreadyExists, retrying reads, proper context propagation).

One thing I noticed: pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go has the same problem but isn't covered by this PR. It creates all its resources in StartCollection (not PrepareCollection, which is a no-op), so it may have been easy to miss.

There are 13 unprotected API calls in its StartCollection call chain:

Create calls:

Namespaces().Create() — createNamespace() (line 340)
ServiceAccounts().Create() — createServiceAccount() (line 327)
ClusterRoleBindings().Create() — createRBACPrivileged() (line 264)
ClusterRoles().Create() — createMonitorClusterRole() (line 276)
Roles().Create() — createMonitorRole() (line 289)
ClusterRoleBindings().Create() — createMonitorCRB() (line 301)
RoleBindings().Create() — createMonitorRB() (line 314)
Deployments().Create() — createDeploymentAndWaitToRollout() (line 170, called 3x for service-network, localhost, internal-lb)
PodDisruptionBudgets().Create() — createInternalLBDeployment() (line 213)
PodDisruptionBudgets().Create() — createServiceNetworkDeployment() (line 236)

List/Get calls:
11. Infrastructures().Get() — StartCollection (line 474)
12. Nodes().List() — StartCollection (line 504, all nodes for replica count)
13. Nodes().List() — StartCollection (line 512, control plane nodes)

I checked all other monitor tests in the repo — disruptioninclusterapiserver is the only remaining one with significant unprotected resource creation. The rest either don't create resources, use higher-level helpers, or only do reads during CollectData when the cluster is stable.

Happy to add these changes to your PR or open a follow-up — let me know what you prefer.

🤖 This comment was posted by OpenClaw on behalf of @mkowalski.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go (1)
618-618: 💤 Low value

Consider documenting the rationale for these fixed sleeps.

Since createDeploymentAndWaitToRollout already watches for deployment availability before returning, these 2-second sleeps suggest an undocumented race condition. Fixed delays are fragile under varying load. Consider adding a brief comment explaining what timing issue these address, or replacing with proper synchronization if there's a specific condition to wait for.

Also applies to: 623-623
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
at line 618, There are two fixed 2s sleeps after
createDeploymentAndWaitToRollout that are undocumented and fragile; either
replace them with explicit synchronization (e.g., poll/watch the specific
condition you actually need such as Pod readiness, Service endpoints, or
deployment rollout status) or add a short comment explaining the precise race
you're working around and why a fixed 2s delay is required; locate the sleeps
(time.Sleep(2 * time.Second)) near createDeploymentAndWaitToRollout in
monitortest.go and either implement the proper wait (polling the
Deployment/Pod/Endpoint conditions or using a watcher) or document the timing
rationale inline.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Line 358: The assignment rbacMonitorCRBName = rbacMonitorRBObj.Name wrongly
overwrites the ClusterRoleBinding name with the RoleBinding name; remove this
line so rbacMonitorCRBName retains the correct value set when creating the
ClusterRoleBinding (see symbols createMonitorCRB, createMonitorRB,
rbacMonitorCRBName, rbacMonitorRBObj and Cleanup) and rely on namespace deletion
to remove the RoleBinding instead of assigning its name to the CRB variable.

---

Nitpick comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Line 618: There are two fixed 2s sleeps after createDeploymentAndWaitToRollout
that are undocumented and fragile; either replace them with explicit
synchronization (e.g., poll/watch the specific condition you actually need such
as Pod readiness, Service endpoints, or deployment rollout status) or add a
short comment explaining the precise race you're working around and why a fixed
2s delay is required; locate the sleeps (time.Sleep(2 * time.Second)) near
createDeploymentAndWaitToRollout in monitortest.go and either implement the
proper wait (polling the Deployment/Pod/Endpoint conditions or using a watcher)
or document the timing rationale inline.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 454fdbc3-70ac-4220-a8f2-ee254b92b486

📥 Commits

Reviewing files that changed from the base of the PR and between 2d262a4 and 28713ce.

📒 Files selected for processing (1)

pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go

mkowalski · 2026-06-04T09:50:02Z

/test verify-deps

 	github.com/docker/docker/api/types/system: github.com/docker/docker@v28.5.2+incompatible: read "https://proxy.golang.org/github.com/docker/docker/@v/v28.5.2+incompatible.zip": stream error: stream ID 1285; INTERNAL_ERROR; received from peer

openshift-merge-bot · 2026-06-04T10:01:28Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go (2)
650-655: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the two StartCollection waits respect ctx cancellation.

StartCollection takes ctx but performs two unconditional time.Sleep(2 * time.Second) calls (before createLocalhostDeployment and before createInternalLBDeployment), so cancellation/timeout can be delayed by up to ~4s.
♻️ Proposed fix
-	time.Sleep(2 * time.Second)
+	select {
+	case <-time.After(2 * time.Second):
+	case <-ctx.Done():
+		return ctx.Err()
+	}
 	err = i.createLocalhostDeployment(ctx)
 	if err != nil {
 		return fmt.Errorf("error creating localhost: %v", err)
 	}
-	time.Sleep(2 * time.Second)
+	select {
+	case <-time.After(2 * time.Second):
+	case <-ctx.Done():
+		return ctx.Err()
+	}
 	err = i.createInternalLBDeployment(ctx, apiIntHost, apiIntPort)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
around lines 650 - 655, The two unconditional time.Sleep(2 * time.Second) calls
in StartCollection block should be replaced with context-aware waits so ctx
cancellation/timeout is respected; instead of sleeping directly before calling
createLocalhostDeployment and before createInternalLBDeployment, replace each
sleep with a select that waits on time.After(2*time.Second) and on ctx.Done(),
returning ctx.Err() if cancelled. Update the StartCollection function where the
sleeps occur (surrounding createLocalhostDeployment and
createInternalLBDeployment) to perform this select-based wait (or call a small
helper waitWithContext(ctx, 2*time.Second)) so we exit promptly on ctx
cancellation.
421-426: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fix createNamespace ambiguous-success recovery when retrying Namespaces().Create

pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/namespace.yaml uses metadata.generateName (not a stable name), so a retry after a post-persist/timeout can create another Namespace with a different generated name.

createNamespace in pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go retries client.Create(...) and returns the next error verbatim, with no AlreadyExists/Get reconciliation (or label-based recovery for generateName), so setup can still fail and/or leave behind extra labeled namespaces.

Reconcile explicitly after a failed create: make the operation idempotent (stable name) or, for generateName, reconcile by a unique per-test label selector (then Get the matching Namespace) instead of blindly retrying raw Create.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`
around lines 421 - 426, createNamespace currently retries client.Create(...) and
can return ambiguous failures because the manifest uses metadata.generateName;
instead make the create operation idempotent by reconciling on failure: in the
RetryWithExponentialBackoff closure (in function createNamespace in
monitortest.go) if client.Create returns an error, do not just return it—query
the API for a matching Namespace using a unique per-test label selector
(add/ensure a deterministic label on the Namespace creation path so you can
List/Get by that label) and if you find exactly one matching Namespace set
actualNamespace to that object and return nil; only return the original error if
no matching Namespace is found or multiple ambiguous matches exist. Ensure this
logic references client.Namespaces().Create and client.Namespaces().List/Get and
uses the same test-unique label you apply in the namespace manifest.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Around line 281-301: The createRBACPrivileged/createMonitorCRB and
createNamespace flows are non-idempotent because they POST manifests with
metadata.generateName and on retry may create duplicates; change these routines
to use deterministic metadata.name (set a fixed name or derive one from
i.testRunID) before Create so retries target the same resource and Cleanup can
reliably delete it; alternatively switch Create to a server-side apply/patch
(e.g., use Patch/Apply) or, after a transient create error, list by a unique
label you add to the manifest (use a dedicated label key/value) instead of
prefix-matching in findClusterRoleBindingByPrefix so you can unambiguously
reconcile the single resource. Also replace the unconditional time.Sleep(2 *
time.Second) in StartCollection with a context-aware wait (select on time.After
vs ctx.Done) so the operation cancels promptly on context timeout/cancellation.
Ensure references: createRBACPrivileged, createMonitorCRB, createNamespace,
findClusterRoleBindingByPrefix, StartCollection.

---

Outside diff comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Around line 650-655: The two unconditional time.Sleep(2 * time.Second) calls
in StartCollection block should be replaced with context-aware waits so ctx
cancellation/timeout is respected; instead of sleeping directly before calling
createLocalhostDeployment and before createInternalLBDeployment, replace each
sleep with a select that waits on time.After(2*time.Second) and on ctx.Done(),
returning ctx.Err() if cancelled. Update the StartCollection function where the
sleeps occur (surrounding createLocalhostDeployment and
createInternalLBDeployment) to perform this select-based wait (or call a small
helper waitWithContext(ctx, 2*time.Second)) so we exit promptly on ctx
cancellation.
- Around line 421-426: createNamespace currently retries client.Create(...) and
can return ambiguous failures because the manifest uses metadata.generateName;
instead make the create operation idempotent by reconciling on failure: in the
RetryWithExponentialBackoff closure (in function createNamespace in
monitortest.go) if client.Create returns an error, do not just return it—query
the API for a matching Namespace using a unique per-test label selector
(add/ensure a deterministic label on the Namespace creation path so you can
List/Get by that label) and if you find exactly one matching Namespace set
actualNamespace to that object and return nil; only return the original error if
no matching Namespace is found or multiple ambiguous matches exist. Ensure this
logic references client.Namespaces().Create and client.Namespaces().List/Get and
uses the same test-unique label you apply in the namespace manifest.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 27c0d11f-eb00-4b9a-a37a-644fdb38c97d

📥 Commits

Reviewing files that changed from the base of the PR and between 28713ce and 71fc96b.

📒 Files selected for processing (3)

pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
pkg/monitortests/network/disruptionpodnetwork/monitortest.go
pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

💤 Files with no reviewable changes (2)

pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
pkg/monitortests/network/disruptionpodnetwork/monitortest.go

coderabbitai · 2026-06-04T14:43:49Z

 	client := i.kubeClient.RbacV1().ClusterRoleBindings()
-	obj, err := client.Create(ctx, rbacPrivilegedObj, metav1.CreateOptions{})
-	if err != nil && !apierrors.IsAlreadyExists(err) {
+	var created *rbacv1.ClusterRoleBinding
+	if err := utility.RetryWithExponentialBackoff(ctx, func() error {
+		var createErr error
+		created, createErr = client.Create(ctx, rbacPrivilegedObj, metav1.CreateOptions{})
+		if apierrors.IsAlreadyExists(createErr) {
+			return nil
+		}
+		return createErr
+	}); err != nil {
 		return fmt.Errorf("error creating privileged SCC CRB: %v", err)
 	}
-	rbacPrivilegedCRBName = obj.Name
+	if created != nil {
+		rbacPrivilegedCRBName = created.Name
+	} else {
+		name, err := findClusterRoleBindingByPrefix(ctx, client, rbacPrivilegedObj.GenerateName)
+		if err != nil {
+			return fmt.Errorf("error finding existing privileged SCC CRB: %v", err)
+		}
+		rbacPrivilegedCRBName = name
+	}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Embedded CRB manifests ==" sed -n '1,80p' pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/crb-privileged.yaml echo sed -n '1,80p' pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/crb-monitor.yaml echo echo "== Retried CRB create sites ==" rg -n -C2 'ClusterRoleBindings\.Create|RetryWithExponentialBackoff|findClusterRoleBindingByPrefix' \ pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go

Repository: openshift/origin

Length of output: 5137

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go" echo "== findClusterRoleBindingByPrefix implementation ==" sed -n '395,450p' "$FILE" | cat -n echo echo "== CRB privileged create/recovery block ==" sed -n '260,330p' "$FILE" | cat -n echo echo "== CRB monitor create/recovery block ==" sed -n '330,410p' "$FILE" | cat -n echo echo "== createNamespace retry block ==" rg -n "func \$.*\$ createNamespace|createNamespace\\(" "$FILE" # Print near the match(s) rg -n -C 3 "createNamespace\\(" "$FILE" | head -n 80 # Best-effort: show likely region around the namespace create sed -n '410,520p' "$FILE" | cat -n echo echo "== time.Sleep occurrences (context) ==" rg -n "time\\.Sleep\\(" "$FILE" -n rg -n -C 2 "time\\.Sleep\\(" "$FILE"

Repository: openshift/origin

Length of output: 15875

🏁 Script executed:

#!/bin/bash set -euo pipefail # 1) Find cleanup / usage of stored CRB names rg -n "rbacPrivilegedCRBName|rbacMonitorCRBName" pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go echo "----" # 2) Inspect retry implementation to see what it retries on FILE="pkg/monitortestlibrary/utility/retry.go" if [ -f "$FILE" ]; then sed -n '1,220p' "$FILE" | cat -n else echo "Missing expected retry file: $FILE" fi

Repository: openshift/origin

Length of output: 2514

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go" echo "== namespaceYaml definition ==" rg -n "namespaceYaml" "$FILE" echo echo "== Show surrounding YAML embedded/constructed for namespaceYaml ==" # Print up to ~80 lines around first declaration/assignment line=$(rg -n "namespaceYaml" "$FILE" | head -n 1 | cut -d: -f1) start=$((line-40)) end=$((line+140)) sed -n "${start},${end}p" "$FILE" | cat -n

Repository: openshift/origin

Length of output: 7997

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go" EMBED_DIR="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests" echo "== namespace.yaml ==" sed -n '1,120p' "$EMBED_DIR/namespace.yaml" | cat -n echo echo "== call site around createNamespace(ctx) and assigned namespaceName ==" # Find where createNamespace return used rg -n "createNamespace$ctx$" "$FILE" line=$(rg -n "namespace, err := i\.createNamespace$ctx$" "$FILE" | head -n 1 | cut -d: -f1) start=$((line-40)); end=$((line+80)) sed -n "${start},${end}p" "$FILE" | cat -n echo echo "== cleanup / deletion of namespaces - where i.namespaceName is used ==" rg -n "deleteNamespace\$|removeExistingMonitorNamespaces\\(|namespaceName" "$FILE" | head -n 80 echo echo "== StartCollection time.Sleep context (around 640-670) ==" sed -n '620,690p' "$FILE" | cat -n

Repository: openshift/origin

Length of output: 11181

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go" echo "== Cleanup function around namespace/CRB deletion ==" # show around first occurrence of func (i *InvariantInClusterDisruption) Cleanup line=$(rg -n "func \\(i \\*InvariantInClusterDisruption\$ Cleanup" "$FILE" | head -n 1 | cut -d: -f1) start=$((line-40)); end=$((line+120)) sed -n "${start},${end}p" "$FILE" | cat -n

Repository: openshift/origin

Length of output: 4352

Fix non-idempotent RBAC/namespace creation on transient retries (generatedName + prefix lookup)

createRBACPrivileged and createMonitorCRB retry ClusterRoleBindings().Create using CRB manifests with metadata.generateName, then recover via findClusterRoleBindingByPrefix (list all CRBs and return the first strings.HasPrefix). If the API creates the CRB but the client sees a retryable timeout/5xx, subsequent retries can create additional CRBs with different generated names; Cleanup deletes only rbacPrivilegedCRBName / rbacMonitorCRBName, leaving orphans.

createNamespace retries Namespaces().Create using metadata.generateName (manifests/namespace.yaml), with no deterministic name reconciliation; Cleanup deletes only i.namespaceName, so transient-create ambiguity can leave orphaned monitoring namespaces.

StartCollection uses unconditional time.Sleep(2 * time.Second); switch to a context-aware wait so cancellation/timeouts stop the test promptly.

Also applies to: 343-363, 401-411

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go` around lines 281 - 301, The createRBACPrivileged/createMonitorCRB and createNamespace flows are non-idempotent because they POST manifests with metadata.generateName and on retry may create duplicates; change these routines to use deterministic metadata.name (set a fixed name or derive one from i.testRunID) before Create so retries target the same resource and Cleanup can reliably delete it; alternatively switch Create to a server-side apply/patch (e.g., use Patch/Apply) or, after a transient create error, list by a unique label you add to the manifest (use a dedicated label key/value) instead of prefix-matching in findClusterRoleBindingByPrefix so you can unambiguously reconcile the single resource. Also replace the unconditional time.Sleep(2 * time.Second) in StartCollection with a context-aware wait (select on time.After vs ctx.Done) so the operation cancels promptly on context timeout/cancellation. Ensure references: createRBACPrivileged, createMonitorCRB, createNamespace, findClusterRoleBindingByPrefix, StartCollection.

openshift-merge-bot · 2026-06-04T14:53:52Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

mkowalski · 2026-06-05T08:58:29Z

/lgtm

mkowalski · 2026-06-05T08:58:38Z

/verified by CI

openshift-ci-robot · 2026-06-05T08:58:49Z

@mkowalski: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-05T08:58:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski, xueqzhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mkowalski,xueqzhan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-06-05T09:35:13Z

@xueqzhan: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-06-05T09:38:34Z

@xueqzhan: Jira Issue Verification Checks: Jira Issue OCPBUGS-85486
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-85486 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

…cd errors

Summary by CodeRabbit

New Features

Added automated retry with exponential backoff that detects transient API failures and retries setup operations only for those cases.

Retries now return the last transient error on overall timeout to surface the root cause.

Tests

Wrapped multiple test setup steps in the new retry logic and made "already exists" conditions non-fatal by reusing existing resources.

Added short, strategic delays around load-balancer and deployment steps to improve stability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-merge-robot · 2026-06-06T15:58:09Z

Fix included in release 5.0.0-0.nightly-2026-06-06-100407

Add client retry to critical monitor test setup to avoid transient et…

176a6bf

…cd errors

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 1, 2026

openshift-ci Bot requested review from deads2k and p0lyn0mial June 1, 2026 20:01

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2026

coderabbitai Bot requested changes Jun 1, 2026

View reviewed changes

Comment thread pkg/monitortestlibrary/utility/retry.go Outdated

Comment thread pkg/monitortests/network/disruptionpodnetwork/monitortest.go

Make retry more robust

2d262a4

coderabbitai Bot approved these changes Jun 1, 2026

View reviewed changes

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 1, 2026

mkowalski mentioned this pull request Jun 3, 2026

OCPBUGS-86922: Retry transient API errors in pod-network-avalibility preparation #31253

Closed

mkowalski mentioned this pull request Jun 3, 2026

OCPBUGS-86922: Add shared retry helper for monitor test API calls #31255

Closed

Add retry to incluster apiserver monitor test

28713ce

coderabbitai Bot requested changes Jun 3, 2026

View reviewed changes

Comment thread pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go Outdated

Fix cleanup to use actual resource names created during setup

71fc96b

coderabbitai Bot requested changes Jun 4, 2026

View reviewed changes

openshift-ci Bot assigned mkowalski Jun 5, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 5, 2026

openshift-merge-bot Bot merged commit b1497b7 into openshift:main Jun 5, 2026
21 checks passed

Conversation

xueqzhan commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented Jun 1, 2026

Uh oh!

openshift-ci-robot commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

openshift-ci-robot commented Jun 1, 2026

Summary by CodeRabbit

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

xueqzhan commented Jun 1, 2026

Uh oh!

openshift-ci Bot commented Jun 1, 2026

Uh oh!

xueqzhan commented Jun 1, 2026

Uh oh!

openshift-ci Bot commented Jun 1, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 1, 2026

Uh oh!

xueqzhan commented Jun 2, 2026

Uh oh!

xueqzhan commented Jun 2, 2026

Uh oh!

openshift-trt Bot commented Jun 2, 2026

Uh oh!

mkowalski commented Jun 3, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mkowalski commented Jun 4, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 4, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-merge-bot Bot commented Jun 4, 2026

Uh oh!

mkowalski commented Jun 5, 2026

Uh oh!

mkowalski commented Jun 5, 2026

Uh oh!

openshift-ci-robot commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

Uh oh!

openshift-ci-robot commented Jun 5, 2026

Summary by CodeRabbit

Uh oh!

openshift-merge-robot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

xueqzhan commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading