OCPBUGS-85486: Add client retry to critical monitor test setup to avoid transient et…#31249
@xueqzhan: This pull request references Jira Issue OCPBUGS-85486, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. |
Walkthrough
Adds a transient Kubernetes API error classifier and an exponential-backoff retry helper, then applies retries to test setup resource creation and cluster queries in multiple disruption tests and the kube-apiserver disruption test.
Changes
API Resilience Through Exponential Backoff Retries
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 13 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (13 passed) |
|
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/monitortestlibrary/utility/retry.go (1)
14-30: ⚡ Quick win
Harden `IsTransientAPIError` against `nil` and wrapped errors.
Two robustness gaps in this exported helper:
- Line 26 calls `err.Error()` unconditionally. If a future caller passes a `nil` error, this panics (the `apierrors.*` helpers above return `false` for `nil`, so execution falls through to the `strings.Contains`).
- The `*apierrors.StatusError` type assertion on Line 20 does not unwrap. Errors wrapped via `fmt.Errorf("...: %w", err)` (e.g. from jig helpers) will skip the 502/504 classification. Prefer `errors.As`.
♻️ Proposed hardening
 func IsTransientAPIError(err error) bool {
+	if err == nil {
+		return false
+	}
 	if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) || apierrors.IsServiceUnavailable(err) || apierrors.IsInternalError(err) {
 		return true
 	}
-	if statusErr, ok := err.(*apierrors.StatusError); ok {
+	var statusErr *apierrors.StatusError
+	if errors.As(err, &statusErr) {
 		code := statusErr.Status().Code
 		if code == http.StatusGatewayTimeout || code == http.StatusBadGateway {
 			return true
 		}
 	}
Add `"errors"` to the import block.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortestlibrary/utility/retry.go` around lines 14 - 30, IsTransientAPIError currently can panic on a nil err and fails to detect wrapped *apierrors.StatusError; fix by importing "errors", add an early guard "if err == nil { return false }", replace the direct type assertion with errors.As to unwrap into a variable of type *apierrors.StatusError (e.g. var statusErr *apierrors.StatusError; if errors.As(err, &statusErr) { ... }) and keep the final strings.Contains check safe by relying on the nil guard (so you can still call err.Error() after the nil check).
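The nil guard and the `errors.As` unwrapping described above can be sketched with the standard library alone. This is an illustration, not the PR's code: `statusError` is a hypothetical stand-in for `*apierrors.StatusError`, `isTransient` is a made-up name, and the substring in the final check is an assumed placeholder, since the review does not quote the real one.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// statusError is a hypothetical stand-in for *apierrors.StatusError so that
// this sketch compiles without Kubernetes dependencies.
type statusError struct{ code int }

func (e *statusError) Error() string { return fmt.Sprintf("status %d", e.code) }

// isTransient follows the shape of the proposed hardening: a nil guard first,
// then errors.As so that wrapped status errors are still classified.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	var se *statusError
	if errors.As(err, &se) {
		if se.code == http.StatusBadGateway || se.code == http.StatusGatewayTimeout {
			return true
		}
	}
	// Assumed placeholder substring; safe to call err.Error() after the guard.
	return strings.Contains(err.Error(), "request timed out")
}

// classifyExamples classifies a nil error, a wrapped 504 and a wrapped 404,
// in that order.
func classifyExamples() [3]bool {
	wrapped504 := fmt.Errorf("creating service: %w", &statusError{code: http.StatusGatewayTimeout})
	wrapped404 := fmt.Errorf("creating service: %w", &statusError{code: http.StatusNotFound})
	return [3]bool{isTransient(nil), isTransient(wrapped504), isTransient(wrapped404)}
}
```

With a plain type assertion instead of `errors.As`, the wrapped 504 would not match, because the outer error produced by `fmt.Errorf` is not itself a `*statusError`.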
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortestlibrary/utility/retry.go`:
- Around line 40-50: The retry loop in the function using
wait.ExponentialBackoffWithContext currently swallows transient errors
(IsTransientAPIError) by returning (false, nil), which causes
wait.ErrWaitTimeout to be returned and loses the underlying last transient
error; remove any use of wait.Interrupted and instead track the lastErr variable
inside the closure (set lastErr = err whenever IsTransientAPIError(err) or any
non-nil err occurs), then after ExponentialBackoffWithContext returns, if the
returned error is wait.ErrWaitTimeout return lastErr (so callers see the final
transient error), otherwise return the original err (allowing ctx.Err to
propagate); reference the existing closure around
wait.ExponentialBackoffWithContext and the IsTransientAPIError check to locate
where to add lastErr handling.
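A minimal sketch of the `lastErr` idea, assuming a hand-rolled loop rather than `wait.ExponentialBackoffWithContext`. The names `retryKeepLastErr` and `errTransient` are hypothetical and exist only for this illustration. The point is that the caller sees the final transient error when attempts run out, and `ctx.Err()` when the context ends first.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient marks errors that this sketch treats as retryable (hypothetical).
var errTransient = errors.New("transient")

// retryKeepLastErr retries fn with doubling delays. When attempts run out it
// returns the last transient error rather than a generic timeout; when ctx is
// done during a wait it returns ctx.Err().
func retryKeepLastErr(ctx context.Context, attempts int, initial time.Duration, fn func() error) error {
	var lastErr error
	delay := initial
	for i := 0; i < attempts; i++ {
		err := fn()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errTransient) {
			return err // permanent failure: stop immediately
		}
		lastErr = err
		if i == attempts-1 {
			break // no point waiting after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return lastErr
}

// exhaustedExample runs three attempts that always fail transiently and
// returns the text of the error handed back to the caller.
func exhaustedExample() string {
	calls := 0
	err := retryKeepLastErr(context.Background(), 3, time.Millisecond, func() error {
		calls++
		return fmt.Errorf("attempt %d: %w", calls, errTransient)
	})
	return err.Error()
}
```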
In `@pkg/monitortests/network/disruptionpodnetwork/monitortest.go`:
- Around line 113-120: The retry wrapper is calling non-idempotent Create
(kubeClient.CoreV1().Namespaces().Create) so a transient timeout followed by
AlreadyExists will abort retries and/or leave actualNamespace nil; change the
retried closure used by RetryWithExponentialBackoff to handle
metav1.IsAlreadyExists by performing a Get fallback and assigning the fetched
object to actualNamespace (so actualNamespace is never nil), or factor this
pattern into a small helper (e.g., CreateOrGetNamespace/CreateOrGet) and use it
for other Create call sites (role binding, poller/target deployments/services,
jig.CreateTCPService) so Create + IsAlreadyExists => Get is used inside the
retry loop to make the operation idempotent.
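The Create + IsAlreadyExists => Get pattern can be illustrated against an in-memory fake. Everything here is an assumption made for the sketch: `fakeNamespaces`, `createOrGet` and the sentinel errors are hypothetical and stand in for the real client and the `apierrors` helpers. Backoff delays are omitted to keep it short.

```go
package main

import "errors"

var (
	errTimeout       = errors.New("timeout")        // stands in for a transient API error
	errAlreadyExists = errors.New("already exists") // stands in for an AlreadyExists API error
	errNotFound      = errors.New("not found")
)

// fakeNamespaces is a hypothetical in-memory client. Its first Create persists
// the object but reports a timeout, which is the ambiguous case in the review.
type fakeNamespaces struct {
	stored    map[string]string // name -> UID
	failFirst bool
}

func (f *fakeNamespaces) Create(name string) (string, error) {
	if _, ok := f.stored[name]; ok {
		return "", errAlreadyExists
	}
	f.stored[name] = "uid-" + name
	if f.failFirst {
		f.failFirst = false
		return "", errTimeout
	}
	return f.stored[name], nil
}

func (f *fakeNamespaces) Get(name string) (string, error) {
	uid, ok := f.stored[name]
	if !ok {
		return "", errNotFound
	}
	return uid, nil
}

// createOrGet makes a single attempt idempotent: AlreadyExists falls back to
// Get, so the caller always receives the object when err is nil.
func createOrGet(c *fakeNamespaces, name string) (string, error) {
	uid, err := c.Create(name)
	if errors.Is(err, errAlreadyExists) {
		return c.Get(name)
	}
	return uid, err
}

// retryCreateOrGet repeats createOrGet while the error is the transient one.
func retryCreateOrGet(c *fakeNamespaces, name string, attempts int) (string, error) {
	var uid string
	var err error
	for i := 0; i < attempts; i++ {
		if uid, err = createOrGet(c, name); !errors.Is(err, errTimeout) {
			return uid, err
		}
	}
	return uid, err
}
```

In this fake, the second attempt hits `errAlreadyExists` and the fallback `Get` returns the object stored by the first attempt, so the result is never empty on success. This only works when the object has a stable name.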
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e9a90ba5-8703-46a5-a5f2-5ca877127370
📒 Files selected for processing (3)
- pkg/monitortestlibrary/utility/retry.go
- pkg/monitortests/network/disruptionpodnetwork/monitortest.go
- pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
|
/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn-upgrade |
|
@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b6d8fdf0-5df9-11f1-8fdc-ed229b6091f4-0 |
|
/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn |
|
@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf9bba90-5df9-11f1-8e88-fb87807b8c06-0 |
|
Scheduling required tests: |
|
/retest required |
|
/retest-required |
|
Job Failure Risk Analysis for sha: 2d262a4
|
|
Thanks for working on this @xueqzhan — this is essentially the same fix I opened in #31253 but your approach is more thorough (handling One thing I noticed: There are 13 unprotected API calls in its Create calls:
List/Get calls: I checked all other monitor tests in the repo — Happy to add these changes to your PR or open a follow-up — let me know what you prefer. 🤖 This comment was posted by OpenClaw on behalf of @mkowalski. |
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go (1)
618-618: 💤 Low value
Consider documenting the rationale for these fixed sleeps.
Since `createDeploymentAndWaitToRollout` already watches for deployment availability before returning, these 2-second sleeps suggest an undocumented race condition. Fixed delays are fragile under varying load. Consider adding a brief comment explaining what timing issue these address, or replacing with proper synchronization if there's a specific condition to wait for.
Also applies to: 623-623
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go` at line 618, There are two fixed 2s sleeps after createDeploymentAndWaitToRollout that are undocumented and fragile; either replace them with explicit synchronization (e.g., poll/watch the specific condition you actually need such as Pod readiness, Service endpoints, or deployment rollout status) or add a short comment explaining the precise race you're working around and why a fixed 2s delay is required; locate the sleeps (time.Sleep(2 * time.Second)) near createDeploymentAndWaitToRollout in monitortest.go and either implement the proper wait (polling the Deployment/Pod/Endpoint conditions or using a watcher) or document the timing rationale inline.
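As an illustration of the "explicit synchronization" option, a fixed sleep can be replaced by polling the condition that is actually needed. The sketch below is a hypothetical stdlib-only helper, not code from this PR. `pollUntil` and the fake readiness check are invented names, and which condition the real test should wait for is not stated in the review.

```go
package main

import (
	"context"
	"time"
)

// pollUntil calls cond immediately and then once per interval until it
// reports true, returns an error, or ctx is done.
func pollUntil(ctx context.Context, interval time.Duration, cond func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// readyAfterExample polls a fake readiness check that turns true on the third
// call and returns how many calls were needed.
func readyAfterExample() (int, error) {
	calls := 0
	err := pollUntil(context.Background(), time.Millisecond, func() (bool, error) {
		calls++
		return calls >= 3, nil
	})
	return calls, err
}
```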
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Line 358: The assignment rbacMonitorCRBName = rbacMonitorRBObj.Name wrongly
overwrites the ClusterRoleBinding name with the RoleBinding name; remove this
line so rbacMonitorCRBName retains the correct value set when creating the
ClusterRoleBinding (see symbols createMonitorCRB, createMonitorRB,
rbacMonitorCRBName, rbacMonitorRBObj and Cleanup) and rely on namespace deletion
to remove the RoleBinding instead of assigning its name to the CRB variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 454fdbc3-70ac-4220-a8f2-ee254b92b486
📒 Files selected for processing (1)
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
|
/test verify-deps |
|
Scheduling required tests: |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go (2)
650-655: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Make the two `StartCollection` waits respect `ctx` cancellation.
`StartCollection` takes `ctx` but performs two unconditional `time.Sleep(2 * time.Second)` calls (before `createLocalhostDeployment` and before `createInternalLBDeployment`), so cancellation/timeout can be delayed by up to ~4s.
♻️ Proposed fix
-	time.Sleep(2 * time.Second)
+	select {
+	case <-time.After(2 * time.Second):
+	case <-ctx.Done():
+		return ctx.Err()
+	}
 	err = i.createLocalhostDeployment(ctx)
 	if err != nil {
 		return fmt.Errorf("error creating localhost: %v", err)
 	}
-	time.Sleep(2 * time.Second)
+	select {
+	case <-time.After(2 * time.Second):
+	case <-ctx.Done():
+		return ctx.Err()
+	}
 	err = i.createInternalLBDeployment(ctx, apiIntHost, apiIntPort)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go` around lines 650 - 655, The two unconditional time.Sleep(2 * time.Second) calls in StartCollection block should be replaced with context-aware waits so ctx cancellation/timeout is respected; instead of sleeping directly before calling createLocalhostDeployment and before createInternalLBDeployment, replace each sleep with a select that waits on time.After(2*time.Second) and on ctx.Done(), returning ctx.Err() if cancelled. Update the StartCollection function where the sleeps occur (surrounding createLocalhostDeployment and createInternalLBDeployment) to perform this select-based wait (or call a small helper waitWithContext(ctx, 2*time.Second)) so we exit promptly on ctx cancellation.
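The small helper mentioned above, `waitWithContext(ctx, 2*time.Second)`, could look roughly like this. Only the helper name comes from the review. The body is one possible implementation, and `cancelledWaitExample` is a hypothetical demonstration function.

```go
package main

import (
	"context"
	"time"
)

// waitWithContext blocks for d or until ctx is done, whichever comes first.
// It returns nil after a full wait and ctx.Err() on cancellation or timeout.
func waitWithContext(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer if ctx wins the race
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// cancelledWaitExample shows that an already-cancelled context makes the wait
// return right away instead of blocking for the full two seconds.
func cancelledWaitExample() (time.Duration, error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	start := time.Now()
	err := waitWithContext(ctx, 2*time.Second)
	return time.Since(start), err
}
```

Using `time.NewTimer` with a deferred `Stop` rather than `time.After` is a design choice for this sketch: it avoids leaving a pending timer behind when the context is cancelled early.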
421-426: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Fix `createNamespace` ambiguous-success recovery when retrying `Namespaces().Create`
- `pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/namespace.yaml` uses `metadata.generateName` (not a stable name), so a retry after a post-persist/timeout can create another Namespace with a different generated name.
- `createNamespace` in `pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go` retries `client.Create(...)` and returns the next error verbatim, with no `AlreadyExists`/`Get` reconciliation (or label-based recovery for `generateName`), so setup can still fail and/or leave behind extra labeled namespaces.
- Reconcile explicitly after a failed create: make the operation idempotent (stable name) or, for `generateName`, reconcile by a unique per-test label selector (then `Get` the matching Namespace) instead of blindly retrying raw `Create`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go` around lines 421 - 426, createNamespace currently retries client.Create(...) and can return ambiguous failures because the manifest uses metadata.generateName; instead make the create operation idempotent by reconciling on failure: in the RetryWithExponentialBackoff closure (in function createNamespace in monitortest.go) if client.Create returns an error, do not just return it—query the API for a matching Namespace using a unique per-test label selector (add/ensure a deterministic label on the Namespace creation path so you can List/Get by that label) and if you find exactly one matching Namespace set actualNamespace to that object and return nil; only return the original error if no matching Namespace is found or multiple ambiguous matches exist. Ensure this logic references client.Namespaces().Create and client.Namespaces().List/Get and uses the same test-unique label you apply in the namespace manifest.
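The label-based reconciliation can be sketched against an in-memory fake that mimics `generateName`. All names here (`fakeAPI`, `createReconciled`, the label key `example.test/run-id`) are hypothetical and chosen only for the illustration. The real code would use the Kubernetes client's Create and List calls with a label selector.

```go
package main

import (
	"errors"
	"fmt"
)

var errTimeout = errors.New("timeout") // stands in for a transient API error

type namespace struct {
	Name   string
	Labels map[string]string
}

// fakeAPI is a hypothetical in-memory server that mimics generateName: every
// Create stores a new object under a fresh suffix.
type fakeAPI struct {
	items     []namespace
	failFirst bool // the first Create persists, then reports a timeout
}

func (f *fakeAPI) Create(prefix string, labels map[string]string) (namespace, error) {
	ns := namespace{Name: fmt.Sprintf("%s%d", prefix, len(f.items)), Labels: labels}
	f.items = append(f.items, ns)
	if f.failFirst {
		f.failFirst = false
		return namespace{}, errTimeout
	}
	return ns, nil
}

func (f *fakeAPI) ListByLabel(key, value string) []namespace {
	var out []namespace
	for _, ns := range f.items {
		if ns.Labels[key] == value {
			out = append(out, ns)
		}
	}
	return out
}

// createReconciled creates once and, on error, looks for exactly one object
// carrying the per-run label before deciding whether the create succeeded.
func createReconciled(api *fakeAPI, prefix, runID string) (namespace, error) {
	const key = "example.test/run-id" // hypothetical label key
	ns, err := api.Create(prefix, map[string]string{key: runID})
	if err == nil {
		return ns, nil
	}
	if found := api.ListByLabel(key, runID); len(found) == 1 {
		return found[0], nil
	}
	// Zero or several matches: report the original error to the caller.
	return namespace{}, err
}
```

In the fake, the timed-out create is recovered by the label lookup, so no second object is created. A blind retry of `Create` would instead have stored a second namespace with a different suffix.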
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go`:
- Around line 281-301: The createRBACPrivileged/createMonitorCRB and
createNamespace flows are non-idempotent because they POST manifests with
metadata.generateName and on retry may create duplicates; change these routines
to use deterministic metadata.name (set a fixed name or derive one from
i.testRunID) before Create so retries target the same resource and Cleanup can
reliably delete it; alternatively switch Create to a server-side apply/patch
(e.g., use Patch/Apply) or, after a transient create error, list by a unique
label you add to the manifest (use a dedicated label key/value) instead of
prefix-matching in findClusterRoleBindingByPrefix so you can unambiguously
reconcile the single resource. Also replace the unconditional time.Sleep(2 *
time.Second) in StartCollection with a context-aware wait (select on time.After
vs ctx.Done) so the operation cancels promptly on context timeout/cancellation.
Ensure references: createRBACPrivileged, createMonitorCRB, createNamespace,
findClusterRoleBindingByPrefix, StartCollection.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 27c0d11f-eb00-4b9a-a37a-644fdb38c97d
📒 Files selected for processing (3)
- pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
- pkg/monitortests/network/disruptionpodnetwork/monitortest.go
- pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
💤 Files with no reviewable changes (2)
- pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
- pkg/monitortests/network/disruptionpodnetwork/monitortest.go
 	client := i.kubeClient.RbacV1().ClusterRoleBindings()
-	obj, err := client.Create(ctx, rbacPrivilegedObj, metav1.CreateOptions{})
-	if err != nil && !apierrors.IsAlreadyExists(err) {
+	var created *rbacv1.ClusterRoleBinding
+	if err := utility.RetryWithExponentialBackoff(ctx, func() error {
+		var createErr error
+		created, createErr = client.Create(ctx, rbacPrivilegedObj, metav1.CreateOptions{})
+		if apierrors.IsAlreadyExists(createErr) {
+			return nil
+		}
+		return createErr
+	}); err != nil {
 		return fmt.Errorf("error creating privileged SCC CRB: %v", err)
 	}
-	rbacPrivilegedCRBName = obj.Name
+	if created != nil {
+		rbacPrivilegedCRBName = created.Name
+	} else {
+		name, err := findClusterRoleBindingByPrefix(ctx, client, rbacPrivilegedObj.GenerateName)
+		if err != nil {
+			return fmt.Errorf("error finding existing privileged SCC CRB: %v", err)
+		}
+		rbacPrivilegedCRBName = name
+	}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Embedded CRB manifests =="
sed -n '1,80p' pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/crb-privileged.yaml
echo
sed -n '1,80p' pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests/crb-monitor.yaml
echo
echo "== Retried CRB create sites =="
rg -n -C2 'ClusterRoleBindings\(\)\.Create|RetryWithExponentialBackoff|findClusterRoleBindingByPrefix' \
  pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
Repository: openshift/origin
Length of output: 5137
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"
echo "== findClusterRoleBindingByPrefix implementation =="
sed -n '395,450p' "$FILE" | cat -n
echo
echo "== CRB privileged create/recovery block =="
sed -n '260,330p' "$FILE" | cat -n
echo
echo "== CRB monitor create/recovery block =="
sed -n '330,410p' "$FILE" | cat -n
echo
echo "== createNamespace retry block =="
rg -n "func \\(.*\\) createNamespace|createNamespace\\(" "$FILE"
# Print near the match(s)
rg -n -C 3 "createNamespace\\(" "$FILE" | head -n 80
# Best-effort: show likely region around the namespace create
sed -n '410,520p' "$FILE" | cat -n
echo
echo "== time.Sleep occurrences (context) =="
rg -n "time\\.Sleep\\(" "$FILE" -n
rg -n -C 2 "time\\.Sleep\\(" "$FILE"
Repository: openshift/origin
Length of output: 15875
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find cleanup / usage of stored CRB names
rg -n "rbacPrivilegedCRBName|rbacMonitorCRBName" pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go
echo "----"
# 2) Inspect retry implementation to see what it retries on
FILE="pkg/monitortestlibrary/utility/retry.go"
if [ -f "$FILE" ]; then
  sed -n '1,220p' "$FILE" | cat -n
else
  echo "Missing expected retry file: $FILE"
fi
Repository: openshift/origin
Length of output: 2514
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"
echo "== namespaceYaml definition =="
rg -n "namespaceYaml" "$FILE"
echo
echo "== Show surrounding YAML embedded/constructed for namespaceYaml =="
# Print up to ~80 lines around first declaration/assignment
line=$(rg -n "namespaceYaml" "$FILE" | head -n 1 | cut -d: -f1)
start=$((line-40))
end=$((line+140))
sed -n "${start},${end}p" "$FILE" | cat -n
Repository: openshift/origin
Length of output: 7997
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"
EMBED_DIR="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/manifests"
echo "== namespace.yaml =="
sed -n '1,120p' "$EMBED_DIR/namespace.yaml" | cat -n
echo
echo "== call site around createNamespace(ctx) and assigned namespaceName =="
# Find where createNamespace return used
rg -n "createNamespace\(ctx\)" "$FILE"
line=$(rg -n "namespace, err := i\.createNamespace\(ctx\)" "$FILE" | head -n 1 | cut -d: -f1)
start=$((line-40)); end=$((line+80))
sed -n "${start},${end}p" "$FILE" | cat -n
echo
echo "== cleanup / deletion of namespaces - where i.namespaceName is used =="
rg -n "deleteNamespace\\(|removeExistingMonitorNamespaces\\(|namespaceName" "$FILE" | head -n 80
echo
echo "== StartCollection time.Sleep context (around 640-670) =="
sed -n '620,690p' "$FILE" | cat -n
Repository: openshift/origin
Length of output: 11181
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="pkg/monitortests/kubeapiserver/disruptioninclusterapiserver/monitortest.go"
echo "== Cleanup function around namespace/CRB deletion =="
# show around first occurrence of func (i *InvariantInClusterDisruption) Cleanup
line=$(rg -n "func \\(i \\*InvariantInClusterDisruption\\) Cleanup" "$FILE" | head -n 1 | cut -d: -f1)
start=$((line-40)); end=$((line+120))
sed -n "${start},${end}p" "$FILE" | cat -n
Repository: openshift/origin
Length of output: 4352
Fix non-idempotent RBAC/namespace creation on transient retries (`generateName` + prefix lookup)
- `createRBACPrivileged` and `createMonitorCRB` retry `ClusterRoleBindings().Create` using CRB manifests with `metadata.generateName`, then recover via `findClusterRoleBindingByPrefix` (list all CRBs and return the first `strings.HasPrefix` match). If the API creates the CRB but the client sees a retryable timeout/5xx, subsequent retries can create additional CRBs with different generated names; `Cleanup` deletes only `rbacPrivilegedCRBName`/`rbacMonitorCRBName`, leaving orphans.
- `createNamespace` retries `Namespaces().Create` using `metadata.generateName` (`manifests/namespace.yaml`), with no deterministic name reconciliation; `Cleanup` deletes only `i.namespaceName`, so transient-create ambiguity can leave orphaned monitoring namespaces.
- `StartCollection` uses unconditional `time.Sleep(2 * time.Second)`; switch to a context-aware wait so cancellation/timeouts stop the test promptly.
Also applies to: 343-363, 401-411
|
Scheduling required tests: |
|
/lgtm |
|
/verified by CI |
|
@mkowalski: This PR has been marked as verified by |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mkowalski, xueqzhan |
|
@xueqzhan: all tests passed! |
|
@xueqzhan: Jira Issue Verification Checks: Jira Issue OCPBUGS-85486
Jira Issue OCPBUGS-85486 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 |
|
Fix included in release 5.0.0-0.nightly-2026-06-06-100407 |
…cd errors
Summary by CodeRabbit
New Features
Tests