OCPBUGS-86922: Retry transient API errors in pod-network-avalibility preparation#31253
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughThis PR adds exponential backoff and retry support for transient Kubernetes API errors encountered during network disruption test preparation. A new ChangesRetry Support for Network Disruption Test Setup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/monitortests/network/disruptionpodnetwork/monitortest.go (1)
117-120: ⚡ Quick winNarrow
*net.OpErrorretry classification to timeout/reset/refused-like cases.Treating all
*net.OpErroras transient can retry permanent misconfiguration errors unnecessarily. Prefer timeout/connection-reset/refused predicates (or utilnet helpers) to keep fail-fast behavior for non-transient failures.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/network/disruptionpodnetwork/monitortest.go` around lines 117 - 120, The current errors.As(err, &netErr) block treats any *net.OpError as transient; update it to only return true for timeout/connection-reset/refused-like cases by inspecting netErr (e.g., netErr.Timeout() or checking netErr.Err for syscall/temporary errors) instead of blanket acceptance. Specifically, replace the unconditional return in the errors.As(err, &netErr) branch with predicates such as netErr.Timeout(), utilnet.IsConnectionRefused/IsNetworkTimeout equivalents, or errors.Is(netErr.Err, syscall.ECONNRESET|syscall.ECONNREFUSED) (and/or check for underlying *os.SyscallError) so only those transient conditions make the function return true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/monitortests/network/disruptionpodnetwork/monitortest.go`:
- Around line 117-120: The current errors.As(err, &netErr) block treats any
*net.OpError as transient; update it to only return true for
timeout/connection-reset/refused-like cases by inspecting netErr (e.g.,
netErr.Timeout() or checking netErr.Err for syscall/temporary errors) instead of
blanket acceptance. Specifically, replace the unconditional return in the
errors.As(err, &netErr) branch with predicates such as netErr.Timeout(),
utilnet.IsConnectionRefused/IsNetworkTimeout equivalents, or
errors.Is(netErr.Err, syscall.ECONNRESET|syscall.ECONNREFUSED) (and/or check for
underlying *os.SyscallError) so only those transient conditions make the
function return true.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a59505cc-395a-476e-b4d3-fffe66a0bf6b
📒 Files selected for processing (1)
pkg/monitortests/network/disruptionpodnetwork/monitortest.go
…preparation The pod-network-avalibility monitor's PrepareCollection phase creates several Deployments, Services, and other resources via direct API calls with no retry logic. On jobs with heavy pre-test deployment activity (e.g. metal BGP-virt jobs installing CNV + FRR), the API server can be temporarily overloaded when the preparation runs, causing errors such as: - unexpected EOF - http2: client connection lost - etcdserver: request timed out When the preparation fails, poller pods are never created, causing the collection tests to also fail with 'no pods found'. This produces false regressions in Component Readiness (e.g. Sippy regression 41630). Analysis of passing vs failing runs of the same job shows identical cluster health profiles (~78 monitor test failures including 54 kubelet-container-restarts). The only difference is whether the preparation API call lands during or after the post-deployment load spike. This commit wraps all Kubernetes Create calls in PrepareCollection with a createWithRetry helper that uses exponential backoff (2s base, 5 retries, ~60s worst case) for transient API errors. Non-retryable errors still fail immediately. Signed-off-by: Mateusz Kowalski <mko@redhat.com>
|
@mkowalski: This pull request references Jira Issue OCPBUGS-86922, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a80291a to
295c051
Compare
|
/jira refresh |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-86922, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
There was a problem hiding this comment.
Pull request overview
This PR improves the reliability of the pod-network-avalibility monitor’s PrepareCollection phase by adding retry-with-backoff behavior around Kubernetes API Create() calls, reducing flake due to transient apiserver/network failures under heavy cluster load.
Changes:
- Add
isTransientAPIErrorto classify transient apiserver/network failures worth retrying. - Add generic
createWithRetryhelper using exponential backoff and apply it to resource creation inPrepareCollection.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| err := wait.ExponentialBackoff(retryBackoff, func() (bool, error) { | ||
| var createErr error | ||
| result, createErr = fn() | ||
| if createErr == nil { | ||
| return true, nil |
There was a problem hiding this comment.
@copilot, is this even relevant given this is running only on temporary clusters created for CI purposes?
There was a problem hiding this comment.
@copilot you there?
There was a problem hiding this comment.
The context.Background() usage isn't a regression — the original code already used context.Background() for all these Create() calls before this PR. This change only wraps them in retry logic without altering the context behavior.
That said, the broader point about context propagation: this code runs exclusively in CI on ephemeral clusters. If a job is aborted, the entire cluster gets destroyed — there's nothing to "gracefully cancel" into. The worst case is ~60s of retries against an API server that's about to be torn down anyway.
I could thread ctx through createWithRetry and switch to ExponentialBackoffWithContext, but it adds signature complexity for zero practical benefit in this context. Happy to do it if reviewers feel strongly, but I'd rather keep the helper simple.
🤖 This comment was posted by OpenClaw on behalf of @mkowalski.
|
@mkowalski: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/close Ken's PR #31249 is better |
|
@mkowalski: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-86922. The bug has been updated to no longer refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Problem
The
pod-network-avalibilitymonitor'sPrepareCollectionphase creates severalDeployments, Services, and other Kubernetes resources via direct API calls with no
retry logic. On jobs with heavy pre-test deployment activity (e.g. the metal
BGP-virt job that installs CNV + FRR right before the e2e suite), the API server
can be temporarily overloaded when the preparation runs, causing errors such as:
Post ".../deployments": unexpected EOFPost ".../deployments": http2: client connection lostetcdserver: request timed outWhen the preparation fails, poller pods are never created. The subsequent collection
tests then also fail with
no pods found for poller ..., producing false regressionsin Component Readiness.
Evidence
Sippy regression 41630
tracks this for
periodic-ci-openshift-ovn-kubernetes-release-5.0-periodics-e2e-metal-ipi-ovn-bgp-virt-ipv4.All 3 failing runs show the same pattern:
etcdserver: request timed outhttp2: client connection lostunexpected EOFThree different bare metal hosts — not a single bad machine.
Root cause: CNV installation finishes ~2 minutes before the e2e test starts.
At that point, 52 CNV pods + 52 OVN pods + 31 FRR pods are all starting
simultaneously (125 container start events in a 10-minute window). The API server
is under heavy load when the preparation phase tries to POST the poller Deployment.
Key evidence: Passing runs have the exact same background failure profile
(54
kubelet-container-restarts+ 15required-scc-annotation-checker= ~78 monitortest failures). The only difference is whether
pod-network-avalibility preparationhappens to succeed or not.
Fix
Wrap all Kubernetes
Create()calls inPrepareCollectionwith acreateWithRetryhelper that uses exponential backoff for transient API server errors:
IsServerTimeout,IsTimeout,IsTooManyRequests,IsServiceUnavailable,IsInternalError,io.ErrUnexpectedEOF,net.OpError,http2: client connection lost,connection reset by peer,etcdserver: request timed out,unexpected EOFNo changes to test logic, assertions, or behavior when the API server is healthy.
🤖 This PR was created by OpenClaw on behalf of @mkowalski.
Summary by CodeRabbit