feat: add drain delay to graceful shutdown process #9211

Open
mwain wants to merge 2 commits into envoyproxy:main from mwain:add-drain-delay
@mwain mwain commented Jun 11, 2026

What type of PR is this?
feat

What this PR does / why we need it:

When a pod terminates, the shutdown-manager immediately calls healthcheck/fail. This fails the readiness probe and removes the pod from service endpoints within ~5s.

External load balancers that use health checks for deregistration need longer than that. On GKE, the L4 passthrough load balancer probes every 3 seconds and needs 2 failures, so deregistration takes 6 seconds or more. Until deregistration completes, the load balancer keeps sending new connections to the node, but kube-proxy no longer routes them to the pod, so they are reset or hang until they time out.

Kubernetes handles this window via KEP-1669. When no other ready endpoints remain, kube-proxy keeps routing to terminating pods that still pass their readiness probe. Failing readiness at the start of shutdown defeats this.

This adds a `drainDelay` field to `ShutdownConfig` which delays the `healthcheck/fail` call. During the delay the pod stays ready and keeps serving, so the load balancer has time to deregister it.

Which issue(s) this PR fixes:
Fixes #9210

Release Notes: Yes

mwain added 2 commits June 11, 2026 09:26
drainDelay defines how long the shutdown manager waits before failing
healthchecks (delaying call to healthcheck/fail).

The delay allows external load balancers time to deregister a
terminating pod while still serving traffic.

Signed-off-by: Michael Wain <michael@sanity.io>

When a pod terminates, the shutdown-manager immediately calls
`healthcheck/fail` to fail health checks. This causes the readiness
probe to fail and removes the pod from service endpoints.

External load balancers that use health checks for deregistration need
several failed probes before they stop sending new connections to the
pod. On GKE, the L4 passthrough load balancer probes every 3 seconds
and needs 2 failures (at the time of writing), so deregistration takes 6 seconds or more.
The pod readiness probe fails faster than that, so for a few seconds the load
balancer keeps sending connections that kube-proxy no longer routes.

Kubernetes has a mechanism for this window. When a service has no
other ready endpoints, kube-proxy keeps routing to terminating pods
that still pass their readiness probe (KEP-1669). This only works if
the pod keeps passing its probe while the load balancer deregisters
it. Calling `healthcheck/fail` at the start of shutdown breaks that.

This adds a drainDelay field to ShutdownConfig which delays the
`healthcheck/fail` call. During the delay the pod stays ready and keeps
serving, so the load balancer has time to deregister it. The
ready-timeout and termination grace period are extended by the delay.

Signed-off-by: Michael Wain <michael@sanity.io>
netlify Bot commented Jun 11, 2026

Deploy Preview for cerulean-figolla-1f9435 ready!

Name Link
🔨 Latest commit f854ec2
🔍 Latest deploy log https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/6a2a9fe992b4ec000749c198
😎 Deploy Preview https://deploy-preview-9211--cerulean-figolla-1f9435.netlify.app
codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 64.70588% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.88%. Comparing base (b96a2d3) to head (f854ec2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/cmd/envoy/shutdown_manager.go | 0.00% | 5 Missing ⚠️ |
| ...ternal/infrastructure/kubernetes/proxy/resource.go | 77.77% | 2 Missing and 2 partials ⚠️ |
| ...frastructure/kubernetes/proxy/resource_provider.go | 60.00% | 1 Missing and 1 partial ⚠️ |
| internal/cmd/envoy.go | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9211      +/-   ##
==========================================
- Coverage   74.90%   74.88%   -0.02%     
==========================================
  Files         252      252              
  Lines       40797    40822      +25     
==========================================
+ Hits        30559    30570      +11     
- Misses       8154     8165      +11     
- Partials     2084     2087       +3     

@zirain zirain left a comment

Member

LGTM, thanks!

mwain commented Jun 12, 2026

Author

@zirain Thanks for the speedy approval! What are the next steps for me?

zirain commented Jun 12, 2026

Member

Can you fix the conflict? We also need another approval from other maintainers.

Development

Successfully merging this pull request may close these issues.

Add a delay to the shutdown-manager before failing healthchecks

2 participants