feat: add drain delay to graceful shutdown process by mwain · Pull Request #9211 · envoyproxy/gateway

mwain · 2026-06-11T11:45:41Z

What type of PR is this?
feat

What this PR does / why we need it:

When a pod terminates, the shutdown-manager immediately calls healthcheck/fail. This fails the readiness probe and removes the pod from service endpoints within ~5s.

External load balancers that use health checks for deregistration need longer than that. On GKE, the L4 passthrough load balancer probes every 3 seconds and needs 2 failures, so deregistration takes 6 seconds or more. In between, the load balancer keeps sending new connections that kube-proxy no longer routes. They are reset or hang until they time out.

Kubernetes handles this window via KEP-1669. When no other ready endpoints remain, kube-proxy keeps routing to terminating pods that still pass their readiness probe. Failing readiness at the start of shutdown defeats this.

This adds a drainDelay field to ShutdownConfig which delays the healthcheck/fail call.
During the delay the pod stays ready and keeps serving, so the load balancer has time to deregister it.

Which issue(s) this PR fixes:
Fixes #9210

Release Notes: Yes

drainDelay defines how long the shutdown manager waits before failing healthchecks (delaying call to healthcheck/fail). The delay allows external load balancers time to deregister a terminating pod while still serving traffic.

Signed-off-by: Michael Wain <michael@sanity.io>

When a pod terminates, the shutdown-manager immediately calls `healthcheck/fail` to fail health checks. This causes the readiness probe to fail and removes the pod from service endpoints.

External load balancers that use health checks for deregistration need several failed probes before they stop sending new connections to the pod. On GKE, the L4 passthrough load balancer probes every 3 seconds and needs 2 failures (at the time of writing), so deregistration takes 6 seconds or more. The pod readiness probe fails faster than that, so for a few seconds the load balancer keeps sending connections that kube-proxy no longer routes.

Kubernetes has a mechanism for this window. When a service has no other ready endpoints, kube-proxy keeps routing to terminating pods that still pass their readiness probe (KEP-1669). This only works if the pod keeps passing its probe while the load balancer deregisters it. Calling `healthcheck/fail` at the start of shutdown breaks that.

This adds a drainDelay field to ShutdownConfig which delays the `healthcheck/fail` call. During the delay the pod stays ready and keeps serving, so the load balancer has time to deregister it. The ready-timeout and termination grace period are extended by the delay.

Signed-off-by: Michael Wain <michael@sanity.io>

netlify · 2026-06-11T12:30:20Z

✅ Deploy Preview for cerulean-figolla-1f9435 ready!

Name	Link
🔨 Latest commit	`f854ec2`
🔍 Latest deploy log	https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/6a2a9fe992b4ec000749c198
😎 Deploy Preview	https://deploy-preview-9211--cerulean-figolla-1f9435.netlify.app

codecov · 2026-06-11T13:13:59Z

Codecov Report

❌ Patch coverage is 64.70588% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.88%. Comparing base (b96a2d3) to head (f854ec2).

Files with missing lines	Patch %	Lines
internal/cmd/envoy/shutdown_manager.go	0.00%	5 Missing ⚠️
...ternal/infrastructure/kubernetes/proxy/resource.go	77.77%	2 Missing and 2 partials ⚠️
...frastructure/kubernetes/proxy/resource_provider.go	60.00%	1 Missing and 1 partial ⚠️
internal/cmd/envoy.go	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9211      +/-   ##
==========================================
- Coverage   74.90%   74.88%   -0.02%     
==========================================
  Files         252      252              
  Lines       40797    40822      +25     
==========================================
+ Hits        30559    30570      +11     
- Misses       8154     8165      +11     
- Partials     2084     2087       +3


zirain

LGTM, thanks!

mwain · 2026-06-12T11:24:39Z

@zirain Thanks for the speedy approval! What are the next steps for me?

zirain · 2026-06-12T11:25:40Z

Can you fix the conflict? We also need another approval from another maintainer.

mwain added 2 commits June 11, 2026 09:26

mwain requested a review from a team as a code owner June 11, 2026 11:45

mwain mentioned this pull request Jun 11, 2026

Add a delay to the shutdown-manager before failing healthchecks #9210

Open

zirain approved these changes Jun 12, 2026

mwain wants to merge 2 commits into envoyproxy:main from mwain:add-drain-delay
