[1/N] [HAProxy stability] - Wait for old HAProxy workers to exit before marking proxy drained by harshit-anyscale · Pull Request #63620 · ray-project/ray

harshit-anyscale · 2026-05-25T09:32:00Z

Summary

Fixes a race condition in HAProxyProxyActor.is_drained() that caused in-flight TCP connections to be RST'd (and thus 502s from upstream load balancers) whenever a proxy drained while older HAProxy worker processes were still serving long-running requests.

Detailed investigation: https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9

The bug

HAProxyProxyActor.is_drained() previously gated on:

drain time exceeded PROXY_MIN_DRAINING_PERIOD_S (default 30s)
the current worker's admin-socket stats reported is_system_idle

This was unsafe. Each HAProxy graceful reload spawns a new worker that owns the listening socket (via -x socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. Those in-flight connections are invisible to the admin socket (which is bound to the new worker), so is_system_idle could return True while old workers still held long-running connections.

Marking the proxy as drained while old workers were alive caused proxy_state.shutdown() → HAProxyApi.stop() → proc.kill() to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from upstream load balancers (target broke connection); replicas recorded HTTP 499 (uvicorn received an ASGI disconnect mid-request).

This affected any deployment with request durations exceeding PROXY_MIN_DRAINING_PERIOD_S — e.g., long-running inference, streaming responses, model cold-starts.

The fix

Add condition (3) to is_drained(): no OLD workers from _old_procs are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or hard-stop-after reaps it as a final safety valve), at which point subprocess.Process.returncode becomes set. The drain check uses this as a precise per-worker signal.

This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the PROXY_MIN_DRAINING_PERIOD_S floor that's still useful for upstream LB target-deregistration propagation. Previously, operators had to guess a PROXY_MIN_DRAINING_PERIOD_S value that exceeded the longest request duration across all deployments on the cluster; with this fix, the env var reverts to its name's actual semantic — a minimum floor, not a worst-case ceiling.

A new helper HAProxyApi.has_alive_old_procs() also prunes already-exited entries from _old_procs so the list stays bounded across long-lived proxy actors that perform many reloads.

The check order in is_drained was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin-socket query.

🤖 Generated with Claude Code

…ined `HAProxyProxyActor.is_drained()` previously gated on: 1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` 2. the current worker's admin-socket stats reported `is_system_idle` This was unsafe. Each graceful reload spawns a new HAProxy worker that owns the listening socket (via `-x` socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. Those in-flight connections are invisible to the admin socket (which is bound by the new worker), so `is_system_idle` could return True while old workers still held long-running connections like 60-120s long-runner requests. Marking the proxy as drained while old workers were alive caused `proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from ALB (target broke connection); replicas saw 499 (uvicorn received an ASGI disconnect mid-request). The fix adds condition (2'): no OLD workers from `_old_procs` are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or `hard-stop-after` reaps it as final safety valve), at which point `returncode` becomes set. The drain check uses this as a precise per-worker signal. This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S` floor that's still useful for ALB target-deregistration propagation. Previously, operators had to guess the right `PROXY_MIN_DRAINING_PERIOD_S` value to exceed the longest expected request duration; with this change, the env var's role reverts to "minimum drain time" (its name) rather than "maximum request time" (its de facto behavior). A new helper `HAProxyApi.has_alive_old_procs()` also prunes already- exited entries from `_old_procs` so the list stays bounded across long-lived proxy actors that perform many reloads. Check order in `is_drained` was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin socket query.

gemini-code-assist

Code Review

This pull request enhances the HAProxy draining logic by ensuring that old worker processes from prior graceful reloads have fully exited before a proxy is considered drained. It introduces the has_alive_old_procs method to track these processes and integrates it into the is_drained check. A review comment identifies a potential memory leak because the _old_procs list is only pruned during the draining state; it is recommended to also prune this list during reload operations to keep it bounded for long-lived proxies.

I am having trouble creating individual review comments. Click here to see my feedback.

python/ray/serve/_private/haproxy.py (1143)

The _old_procs list is only pruned within has_alive_old_procs, which is currently only called when the proxy is in a draining state. For long-lived proxy actors that undergo many graceful reloads without ever being drained, this list will grow indefinitely, leading to a slow memory leak of asyncio.subprocess.Process objects. Consider pruning this list during the reload or _graceful_reload operations as well to ensure it stays bounded regardless of whether the proxy is draining.

akyang-anyscale

could you add a test?

harshit-anyscale · 2026-05-29T08:36:10Z

could you add a test?

done.

Cover the drain logic added in this PR: is_drained() stays False while a soft-stopping old worker from a prior reload is still alive (even past the min drain period with an idle current worker) and flips True once it exits; it is also False before the min drain period elapses. HAProxyManager is an @ray.remote actor, so the imported name is an ActorClass rather than a plain type; the tests reach the underlying class via __ray_metadata__.modified_class and instantiate it with __new__ to exercise is_drained() without starting an actor. Addresses review request on ray-project#63620. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

has_alive_old_procs() was the only pruner, and it only runs from is_drained() (i.e. while draining). A healthy proxy that reloads but never drains would accumulate exited Process objects in _old_procs indefinitely. Extract the prune into _prune_old_procs() and also call it from _graceful_reload so the list stays bounded regardless of whether the proxy ever drains. Addresses review feedback on ray-project#63620. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Merging master (which now includes ray-project#63620) brought in a second _prune_old_procs that only filters _old_procs. As the later definition it shadowed the ring-buffer version added here, so retired log files were never deleted and the per-spawn file count grew unbounded again. Remove the filter-only duplicate; the ring version already filters _old_procs to alive workers AND retires/deletes their log files past the cap, and has_alive_old_procs calls it. Addresses Cursor review on ray-project#63621. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…re marking proxy drained (ray-project#63620) ### Summary Fixes a race condition in `HAProxyProxyActor.is_drained()` that caused in-flight TCP connections to be RST'd (and thus 502s from upstream load balancers) whenever a proxy drained while older HAProxy worker processes were still serving long-running requests. Detailed investigation: https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9 ### The bug `HAProxyProxyActor.is_drained()` previously gated on: 1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` (default 30s) 2. the current worker's admin-socket stats reported `is_system_idle` This was unsafe. Each HAProxy graceful reload spawns a new worker that owns the listening socket (via `-x` socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. **Those in-flight connections are invisible to the admin socket** (which is bound to the new worker), so `is_system_idle` could return True while old workers still held long-running connections. Marking the proxy as drained while old workers were alive caused `proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from upstream load balancers (target broke connection); replicas recorded HTTP 499 (uvicorn received an ASGI disconnect mid-request). This affected any deployment with request durations exceeding `PROXY_MIN_DRAINING_PERIOD_S` — e.g., long-running inference, streaming responses, model cold-starts. ### The fix Add condition (3) to `is_drained()`: no OLD workers from `_old_procs` are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or `hard-stop-after` reaps it as a final safety valve), at which point `subprocess.Process.returncode` becomes set. The drain check uses this as a precise per-worker signal. This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S` floor that's still useful for upstream LB target-deregistration propagation. Previously, operators had to guess a `PROXY_MIN_DRAINING_PERIOD_S` value that exceeded the longest request duration across all deployments on the cluster; with this fix, the env var reverts to its name's actual semantic — a minimum floor, not a worst-case ceiling. A new helper `HAProxyApi.has_alive_old_procs()` also prunes already-exited entries from `_old_procs` so the list stays bounded across long-lived proxy actors that perform many reloads. The check order in `is_drained` was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin-socket query. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

harshit-anyscale requested a review from a team as a code owner May 25, 2026 09:32

harshit-anyscale self-assigned this May 25, 2026

harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 9d04af3 to b7a5092 Compare May 25, 2026 09:34

harshit-anyscale changed the title ~~[serve] Wait for old HAProxy workers to exit before marking proxy drained~~ May 25, 2026

harshit-anyscale requested review from abrarsheikh and akyang-anyscale May 25, 2026 09:34

harshit-anyscale added the serve Ray Serve Related Issue label May 25, 2026

gemini-code-assist Bot reviewed May 25, 2026

View reviewed changes

harshit-anyscale changed the title ~~[serve][HAproxy-stability] Wait for old HAProxy workers to exit before marking proxy drained~~ May 25, 2026

harshit-anyscale added the go add ONLY when ready to merge, run all tests label May 25, 2026

akyang-anyscale approved these changes May 27, 2026

View reviewed changes

harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 45d6692 to 5f420e8 Compare May 29, 2026 08:35

harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 5f420e8 to 156e146 Compare May 29, 2026 11:10

eicherseiji reviewed Jun 1, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py

harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 91ac10a to fc26069 Compare June 2, 2026 04:32

harshit-anyscale requested a review from eicherseiji June 2, 2026 04:33

eicherseiji approved these changes Jun 2, 2026

View reviewed changes

abrarsheikh approved these changes Jun 2, 2026

View reviewed changes

abrarsheikh merged commit 276f535 into ray-project:master Jun 2, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[1/N] [HAProxy stability] - Wait for old HAProxy workers to exit before marking proxy drained#63620

[1/N] [HAProxy stability] - Wait for old HAProxy workers to exit before marking proxy drained#63620
abrarsheikh merged 3 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-drain-fix

harshit-anyscale commented May 25, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

akyang-anyscale left a comment

harshit-anyscale commented May 29, 2026

Uh oh!

Uh oh!

Labels

4 participants

Uh oh!

Conversation

harshit-anyscale commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The bug

The fix

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

python/ray/serve/_private/haproxy.py (1143)

akyang-anyscale left a comment

Choose a reason for hiding this comment

harshit-anyscale commented May 29, 2026

Uh oh!

Uh oh!

Labels

4 participants

harshit-anyscale commented May 25, 2026 •

edited

Loading