[1/N] [HAProxy stability] - Wait for old HAProxy workers to exit before marking proxy drained#63620
Conversation
…ined `HAProxyProxyActor.is_drained()` previously gated on: 1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` 2. the current worker's admin-socket stats reported `is_system_idle` This was unsafe. Each graceful reload spawns a new HAProxy worker that owns the listening socket (via `-x` socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. Those in-flight connections are invisible to the admin socket (which is bound by the new worker), so `is_system_idle` could return True while old workers still held long-running connections like 60-120s long-runner requests. Marking the proxy as drained while old workers were alive caused `proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from ALB (target broke connection); replicas saw 499 (uvicorn received an ASGI disconnect mid-request). The fix adds condition (2'): no OLD workers from `_old_procs` are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or `hard-stop-after` reaps it as final safety valve), at which point `returncode` becomes set. The drain check uses this as a precise per-worker signal. This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S` floor that's still useful for ALB target-deregistration propagation. Previously, operators had to guess the right `PROXY_MIN_DRAINING_PERIOD_S` value to exceed the longest expected request duration; with this change, the env var's role reverts to "minimum drain time" (its name) rather than "maximum request time" (its de facto behavior). A new helper `HAProxyApi.has_alive_old_procs()` also prunes already- exited entries from `_old_procs` so the list stays bounded across long-lived proxy actors that perform many reloads. Check order in `is_drained` was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin socket query.
9d04af3 to
b7a5092
Compare
There was a problem hiding this comment.
Code Review
This pull request enhances the HAProxy draining logic by ensuring that old worker processes from prior graceful reloads have fully exited before a proxy is considered drained. It introduces the has_alive_old_procs method to track these processes and integrates it into the is_drained check. A review comment identifies a potential memory leak because the _old_procs list is only pruned during the draining state; it is recommended to also prune this list during reload operations to keep it bounded for long-lived proxies.
I am having trouble creating individual review comments. Click here to see my feedback.
python/ray/serve/_private/haproxy.py (1143)
The _old_procs list is only pruned within has_alive_old_procs, which is currently only called when the proxy is in a draining state. For long-lived proxy actors that undergo many graceful reloads without ever being drained, this list will grow indefinitely, leading to a slow memory leak of asyncio.subprocess.Process objects. Consider pruning this list during the reload or _graceful_reload operations as well to ensure it stays bounded regardless of whether the proxy is draining.
akyang-anyscale
left a comment
There was a problem hiding this comment.
could you add a test?
45d6692 to
5f420e8
Compare
done. |
Cover the drain logic added in this PR: is_drained() stays False while a soft-stopping old worker from a prior reload is still alive (even past the min drain period with an idle current worker) and flips True once it exits; it is also False before the min drain period elapses. HAProxyManager is an @ray.remote actor, so the imported name is an ActorClass rather than a plain type; the tests reach the underlying class via __ray_metadata__.modified_class and instantiate it with __new__ to exercise is_drained() without starting an actor. Addresses review request on ray-project#63620. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5f420e8 to
156e146
Compare
has_alive_old_procs() was the only pruner, and it only runs from is_drained() (i.e. while draining). A healthy proxy that reloads but never drains would accumulate exited Process objects in _old_procs indefinitely. Extract the prune into _prune_old_procs() and also call it from _graceful_reload so the list stays bounded regardless of whether the proxy ever drains. Addresses review feedback on ray-project#63620. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
91ac10a to
fc26069
Compare
Merging master (which now includes ray-project#63620) brought in a second _prune_old_procs that only filters _old_procs. As the later definition it shadowed the ring-buffer version added here, so retired log files were never deleted and the per-spawn file count grew unbounded again. Remove the filter-only duplicate; the ring version already filters _old_procs to alive workers AND retires/deletes their log files past the cap, and has_alive_old_procs calls it. Addresses Cursor review on ray-project#63621. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…re marking proxy drained (ray-project#63620) ### Summary Fixes a race condition in `HAProxyProxyActor.is_drained()` that caused in-flight TCP connections to be RST'd (and thus 502s from upstream load balancers) whenever a proxy drained while older HAProxy worker processes were still serving long-running requests. Detailed investigation: https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9 ### The bug `HAProxyProxyActor.is_drained()` previously gated on: 1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` (default 30s) 2. the current worker's admin-socket stats reported `is_system_idle` This was unsafe. Each HAProxy graceful reload spawns a new worker that owns the listening socket (via `-x` socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. **Those in-flight connections are invisible to the admin socket** (which is bound to the new worker), so `is_system_idle` could return True while old workers still held long-running connections. Marking the proxy as drained while old workers were alive caused `proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from upstream load balancers (target broke connection); replicas recorded HTTP 499 (uvicorn received an ASGI disconnect mid-request). This affected any deployment with request durations exceeding `PROXY_MIN_DRAINING_PERIOD_S` — e.g., long-running inference, streaming responses, model cold-starts. ### The fix Add condition (3) to `is_drained()`: no OLD workers from `_old_procs` are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or `hard-stop-after` reaps it as a final safety valve), at which point `subprocess.Process.returncode` becomes set. The drain check uses this as a precise per-worker signal. This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S` floor that's still useful for upstream LB target-deregistration propagation. Previously, operators had to guess a `PROXY_MIN_DRAINING_PERIOD_S` value that exceeded the longest request duration across all deployments on the cluster; with this fix, the env var reverts to its name's actual semantic — a minimum floor, not a worst-case ceiling. A new helper `HAProxyApi.has_alive_old_procs()` also prunes already-exited entries from `_old_procs` so the list stays bounded across long-lived proxy actors that perform many reloads. The check order in `is_drained` was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin-socket query. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…re marking proxy drained (ray-project#63620) ### Summary Fixes a race condition in `HAProxyProxyActor.is_drained()` that caused in-flight TCP connections to be RST'd (and thus 502s from upstream load balancers) whenever a proxy drained while older HAProxy worker processes were still serving long-running requests. Detailed investigation: https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9 ### The bug `HAProxyProxyActor.is_drained()` previously gated on: 1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` (default 30s) 2. the current worker's admin-socket stats reported `is_system_idle` This was unsafe. Each HAProxy graceful reload spawns a new worker that owns the listening socket (via `-x` socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. **Those in-flight connections are invisible to the admin socket** (which is bound to the new worker), so `is_system_idle` could return True while old workers still held long-running connections. Marking the proxy as drained while old workers were alive caused `proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from upstream load balancers (target broke connection); replicas recorded HTTP 499 (uvicorn received an ASGI disconnect mid-request). This affected any deployment with request durations exceeding `PROXY_MIN_DRAINING_PERIOD_S` — e.g., long-running inference, streaming responses, model cold-starts. ### The fix Add condition (3) to `is_drained()`: no OLD workers from `_old_procs` are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or `hard-stop-after` reaps it as a final safety valve), at which point `subprocess.Process.returncode` becomes set. The drain check uses this as a precise per-worker signal. This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S` floor that's still useful for upstream LB target-deregistration propagation. Previously, operators had to guess a `PROXY_MIN_DRAINING_PERIOD_S` value that exceeded the longest request duration across all deployments on the cluster; with this fix, the env var reverts to its name's actual semantic — a minimum floor, not a worst-case ceiling. A new helper `HAProxyApi.has_alive_old_procs()` also prunes already-exited entries from `_old_procs` so the list stays bounded across long-lived proxy actors that perform many reloads. The check order in `is_drained` was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin-socket query. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Fixes a race condition in
HAProxyProxyActor.is_drained()that caused in-flight TCP connections to be RST'd (and thus 502s from upstream load balancers) whenever a proxy drained while older HAProxy worker processes were still serving long-running requests.Detailed investigation: https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9
The bug
HAProxyProxyActor.is_drained()previously gated on:PROXY_MIN_DRAINING_PERIOD_S(default 30s)is_system_idleThis was unsafe. Each HAProxy graceful reload spawns a new worker that owns the listening socket (via
-xsocket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. Those in-flight connections are invisible to the admin socket (which is bound to the new worker), sois_system_idlecould return True while old workers still held long-running connections.Marking the proxy as drained while old workers were alive caused
proxy_state.shutdown()→HAProxyApi.stop()→proc.kill()to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from upstream load balancers (target broke connection); replicas recorded HTTP 499 (uvicorn received an ASGI disconnect mid-request).This affected any deployment with request durations exceeding
PROXY_MIN_DRAINING_PERIOD_S— e.g., long-running inference, streaming responses, model cold-starts.The fix
Add condition (3) to
is_drained(): no OLD workers from_old_procsare still alive. Each old worker exits cleanly when its in-flight count reaches zero (orhard-stop-afterreaps it as a final safety valve), at which pointsubprocess.Process.returncodebecomes set. The drain check uses this as a precise per-worker signal.This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the
PROXY_MIN_DRAINING_PERIOD_Sfloor that's still useful for upstream LB target-deregistration propagation. Previously, operators had to guess aPROXY_MIN_DRAINING_PERIOD_Svalue that exceeded the longest request duration across all deployments on the cluster; with this fix, the env var reverts to its name's actual semantic — a minimum floor, not a worst-case ceiling.A new helper
HAProxyApi.has_alive_old_procs()also prunes already-exited entries from_old_procsso the list stays bounded across long-lived proxy actors that perform many reloads.The check order in
is_drainedwas reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin-socket query.🤖 Generated with Claude Code