Skip to content

[1/N] [HAProxy stability] - Wait for old HAProxy workers to exit before marking proxy drained#63620

Merged
abrarsheikh merged 3 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-drain-fix
Jun 2, 2026
Merged

[1/N] [HAProxy stability] - Wait for old HAProxy workers to exit before marking proxy drained#63620
abrarsheikh merged 3 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-drain-fix

Conversation

@harshit-anyscale

@harshit-anyscale harshit-anyscale commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a race condition in HAProxyProxyActor.is_drained() that caused in-flight TCP connections to be RST'd (and thus 502s from upstream load balancers) whenever a proxy drained while older HAProxy worker processes were still serving long-running requests.

Detailed investigation: https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9

The bug

HAProxyProxyActor.is_drained() previously gated on:

  1. drain time exceeded PROXY_MIN_DRAINING_PERIOD_S (default 30s)
  2. the current worker's admin-socket stats reported is_system_idle

This was unsafe. Each HAProxy graceful reload spawns a new worker that owns the listening socket (via -x socket transfer), while the OLD worker enters soft-stop and continues to serve its pre-reload in-flight connections in its own process memory. Those in-flight connections are invisible to the admin socket (which is bound to the new worker), so is_system_idle could return True while old workers still held long-running connections.

Marking the proxy as drained while old workers were alive caused proxy_state.shutdown()HAProxyApi.stop()proc.kill() to SIGKILL the old workers and sever their in-flight TCP connections. Clients saw 502 from upstream load balancers (target broke connection); replicas recorded HTTP 499 (uvicorn received an ASGI disconnect mid-request).

This affected any deployment with request durations exceeding PROXY_MIN_DRAINING_PERIOD_S — e.g., long-running inference, streaming responses, model cold-starts.

The fix

Add condition (3) to is_drained(): no OLD workers from _old_procs are still alive. Each old worker exits cleanly when its in-flight count reaches zero (or hard-stop-after reaps it as a final safety valve), at which point subprocess.Process.returncode becomes set. The drain check uses this as a precise per-worker signal.

This makes the drain adaptive: it lasts exactly as long as the longest in-flight request actually needs, plus the PROXY_MIN_DRAINING_PERIOD_S floor that's still useful for upstream LB target-deregistration propagation. Previously, operators had to guess a PROXY_MIN_DRAINING_PERIOD_S value that exceeded the longest request duration across all deployments on the cluster; with this fix, the env var reverts to its name's actual semantic — a minimum floor, not a worst-case ceiling.

A new helper HAProxyApi.has_alive_old_procs() also prunes already-exited entries from _old_procs so the list stays bounded across long-lived proxy actors that perform many reloads.

The check order in is_drained was reorganized so the cheap, in-memory checks (timer, old-proc list) short-circuit before the async admin-socket query.

🤖 Generated with Claude Code

@harshit-anyscale harshit-anyscale requested a review from a team as a code owner May 25, 2026 09:32
@harshit-anyscale harshit-anyscale self-assigned this May 25, 2026
…ined

`HAProxyProxyActor.is_drained()` previously gated on:
  1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S`
  2. the current worker's admin-socket stats reported `is_system_idle`

This was unsafe. Each graceful reload spawns a new HAProxy worker that
owns the listening socket (via `-x` socket transfer), while the OLD
worker enters soft-stop and continues to serve its pre-reload in-flight
connections in its own process memory. Those in-flight connections are
invisible to the admin socket (which is bound by the new worker), so
`is_system_idle` could return True while old workers still held
long-running connections like 60-120s long-runner requests.

Marking the proxy as drained while old workers were alive caused
`proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to
SIGKILL the old workers and sever their in-flight TCP connections.
Clients saw 502 from ALB (target broke connection); replicas saw
499 (uvicorn received an ASGI disconnect mid-request).

The fix adds condition (2'): no OLD workers from `_old_procs` are
still alive. Each old worker exits cleanly when its in-flight count
reaches zero (or `hard-stop-after` reaps it as final safety valve),
at which point `returncode` becomes set. The drain check uses this
as a precise per-worker signal.

This makes the drain adaptive: it lasts exactly as long as the
longest in-flight request actually needs, plus the
`PROXY_MIN_DRAINING_PERIOD_S` floor that's still useful for ALB
target-deregistration propagation. Previously, operators had to
guess the right `PROXY_MIN_DRAINING_PERIOD_S` value to exceed the
longest expected request duration; with this change, the env var's
role reverts to "minimum drain time" (its name) rather than
"maximum request time" (its de facto behavior).

A new helper `HAProxyApi.has_alive_old_procs()` also prunes already-
exited entries from `_old_procs` so the list stays bounded across
long-lived proxy actors that perform many reloads.

Check order in `is_drained` was reorganized so the cheap, in-memory
checks (timer, old-proc list) short-circuit before the async admin
socket query.
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 9d04af3 to b7a5092 Compare May 25, 2026 09:34
@harshit-anyscale harshit-anyscale changed the title [serve] Wait for old HAProxy workers to exit before marking proxy drained May 25, 2026
@harshit-anyscale harshit-anyscale added the serve Ray Serve Related Issue label May 25, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the HAProxy draining logic by ensuring that old worker processes from prior graceful reloads have fully exited before a proxy is considered drained. It introduces the has_alive_old_procs method to track these processes and integrates it into the is_drained check. A review comment identifies a potential memory leak because the _old_procs list is only pruned during the draining state; it is recommended to also prune this list during reload operations to keep it bounded for long-lived proxies.

I am having trouble creating individual review comments. Click here to see my feedback.

python/ray/serve/_private/haproxy.py (1143)

medium

The _old_procs list is only pruned within has_alive_old_procs, which is currently only called when the proxy is in a draining state. For long-lived proxy actors that undergo many graceful reloads without ever being drained, this list will grow indefinitely, leading to a slow memory leak of asyncio.subprocess.Process objects. Consider pruning this list during the reload or _graceful_reload operations as well to ensure it stays bounded regardless of whether the proxy is draining.

@harshit-anyscale harshit-anyscale changed the title [serve][HAproxy-stability] Wait for old HAProxy workers to exit before marking proxy drained May 25, 2026
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label May 25, 2026

@akyang-anyscale akyang-anyscale left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a test?

@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 45d6692 to 5f420e8 Compare May 29, 2026 08:35
@harshit-anyscale

Copy link
Copy Markdown
Contributor Author

could you add a test?

done.

Cover the drain logic added in this PR: is_drained() stays False while
a soft-stopping old worker from a prior reload is still alive (even
past the min drain period with an idle current worker) and flips True
once it exits; it is also False before the min drain period elapses.

HAProxyManager is an @ray.remote actor, so the imported name is an
ActorClass rather than a plain type; the tests reach the underlying
class via __ray_metadata__.modified_class and instantiate it with
__new__ to exercise is_drained() without starting an actor.

Addresses review request on ray-project#63620.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-drain-fix branch from 5f420e8 to 156e146 Compare May 29, 2026 11:10
Comment thread python/ray/serve/_private/haproxy.py
has_alive_old_procs() was the only pruner, and it only runs from
is_drained() (i.e. while draining). A healthy proxy that reloads but
never drains would accumulate exited Process objects in _old_procs
indefinitely. Extract the prune into _prune_old_procs() and also call
it from _graceful_reload so the list stays bounded regardless of
whether the proxy ever drains.

Addresses review feedback on ray-project#63620.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@abrarsheikh abrarsheikh merged commit 276f535 into ray-project:master Jun 2, 2026
6 checks passed
harshit-anyscale added a commit to harshit-anyscale/ray that referenced this pull request Jun 3, 2026
Merging master (which now includes ray-project#63620) brought in a second
_prune_old_procs that only filters _old_procs. As the later definition
it shadowed the ring-buffer version added here, so retired log files
were never deleted and the per-spawn file count grew unbounded again.

Remove the filter-only duplicate; the ring version already filters
_old_procs to alive workers AND retires/deletes their log files past
the cap, and has_alive_old_procs calls it.

Addresses Cursor review on ray-project#63621.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…re marking proxy drained (ray-project#63620)

### Summary

Fixes a race condition in `HAProxyProxyActor.is_drained()` that caused
in-flight TCP connections to be RST'd (and thus 502s from upstream load
balancers) whenever a proxy drained while older HAProxy worker processes
were still serving long-running requests.

Detailed investigation:
https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9

### The bug

`HAProxyProxyActor.is_drained()` previously gated on:
1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` (default 30s)
2. the current worker's admin-socket stats reported `is_system_idle`

This was unsafe. Each HAProxy graceful reload spawns a new worker that
owns the listening socket (via `-x` socket transfer), while the OLD
worker enters soft-stop and continues to serve its pre-reload in-flight
connections in its own process memory. **Those in-flight connections are
invisible to the admin socket** (which is bound to the new worker), so
`is_system_idle` could return True while old workers still held
long-running connections.

Marking the proxy as drained while old workers were alive caused
`proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to
SIGKILL the old workers and sever their in-flight TCP connections.
Clients saw 502 from upstream load balancers (target broke connection);
replicas recorded HTTP 499 (uvicorn received an ASGI disconnect
mid-request).

This affected any deployment with request durations exceeding
`PROXY_MIN_DRAINING_PERIOD_S` — e.g., long-running inference, streaming
responses, model cold-starts.

### The fix

Add condition (3) to `is_drained()`: no OLD workers from `_old_procs`
are still alive. Each old worker exits cleanly when its in-flight count
reaches zero (or `hard-stop-after` reaps it as a final safety valve), at
which point `subprocess.Process.returncode` becomes set. The drain check
uses this as a precise per-worker signal.

This makes the drain adaptive: it lasts exactly as long as the longest
in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S`
floor that's still useful for upstream LB target-deregistration
propagation. Previously, operators had to guess a
`PROXY_MIN_DRAINING_PERIOD_S` value that exceeded the longest request
duration across all deployments on the cluster; with this fix, the env
var reverts to its name's actual semantic — a minimum floor, not a
worst-case ceiling.

A new helper `HAProxyApi.has_alive_old_procs()` also prunes
already-exited entries from `_old_procs` so the list stays bounded
across long-lived proxy actors that perform many reloads.

The check order in `is_drained` was reorganized so the cheap, in-memory
checks (timer, old-proc list) short-circuit before the async
admin-socket query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…re marking proxy drained (ray-project#63620)

### Summary

Fixes a race condition in `HAProxyProxyActor.is_drained()` that caused
in-flight TCP connections to be RST'd (and thus 502s from upstream load
balancers) whenever a proxy drained while older HAProxy worker processes
were still serving long-running requests.

Detailed investigation:
https://gist.github.com/harshit-anyscale/04ed26e4a0ea71886220fed1f903fca9

### The bug

`HAProxyProxyActor.is_drained()` previously gated on:
1. drain time exceeded `PROXY_MIN_DRAINING_PERIOD_S` (default 30s)
2. the current worker's admin-socket stats reported `is_system_idle`

This was unsafe. Each HAProxy graceful reload spawns a new worker that
owns the listening socket (via `-x` socket transfer), while the OLD
worker enters soft-stop and continues to serve its pre-reload in-flight
connections in its own process memory. **Those in-flight connections are
invisible to the admin socket** (which is bound to the new worker), so
`is_system_idle` could return True while old workers still held
long-running connections.

Marking the proxy as drained while old workers were alive caused
`proxy_state.shutdown()` → `HAProxyApi.stop()` → `proc.kill()` to
SIGKILL the old workers and sever their in-flight TCP connections.
Clients saw 502 from upstream load balancers (target broke connection);
replicas recorded HTTP 499 (uvicorn received an ASGI disconnect
mid-request).

This affected any deployment with request durations exceeding
`PROXY_MIN_DRAINING_PERIOD_S` — e.g., long-running inference, streaming
responses, model cold-starts.

### The fix

Add condition (3) to `is_drained()`: no OLD workers from `_old_procs`
are still alive. Each old worker exits cleanly when its in-flight count
reaches zero (or `hard-stop-after` reaps it as a final safety valve), at
which point `subprocess.Process.returncode` becomes set. The drain check
uses this as a precise per-worker signal.

This makes the drain adaptive: it lasts exactly as long as the longest
in-flight request actually needs, plus the `PROXY_MIN_DRAINING_PERIOD_S`
floor that's still useful for upstream LB target-deregistration
propagation. Previously, operators had to guess a
`PROXY_MIN_DRAINING_PERIOD_S` value that exceeded the longest request
duration across all deployments on the cluster; with this fix, the env
var reverts to its name's actual semantic — a minimum floor, not a
worst-case ceiling.

A new helper `HAProxyApi.has_alive_old_procs()` also prunes
already-exited entries from `_old_procs` so the list stays bounded
across long-lived proxy actors that perform many reloads.

The check order in `is_drained` was reorganized so the cheap, in-memory
checks (timer, old-proc list) short-circuit before the async
admin-socket query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

4 participants