[2/N] [HAProxy stability] - stderr file redirect, broadcast coalescing, redispatch by harshit-anyscale · Pull Request #63308 · ray-project/ray

harshit-anyscale · 2026-05-12T18:57:18Z

Summary

Three targeted HAProxy fixes addressing the highest-impact failure modes observed under production load. Each is independently motivated.

1. Redirect HAProxy stderr to a file (avoids the pipe-buffer deadlock)

The problem. The proxy actor spawns HAProxy via asyncio.create_subprocess_exec(..., stderr=PIPE). That creates an OS pipe with a fixed 64 KB kernel buffer:

HAProxy ──[write(2, ...)]──► [64 KB kernel buffer] ──► proc.stderr (read by Python)

HAProxy's config uses log local0 debug plus option httplog — high-severity logging on every HTTP request and every health-check transition. In environments where syslog is unreachable (containers without /dev/log, no listener on 127.0.0.1:syslog_port), HAProxy falls back to writing those log lines to stderr. At sustained load, this is thousands of lines per second.

If the proxy actor never reads from proc.stderr, the 64 KB buffer fills in seconds. Once full, HAProxy's write(2) on stderr blocks — including writes from threads serving the admin socket. From the outside, runtime-API commands appear to hang, healthz probes time out, and the actor cascades into reload/restart loops. This was the root cause behind the "60-second admin-socket deadlock" we kept hitting under load.

The fix. Instead of piping stderr back to Python, redirect HAProxy's stderr fd directly to a file at spawn time:

with open(tmp_path, "ab", buffering=0) as stderr_file:
    proc = await asyncio.create_subprocess_exec(..., stderr=stderr_file)
os.replace(tmp_path, f"{socket_path}.stderr.{proc.pid}.log")

The kernel uses dup2() at fork to wire the child's fd 2 to the file. No pipe is created, so the 64 KB buffer that previously deadlocked admin-socket threads doesn't exist — write(2) to a regular file doesn't have the "consumer hasn't read me, block now" semantic that pipes do. The rename after spawn is safe because Linux file descriptors bind to inodes, not paths, so the child keeps writing to the same inode under the new pid-suffixed name.

Once the pipe was eliminated, the admin-socket-deadlock cascades disappeared.

2. Coalesce controller broadcasts into a single reload

Under autoscaling churn the Serve controller fires target_groups and fallback_targets broadcasts independently, often only tens of ms apart. Without coalescing, each broadcast triggers its own config regeneration and graceful reload via -sf.

HAProxyManager now marks state dirty on each broadcast and arms a single sleeping coalesce task; updates arriving during the sleep window are absorbed into the same pending apply. Window defaults to 100 ms via RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S; set to 0 to disable. If an apply fails, the task re-arms and retries on the next tick (up to 3 consecutive failures, then waits for a new broadcast — avoids busy-spinning on a persistent error).

3. `option redispatch` + `retry-on conn-failure empty-response` in defaults

Under churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. The two directives — added to the defaults block so every backend inherits them — make the retry transparent:

option redispatch sends the retry to a different slot, not the same one.
retry-on conn-failure empty-response restricts retries to cases that cannot have leaked partial bytes to the client: TCP connect failure (no bytes sent) and empty response (slot died before sending a body). Streaming responses with a partial body are untouched.

option redispatch was previously set only on the *-via-ingress-request-router backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients.

Out of scope

The bigger PR (#63159) includes additional architectural changes (server-template slot pools, runtime-API server updates, plus other refinements). This PR is deliberately the minimal high-impact subset for an easier review and lower-risk merge.

Other knobs explored during the investigation (option abortonclose removal, timeout server / timeout connect defaults, hard-stop-after bump) are intentionally not part of this PR — those are better addressed client-side for the load-test workload rather than as new HAProxy defaults.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces several stability improvements for HAProxy within Ray Serve. Key changes include increasing the hard stop timeout to 1800 seconds and establishing default server and connection timeouts to prevent undefined behavior during reloads. Additionally, the option abortonclose setting was removed from the configuration. A critical fix involves the implementation of a background task to drain HAProxy's stderr into a per-PID log file, which prevents potential deadlocks caused by the OS pipe buffer filling up. Feedback on these changes suggests increasing the logging severity from DEBUG to WARNING if the stderr drainer fails, ensuring that any failure in this mechanism is visible since it could lead back to the deadlock issue.

HAProxy runs in `-db` debug mode and emits hundreds of stderr lines per second under load. The proxy actor captured stderr via `stderr=PIPE` but never read from it, so the 64KB OS pipe buffer fills in seconds and HAProxy blocks on its next `write(2)` — including from threads serving the admin socket. Runtime-API calls then time out and look like a deadlock from outside. Spawn a fire-and-forget asyncio task right after startup that drains stderr to a per-pid log file. Output is bounded to 10 MB per pid via truncate-on-roll so a long-lived proc cannot fill the disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Serve controller emits TARGET_GROUPS and FALLBACK_TARGETS long-poll updates independently, often within tens of milliseconds of each other during autoscaling churn. Each broadcast was kicking off its own reload, serializing on the reload lock and amplifying replica thrash into proxy churn. Introduce a coalescing window: a broadcast marks state dirty and starts (or extends) a single asyncio task that sleeps the window and then applies the latest state. Default window is 100ms; tunable via `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S`, and setting it to 0 falls back to the prior synchronous behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under autoscaling churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. Add two defaults-level directives so requests transparently retry to a healthy peer: - `option redispatch` — when a retry is allowed, send it to a different slot (not the same one). - `retry-on conn-failure empty-response` — only retry in cases that cannot have leaked partial bytes to the client: TCP connect failure (no bytes sent) and empty response (slot died before sending a body). Streaming responses with a partial body are untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

harshit-anyscale · 2026-05-14T06:32:24Z

three test runs after these changes:

all have the failure rate < 0.01%.

After enabling `option redispatch` + `retry-on conn-failure empty-response` in the HAProxy defaults block, HAProxy retries a failed connect to the next slot. When every slot is down (which is the scenario this test creates by killing the only replica), retries exhaust and HAProxy returns its built-in 503 "Service Unavailable" instead of surfacing the original 502. The 502 errorfile maps 502/504 to 500, but it does not cover 503, so the response the client sees changes from 500 to 503. Both are valid error responses for "the service is currently not available"; loosen the assertion to accept either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous comment described what the setting does ("window for coalescing back-to-back controller broadcasts") but not why fast reloads are harmful. Rewrite to lead with the actual concern: reloading HAProxy at controller-broadcast speed (every few tens of ms) bursts process handoffs, saturates the proxy actor's event loop, and forces overlapping drains on the old proc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the asyncio drain task with a direct file redirect: open a stderr destination file before spawn, pass it as `stderr=...` so the kernel uses dup2 to wire the child's fd 2 to the file, then rename the file to include the child's pid after spawn returns. No pipe is created, so the 64KB kernel buffer that previously deadlocked admin-socket threads when stderr backed up can't form. The fix is structural rather than behavioral — there is no buffer to fill because there is no consumer to be slow. Linux file descriptors bind to inodes rather than paths, so renaming the file after spawn is safe: the child keeps writing to the same inode under the new name. Trade-off: the previous drain task capped each file at 10MB via a periodic truncate. Under the file-redirect model the file grows unbounded until the proc exits. Acceptable for now given realistic stderr volumes; a periodic size-cap task can be layered on later if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shorter wording for two comments that were doing more explaining than the code needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates the actor) and drops the rest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When `_update_haproxy_backends` threw, `_update_pending` had already been flipped to False before the apply, so the loop would exit and the HAProxy state would stay stale until the next broadcast — which may never arrive in a settled cluster. Re-arm `_update_pending` in the except branch so a failed apply is retried on the next coalesce tick. Cap retries at 3 consecutive failures to avoid busy-looping on a persistent error (e.g. a bad config HAProxy refuses to parse); past that, wait for the next broadcast rather than spinning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove the "overlapping `-sf` handoffs saturate the proxy actor's event loop" line — we never confirmed that as the failure mode. Keep the observable problem: under churn, broadcasts can fire reloads tens of ms apart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

7 lines → 3 lines. Keeps the surprising bit (rename safety via fd-binds-to-inode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After redirecting HAProxy's stderr to a file at spawn, `proc.stderr` became None and `_wait_for_hap_availability` fell back to b"" when the process crashed. The resulting RuntimeError message was just "exit code N" with no diagnostic — even though HAProxy had written the real error (bind failure, config parse error, etc.) to the on-disk stderr log file. Read the tail of the per-pid stderr file when proc.stderr is None, so startup-crash errors carry the actual reason again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pid scheme required an awkward open-with-temp-name then os.replace because the child's pid isn't known until create_subprocess_exec returns. Switch to a monotonic spawn counter held on the HAProxyApi instance: - The stderr path is known pre-spawn, so the file opens at its final name and stays there. - No more `stderr.starting.log` intermediate, no os.replace, no rename race window. - The path is attached to the proc as `_stderr_path` and travels with it, so `_wait_for_hap_availability` reads from the right file whether it was called on a fresh spawn or on an old proc during graceful reload. A new info-level log at spawn time emits the (spawn#, pid, stderr path, args) tuple so the pid → stderr-file mapping is recoverable from the actor log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_read_stderr_log_tail` was used in exactly one place and existed only to give a name to "open file, seek to end, read last 4KB." Inline it into `_wait_for_hap_availability` so the entire crash-message logic is visible at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 247b07c. Configure here.}

cursor · 2026-05-15T09:33:36Z

+    # Retry to a different slot on safe failures only: connect failed
+    # (no bytes sent) or empty response (slot died before sending body).
+    option redispatch
+    retry-on conn-failure empty-response


Retry exhaustion exposes 503 bypassing error normalization to 500

Medium Severity

Adding retry-on conn-failure empty-response changes the failure path: previously a connection failure produced a 502 (caught by errorfile 502 and normalized to 500), but now retry exhaustion produces a 503 which has no matching errorfile 503 directive. This breaks the stated design goal of normalizing proxy errors to 500 and exposes clients to raw 503 responses with HAProxy's built-in error page instead of the expected "Internal Server Error" body. The test in test_http_routes.py was loosened to accept in (500, 503) rather than fixing the gap.

Additional Locations (1)

python/ray/serve/_private/haproxy_templates.py#L64-L68

��

^{Reviewed by Cursor Bugbot for commit 247b07c. Configure here.}

harshit-anyscale · 2026-05-15T15:25:04Z

latest 3 runs with these changes:

https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_g7p6lsu6r8g7garwbxifppyz23/jobs/prodjob_cdfkbj1cik2zpedd1w7wgrl79y?job-tab=overview&job-logs-section-tabs=application_logs

number of failed requests < 0.01%

kouroshHakha · 2026-05-16T01:53:35Z

Great work @harshit-anyscale. We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386). I want to make sure we are not trading off stability for speed.

harshit-anyscale · 2026-05-18T04:03:33Z

We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386)

makes sense; cool, will plan to merge it after 3-4 days of merging of #63386 .

harshit-anyscale · 2026-05-25T11:43:29Z

closing this PR to create smaller ones, each for an individual fix, that would help in reviewing and also, we can do partial rollback if necessary.

cc: @kouroshHakha

smaller PRs:

[2/N] [HAProxy stability] - redirect stdout and stderr to files to avoid pipe-buffer deadlock #63621
[3/N] [HAProxy stability] - option redispatch 1 + retry-on conn-failure #63622

…oid pipe-buffer deadlock (#63621) ## Summary Carved out of #63308 so the std-stream redirect fix can land independently of the broadcast-coalescing and redispatch changes in that PR. Today the proxy actor spawns HAProxy with `stdout=PIPE` / `stderr=PIPE` and never reads from them. Under load HAProxy's `Alert()` / `Warning()` calls (replicas flapping, backends going DOWN/UP, config reload events) can emit hundreds of stderr lines per second. The 64KB kernel pipe buffer fills in seconds, HAProxy blocks on its next `write(2)` — including from threads serving the admin socket — and runtime-API calls start timing out. From the outside it looks like a deadlock. This PR redirects both standard streams to files at spawn time: - Open `<socket_path>.<N>.stdout.log` and `<socket_path>.<N>.stderr.log` before `create_subprocess_exec` and pass them as `stdout=...` / `stderr=...`, so the kernel uses `dup2` to wire the child's fds 1 and 2 directly to the files. No pipes, no buffers to fill. - `N` is a monotonic spawn counter on `HAProxyApi`. Using a counter (rather than pid) keeps the path known pre-spawn and lets diagnostics correlate across graceful reloads. - Stash both paths on the `proc` object so the startup-crash error path can tail the files when reporting why HAProxy died. The tail logic is shared between the two streams via a small `_tail_file` helper. Trade-off: the files grow unbounded until the proc exits. Realistic volumes make this fine for now; a periodic size-cap task can be layered on later if needed. ## Design Q&A **Q: Why a separate pair of files per spawn instead of one shared pair?** Per-spawn files let us correlate output with a specific reload, keep the prior worker's logs intact while the new worker runs (workers overlap during graceful reload), and avoid two `dup2`'d fds appending into the same inode and interleaving lines. --------- Signed-off-by: harshit-anyscale <harshit@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…re (#63622) ## Summary Carved out of #63308 so the retry/redispatch fix can land independently of the stderr-redirect (#63621) and broadcast-coalescing changes in that PR. Under autoscaling churn, the first connect attempt to a slot can hit a replica that's briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). HAProxy returns the resulting 5xx directly to the client even though a peer slot is up and ready. This PR adds three defaults-level directives so churn-class failures are retried transparently and normalized consistently: - **`option redispatch 1`** — every retry picks a *different* slot, not just the last one (HAProxy's default). With default `retries 3`, a bare `option redispatch` would send `A, A, A, B` against a dead slot; `redispatch 1` makes it `A, B, C, D` so a dead slot doesn't burn N attempts on itself. Previously redispatch was set only on the `*-via-ingress-request-router` backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients. - **`retry-on conn-failure`** — retries only TCP connect failures: nothing was sent to the replica, so the request is safe to replay for any method. We deliberately do **not** retry backend 503s (those are Serve's intentional `max_queued_requests` backpressure / `DeploymentUnavailable` responses — retrying them defeats backpressure and amplifies load), and do **not** add `empty-response` (the request already reached the replica, so replaying it could double-execute a non-idempotent endpoint, and Serve hosts arbitrary user code we can't assume is idempotent). HAProxy's own "no server" 503 isn't matched by `retry-on <status>` regardless. - **`errorfile 503`** — added alongside the existing `errorfile 502` / `errorfile 504`. When `conn-failure` retries are exhausted (all slots unreachable), HAProxy emits its *own* 503; routing it through the 500.http file preserves Serve's contract that upstream-unavailable looks like a 500. (This only rewrites HAProxy-generated 503s, not a backend's deliberate backpressure 503, which still reaches the client as 503.) Streaming responses where a partial body has already been written are explicitly **not** retried — neither directive triggers on `junk-response` / `response-timeout`. ### Changes from the original #63308 form - `retry-on conn-failure empty-response` → `retry-on conn-failure`. Dropped `empty-response`: it replays a request that already reached the replica, which can double-execute a non-idempotent endpoint (Serve runs arbitrary user code, so we can't assume idempotency). `conn-failure` alone covers the churn case (slot briefly down → connect fails) and is safe for any method. - `option redispatch` → `option redispatch 1`. The bare form only redispatches the final retry; `1` redispatches every retry. - Added `errorfile 503` to normalize HAProxy's own no-server / connect-exhaustion 503 to 500. ([Cursor review feedback](#63622 (comment))). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…gle reload (#63623) ## Summary Carved out of #63308 so the coalescing change can land independently of the stderr-redirect (#63621) and redispatch (#63622) changes in that PR. Under autoscaling churn the controller's `target_groups` broadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via `-sf`. Both `target_groups` and `fallback_targets` are emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads. Both broadcasts are already change-gated on the controller (`broadcast_*_if_changed`), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (`fallback_targets` in particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.) This PR collapses adjacent broadcasts into a single reload: - `HAProxyManager` keeps two pieces of state: `_update_pending: bool` and `_coalesce_task: Optional[asyncio.Task]`. - Each incoming broadcast sets `_update_pending = True` and arms (or re-arms) a single sleeping coalesce task. - The task sleeps for `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 100 ms; set to 0 to disable) and then runs one `_update_haproxy_backends()` call against whatever the latest state is. - Broadcasts arriving inside the window are absorbed — they flip the dirty flag, but the task is already armed, so no second reload is scheduled. **Failure handling.** If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands. **Knob.** `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to `0` to opt out entirely. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…oid pipe-buffer deadlock (ray-project#63621) ## Summary Carved out of ray-project#63308 so the std-stream redirect fix can land independently of the broadcast-coalescing and redispatch changes in that PR. Today the proxy actor spawns HAProxy with `stdout=PIPE` / `stderr=PIPE` and never reads from them. Under load HAProxy's `Alert()` / `Warning()` calls (replicas flapping, backends going DOWN/UP, config reload events) can emit hundreds of stderr lines per second. The 64KB kernel pipe buffer fills in seconds, HAProxy blocks on its next `write(2)` — including from threads serving the admin socket — and runtime-API calls start timing out. From the outside it looks like a deadlock. This PR redirects both standard streams to files at spawn time: - Open `<socket_path>.<N>.stdout.log` and `<socket_path>.<N>.stderr.log` before `create_subprocess_exec` and pass them as `stdout=...` / `stderr=...`, so the kernel uses `dup2` to wire the child's fds 1 and 2 directly to the files. No pipes, no buffers to fill. - `N` is a monotonic spawn counter on `HAProxyApi`. Using a counter (rather than pid) keeps the path known pre-spawn and lets diagnostics correlate across graceful reloads. - Stash both paths on the `proc` object so the startup-crash error path can tail the files when reporting why HAProxy died. The tail logic is shared between the two streams via a small `_tail_file` helper. Trade-off: the files grow unbounded until the proc exits. Realistic volumes make this fine for now; a periodic size-cap task can be layered on later if needed. ## Design Q&A **Q: Why a separate pair of files per spawn instead of one shared pair?** Per-spawn files let us correlate output with a specific reload, keep the prior worker's logs intact while the new worker runs (workers overlap during graceful reload), and avoid two `dup2`'d fds appending into the same inode and interleaving lines. --------- Signed-off-by: harshit-anyscale <harshit@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…re (ray-project#63622) ## Summary Carved out of ray-project#63308 so the retry/redispatch fix can land independently of the stderr-redirect (ray-project#63621) and broadcast-coalescing changes in that PR. Under autoscaling churn, the first connect attempt to a slot can hit a replica that's briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). HAProxy returns the resulting 5xx directly to the client even though a peer slot is up and ready. This PR adds three defaults-level directives so churn-class failures are retried transparently and normalized consistently: - **`option redispatch 1`** — every retry picks a *different* slot, not just the last one (HAProxy's default). With default `retries 3`, a bare `option redispatch` would send `A, A, A, B` against a dead slot; `redispatch 1` makes it `A, B, C, D` so a dead slot doesn't burn N attempts on itself. Previously redispatch was set only on the `*-via-ingress-request-router` backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients. - **`retry-on conn-failure`** — retries only TCP connect failures: nothing was sent to the replica, so the request is safe to replay for any method. We deliberately do **not** retry backend 503s (those are Serve's intentional `max_queued_requests` backpressure / `DeploymentUnavailable` responses — retrying them defeats backpressure and amplifies load), and do **not** add `empty-response` (the request already reached the replica, so replaying it could double-execute a non-idempotent endpoint, and Serve hosts arbitrary user code we can't assume is idempotent). HAProxy's own "no server" 503 isn't matched by `retry-on <status>` regardless. - **`errorfile 503`** — added alongside the existing `errorfile 502` / `errorfile 504`. When `conn-failure` retries are exhausted (all slots unreachable), HAProxy emits its *own* 503; routing it through the 500.http file preserves Serve's contract that upstream-unavailable looks like a 500. (This only rewrites HAProxy-generated 503s, not a backend's deliberate backpressure 503, which still reaches the client as 503.) Streaming responses where a partial body has already been written are explicitly **not** retried — neither directive triggers on `junk-response` / `response-timeout`. ### Changes from the original ray-project#63308 form - `retry-on conn-failure empty-response` → `retry-on conn-failure`. Dropped `empty-response`: it replays a request that already reached the replica, which can double-execute a non-idempotent endpoint (Serve runs arbitrary user code, so we can't assume idempotency). `conn-failure` alone covers the churn case (slot briefly down → connect fails) and is safe for any method. - `option redispatch` → `option redispatch 1`. The bare form only redispatches the final retry; `1` redispatches every retry. - Added `errorfile 503` to normalize HAProxy's own no-server / connect-exhaustion 503 to 500. ([Cursor review feedback](ray-project#63622 (comment))). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…gle reload (ray-project#63623) ## Summary Carved out of ray-project#63308 so the coalescing change can land independently of the stderr-redirect (ray-project#63621) and redispatch (ray-project#63622) changes in that PR. Under autoscaling churn the controller's `target_groups` broadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via `-sf`. Both `target_groups` and `fallback_targets` are emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads. Both broadcasts are already change-gated on the controller (`broadcast_*_if_changed`), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (`fallback_targets` in particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.) This PR collapses adjacent broadcasts into a single reload: - `HAProxyManager` keeps two pieces of state: `_update_pending: bool` and `_coalesce_task: Optional[asyncio.Task]`. - Each incoming broadcast sets `_update_pending = True` and arms (or re-arms) a single sleeping coalesce task. - The task sleeps for `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 100 ms; set to 0 to disable) and then runs one `_update_haproxy_backends()` call against whatever the latest state is. - Broadcasts arriving inside the window are absorbed — they flip the dirty flag, but the task is already armed, so no second reload is scheduled. **Failure handling.** If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands. **Knob.** `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to `0` to opt out entirely. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…oid pipe-buffer deadlock (ray-project#63621) ## Summary Carved out of ray-project#63308 so the std-stream redirect fix can land independently of the broadcast-coalescing and redispatch changes in that PR. Today the proxy actor spawns HAProxy with `stdout=PIPE` / `stderr=PIPE` and never reads from them. Under load HAProxy's `Alert()` / `Warning()` calls (replicas flapping, backends going DOWN/UP, config reload events) can emit hundreds of stderr lines per second. The 64KB kernel pipe buffer fills in seconds, HAProxy blocks on its next `write(2)` — including from threads serving the admin socket — and runtime-API calls start timing out. From the outside it looks like a deadlock. This PR redirects both standard streams to files at spawn time: - Open `<socket_path>.<N>.stdout.log` and `<socket_path>.<N>.stderr.log` before `create_subprocess_exec` and pass them as `stdout=...` / `stderr=...`, so the kernel uses `dup2` to wire the child's fds 1 and 2 directly to the files. No pipes, no buffers to fill. - `N` is a monotonic spawn counter on `HAProxyApi`. Using a counter (rather than pid) keeps the path known pre-spawn and lets diagnostics correlate across graceful reloads. - Stash both paths on the `proc` object so the startup-crash error path can tail the files when reporting why HAProxy died. The tail logic is shared between the two streams via a small `_tail_file` helper. Trade-off: the files grow unbounded until the proc exits. Realistic volumes make this fine for now; a periodic size-cap task can be layered on later if needed. ## Design Q&A **Q: Why a separate pair of files per spawn instead of one shared pair?** Per-spawn files let us correlate output with a specific reload, keep the prior worker's logs intact while the new worker runs (workers overlap during graceful reload), and avoid two `dup2`'d fds appending into the same inode and interleaving lines. --------- Signed-off-by: harshit-anyscale <harshit@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…re (ray-project#63622) ## Summary Carved out of ray-project#63308 so the retry/redispatch fix can land independently of the stderr-redirect (ray-project#63621) and broadcast-coalescing changes in that PR. Under autoscaling churn, the first connect attempt to a slot can hit a replica that's briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). HAProxy returns the resulting 5xx directly to the client even though a peer slot is up and ready. This PR adds three defaults-level directives so churn-class failures are retried transparently and normalized consistently: - **`option redispatch 1`** — every retry picks a *different* slot, not just the last one (HAProxy's default). With default `retries 3`, a bare `option redispatch` would send `A, A, A, B` against a dead slot; `redispatch 1` makes it `A, B, C, D` so a dead slot doesn't burn N attempts on itself. Previously redispatch was set only on the `*-via-ingress-request-router` backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients. - **`retry-on conn-failure`** — retries only TCP connect failures: nothing was sent to the replica, so the request is safe to replay for any method. We deliberately do **not** retry backend 503s (those are Serve's intentional `max_queued_requests` backpressure / `DeploymentUnavailable` responses — retrying them defeats backpressure and amplifies load), and do **not** add `empty-response` (the request already reached the replica, so replaying it could double-execute a non-idempotent endpoint, and Serve hosts arbitrary user code we can't assume is idempotent). HAProxy's own "no server" 503 isn't matched by `retry-on <status>` regardless. - **`errorfile 503`** — added alongside the existing `errorfile 502` / `errorfile 504`. When `conn-failure` retries are exhausted (all slots unreachable), HAProxy emits its *own* 503; routing it through the 500.http file preserves Serve's contract that upstream-unavailable looks like a 500. (This only rewrites HAProxy-generated 503s, not a backend's deliberate backpressure 503, which still reaches the client as 503.) Streaming responses where a partial body has already been written are explicitly **not** retried — neither directive triggers on `junk-response` / `response-timeout`. ### Changes from the original ray-project#63308 form - `retry-on conn-failure empty-response` → `retry-on conn-failure`. Dropped `empty-response`: it replays a request that already reached the replica, which can double-execute a non-idempotent endpoint (Serve runs arbitrary user code, so we can't assume idempotency). `conn-failure` alone covers the churn case (slot briefly down → connect fails) and is safe for any method. - `option redispatch` → `option redispatch 1`. The bare form only redispatches the final retry; `1` redispatches every retry. - Added `errorfile 503` to normalize HAProxy's own no-server / connect-exhaustion 503 to 500. ([Cursor review feedback](ray-project#63622 (comment))). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…gle reload (ray-project#63623) ## Summary Carved out of ray-project#63308 so the coalescing change can land independently of the stderr-redirect (ray-project#63621) and redispatch (ray-project#63622) changes in that PR. Under autoscaling churn the controller's `target_groups` broadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via `-sf`. Both `target_groups` and `fallback_targets` are emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads. Both broadcasts are already change-gated on the controller (`broadcast_*_if_changed`), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (`fallback_targets` in particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.) This PR collapses adjacent broadcasts into a single reload: - `HAProxyManager` keeps two pieces of state: `_update_pending: bool` and `_coalesce_task: Optional[asyncio.Task]`. - Each incoming broadcast sets `_update_pending = True` and arms (or re-arms) a single sleeping coalesce task. - The task sleeps for `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 100 ms; set to 0 to disable) and then runs one `_update_haproxy_backends()` call against whatever the latest state is. - Broadcasts arriving inside the window are absorbed — they flip the dirty flag, but the task is already armed, so no second reload is scheduled. **Failure handling.** If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands. **Knob.** `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to `0` to opt out entirely. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

harshit-anyscale requested a review from a team as a code owner May 12, 2026 18:57

gemini-code-assist Bot reviewed May 12, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py Outdated

cursor Bot reviewed May 12, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py Outdated

Comment thread python/ray/serve/_private/haproxy.py Outdated

ray-gardener Bot added the serve Ray Serve Related Issue label May 12, 2026

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 3c44db2 to 002fd33 Compare May 13, 2026 15:29

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 002fd33 to 5bb2f5b Compare May 13, 2026 17:45

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 5bb2f5b to 3a30a51 Compare May 13, 2026 19:18

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, abortonclose, timeout defaults, hard-stop~~ May 13, 2026

harshit-anyscale and others added 2 commits May 13, 2026 19:52

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 3a30a51 to e61e719 Compare May 13, 2026 19:53

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, abortonclose, broadcast coalescing~~ May 13, 2026

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, broadcast coalescing~~ May 14, 2026

harshit-anyscale and others added 2 commits May 14, 2026 18:01

Merge branch 'master' into serve-haproxy-minimal-fixes

56aee88

harshit-anyscale added the go add ONLY when ready to merge, run all tests label May 14, 2026

harshit-anyscale self-assigned this May 14, 2026

harshit-anyscale and others added 7 commits May 15, 2026 04:07

[serve] HAProxy: tighten retry and coalescing comments

097e396

Shorter wording for two comments that were doing more explaining than the code needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[serve] Shorten broadcast-coalesce comment

f4773af

Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates the actor) and drops the rest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[serve] Shorten stderr-redirect comment

45d6acf

7 lines → 3 lines. Keeps the surprising bit (rename safety via fd-binds-to-inode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, broadcast coalescing, redispatch~~ May 15, 2026

cursor Bot reviewed May 15, 2026

View reviewed changes

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from c1f08b0 to 45d6acf Compare May 15, 2026 07:45

harshit-anyscale and others added 3 commits May 15, 2026 08:28

cursor Bot reviewed May 15, 2026

View reviewed changes

kouroshHakha approved these changes May 16, 2026

View reviewed changes

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch~~ May 25, 2026

This was referenced May 25, 2026

[2/N] [HAProxy stability] - redirect stdout and stderr to files to avoid pipe-buffer deadlock #63621

Merged

[3/N] [HAProxy stability] - option redispatch 1 + retry-on conn-failure #63622

Merged

harshit-anyscale closed this May 25, 2026

harshit-anyscale mentioned this pull request May 25, 2026

[4/N] [HAProxy stability] - coalesce controller broadcasts into a single reload #63623

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[2/N] [HAProxy stability] - stderr file redirect, broadcast coalescing, redispatch#63308

[2/N] [HAProxy stability] - stderr file redirect, broadcast coalescing, redispatch#63308
harshit-anyscale wants to merge 15 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-minimal-fixes

harshit-anyscale commented May 12, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

harshit-anyscale commented May 14, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

cursor Bot May 15, 2026

harshit-anyscale commented May 15, 2026

kouroshHakha commented May 16, 2026

harshit-anyscale commented May 18, 2026

harshit-anyscale commented May 25, 2026

Labels

2 participants

Uh oh!

Conversation

harshit-anyscale commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Redirect HAProxy stderr to a file (avoids the pipe-buffer deadlock)

2. Coalesce controller broadcasts into a single reload

3. option redispatch + retry-on conn-failure empty-response in defaults

Out of scope

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

harshit-anyscale commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

cursor Bot May 15, 2026

Choose a reason for hiding this comment

Retry exhaustion exposes 503 bypassing error normalization to 500

harshit-anyscale commented May 15, 2026

kouroshHakha commented May 16, 2026

harshit-anyscale commented May 18, 2026

harshit-anyscale commented May 25, 2026

Labels

2 participants

harshit-anyscale commented May 12, 2026 •

edited

Loading

3. `option redispatch` + `retry-on conn-failure empty-response` in defaults

harshit-anyscale commented May 14, 2026 •

edited

Loading