[2/N] [HAProxy stability] - stderr file redirect, broadcast coalescing, redispatch#63308
[2/N] [HAProxy stability] - stderr file redirect, broadcast coalescing, redispatch#63308harshit-anyscale wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces several stability improvements for HAProxy within Ray Serve. Key changes include increasing the hard stop timeout to 1800 seconds and establishing default server and connection timeouts to prevent undefined behavior during reloads. Additionally, the option abortonclose setting was removed from the configuration. A critical fix involves the implementation of a background task to drain HAProxy's stderr into a per-PID log file, which prevents potential deadlocks caused by the OS pipe buffer filling up. Feedback on these changes suggests increasing the logging severity from DEBUG to WARNING if the stderr drainer fails, ensuring that any failure in this mechanism is visible since it could lead back to the deadlock issue.
3c44db2 to
002fd33
Compare
002fd33 to
5bb2f5b
Compare
5bb2f5b to
3a30a51
Compare
HAProxy runs in `-db` debug mode and emits hundreds of stderr lines per second under load. The proxy actor captured stderr via `stderr=PIPE` but never read from it, so the 64KB OS pipe buffer fills in seconds and HAProxy blocks on its next `write(2)` — including from threads serving the admin socket. Runtime-API calls then time out and look like a deadlock from outside. Spawn a fire-and-forget asyncio task right after startup that drains stderr to a per-pid log file. Output is bounded to 10 MB per pid via truncate-on-roll so a long-lived proc cannot fill the disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Serve controller emits TARGET_GROUPS and FALLBACK_TARGETS long-poll updates independently, often within tens of milliseconds of each other during autoscaling churn. Each broadcast was kicking off its own reload, serializing on the reload lock and amplifying replica thrash into proxy churn. Introduce a coalescing window: a broadcast marks state dirty and starts (or extends) a single asyncio task that sleeps the window and then applies the latest state. Default window is 100ms; tunable via `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S`, and setting it to 0 falls back to the prior synchronous behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3a30a51 to
e61e719
Compare
Under autoscaling churn, the first connect attempt to a slot can hit a
replica that is briefly down (just-removed, mid-restart, scaling-up
slot not yet accepting). Today HAProxy returns the resulting 502/503
directly to the client even though a peer slot is up and ready.
Add two defaults-level directives so requests transparently retry to a
healthy peer:
- `option redispatch` — when a retry is allowed, send it to a
different slot (not the same one).
- `retry-on conn-failure empty-response` — only retry in cases that
cannot have leaked partial bytes to the client: TCP connect
failure (no bytes sent) and empty response (slot died before
sending a body). Streaming responses with a partial body are
untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
three test runs after these changes: all have the failure rate < 0.01%. |
After enabling `option redispatch` + `retry-on conn-failure empty-response` in the HAProxy defaults block, HAProxy retries a failed connect to the next slot. When every slot is down (which is the scenario this test creates by killing the only replica), retries exhaust and HAProxy returns its built-in 503 "Service Unavailable" instead of surfacing the original 502. The 502 errorfile maps 502/504 to 500, but it does not cover 503, so the response the client sees changes from 500 to 503. Both are valid error responses for "the service is currently not available"; loosen the assertion to accept either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous comment described what the setting does ("window for
coalescing back-to-back controller broadcasts") but not why fast
reloads are harmful. Rewrite to lead with the actual concern:
reloading HAProxy at controller-broadcast speed (every few tens of
ms) bursts process handoffs, saturates the proxy actor's event loop,
and forces overlapping drains on the old proc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the asyncio drain task with a direct file redirect: open a stderr destination file before spawn, pass it as `stderr=...` so the kernel uses dup2 to wire the child's fd 2 to the file, then rename the file to include the child's pid after spawn returns. No pipe is created, so the 64KB kernel buffer that previously deadlocked admin-socket threads when stderr backed up can't form. The fix is structural rather than behavioral — there is no buffer to fill because there is no consumer to be slow. Linux file descriptors bind to inodes rather than paths, so renaming the file after spawn is safe: the child keeps writing to the same inode under the new name. Trade-off: the previous drain task capped each file at 10MB via a periodic truncate. Under the file-redirect model the file grows unbounded until the proc exits. Acceptable for now given realistic stderr volumes; a periodic size-cap task can be layered on later if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shorter wording for two comments that were doing more explaining than the code needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates the actor) and drops the rest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When `_update_haproxy_backends` threw, `_update_pending` had already been flipped to False before the apply, so the loop would exit and the HAProxy state would stay stale until the next broadcast — which may never arrive in a settled cluster. Re-arm `_update_pending` in the except branch so a failed apply is retried on the next coalesce tick. Cap retries at 3 consecutive failures to avoid busy-looping on a persistent error (e.g. a bad config HAProxy refuses to parse); past that, wait for the next broadcast rather than spinning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the "overlapping `-sf` handoffs saturate the proxy actor's event loop" line — we never confirmed that as the failure mode. Keep the observable problem: under churn, broadcasts can fire reloads tens of ms apart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 lines → 3 lines. Keeps the surprising bit (rename safety via fd-binds-to-inode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c1f08b0 to
45d6acf
Compare
After redirecting HAProxy's stderr to a file at spawn, `proc.stderr` became None and `_wait_for_hap_availability` fell back to b"" when the process crashed. The resulting RuntimeError message was just "exit code N" with no diagnostic — even though HAProxy had written the real error (bind failure, config parse error, etc.) to the on-disk stderr log file. Read the tail of the per-pid stderr file when proc.stderr is None, so startup-crash errors carry the actual reason again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pid scheme required an awkward open-with-temp-name then os.replace because the child's pid isn't known until create_subprocess_exec returns. Switch to a monotonic spawn counter held on the HAProxyApi instance: - The stderr path is known pre-spawn, so the file opens at its final name and stays there. - No more `stderr.starting.log` intermediate, no os.replace, no rename race window. - The path is attached to the proc as `_stderr_path` and travels with it, so `_wait_for_hap_availability` reads from the right file whether it was called on a fresh spawn or on an old proc during graceful reload. A new info-level log at spawn time emits the (spawn#, pid, stderr path, args) tuple so the pid → stderr-file mapping is recoverable from the actor log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_read_stderr_log_tail` was used in exactly one place and existed only to give a name to "open file, seek to end, read last 4KB." Inline it into `_wait_for_hap_availability` so the entire crash-message logic is visible at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 247b07c. Configure here.
| # Retry to a different slot on safe failures only: connect failed | ||
| # (no bytes sent) or empty response (slot died before sending body). | ||
| option redispatch | ||
| retry-on conn-failure empty-response |
There was a problem hiding this comment.
Retry exhaustion exposes 503 bypassing error normalization to 500
Medium Severity
Adding retry-on conn-failure empty-response changes the failure path: previously a connection failure produced a 502 (caught by errorfile 502 and normalized to 500), but now retry exhaustion produces a 503 which has no matching errorfile 503 directive. This breaks the stated design goal of normalizing proxy errors to 500 and exposes clients to raw 503 responses with HAProxy's built-in error page instead of the expected "Internal Server Error" body. The test in test_http_routes.py was loosened to accept in (500, 503) rather than fixing the gap.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 247b07c. Configure here.
|
latest 3 runs with these changes: number of failed requests < 0.01% |
|
Great work @harshit-anyscale. We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386). I want to make sure we are not trading off stability for speed. |
|
closing this PR to create smaller ones, each for an individual fix, that would help in reviewing and also, we can do partial rollback if necessary. cc: @kouroshHakha smaller PRs: |
…oid pipe-buffer deadlock (#63621) ## Summary Carved out of #63308 so the std-stream redirect fix can land independently of the broadcast-coalescing and redispatch changes in that PR. Today the proxy actor spawns HAProxy with `stdout=PIPE` / `stderr=PIPE` and never reads from them. Under load HAProxy's `Alert()` / `Warning()` calls (replicas flapping, backends going DOWN/UP, config reload events) can emit hundreds of stderr lines per second. The 64KB kernel pipe buffer fills in seconds, HAProxy blocks on its next `write(2)` — including from threads serving the admin socket — and runtime-API calls start timing out. From the outside it looks like a deadlock. This PR redirects both standard streams to files at spawn time: - Open `<socket_path>.<N>.stdout.log` and `<socket_path>.<N>.stderr.log` before `create_subprocess_exec` and pass them as `stdout=...` / `stderr=...`, so the kernel uses `dup2` to wire the child's fds 1 and 2 directly to the files. No pipes, no buffers to fill. - `N` is a monotonic spawn counter on `HAProxyApi`. Using a counter (rather than pid) keeps the path known pre-spawn and lets diagnostics correlate across graceful reloads. - Stash both paths on the `proc` object so the startup-crash error path can tail the files when reporting why HAProxy died. The tail logic is shared between the two streams via a small `_tail_file` helper. Trade-off: the files grow unbounded until the proc exits. Realistic volumes make this fine for now; a periodic size-cap task can be layered on later if needed. ## Design Q&A **Q: Why a separate pair of files per spawn instead of one shared pair?** Per-spawn files let us correlate output with a specific reload, keep the prior worker's logs intact while the new worker runs (workers overlap during graceful reload), and avoid two `dup2`'d fds appending into the same inode and interleaving lines. --------- Signed-off-by: harshit-anyscale <harshit@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re (#63622) ## Summary Carved out of #63308 so the retry/redispatch fix can land independently of the stderr-redirect (#63621) and broadcast-coalescing changes in that PR. Under autoscaling churn, the first connect attempt to a slot can hit a replica that's briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). HAProxy returns the resulting 5xx directly to the client even though a peer slot is up and ready. This PR adds three defaults-level directives so churn-class failures are retried transparently and normalized consistently: - **`option redispatch 1`** — every retry picks a *different* slot, not just the last one (HAProxy's default). With default `retries 3`, a bare `option redispatch` would send `A, A, A, B` against a dead slot; `redispatch 1` makes it `A, B, C, D` so a dead slot doesn't burn N attempts on itself. Previously redispatch was set only on the `*-via-ingress-request-router` backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients. - **`retry-on conn-failure`** — retries only TCP connect failures: nothing was sent to the replica, so the request is safe to replay for any method. We deliberately do **not** retry backend 503s (those are Serve's intentional `max_queued_requests` backpressure / `DeploymentUnavailable` responses — retrying them defeats backpressure and amplifies load), and do **not** add `empty-response` (the request already reached the replica, so replaying it could double-execute a non-idempotent endpoint, and Serve hosts arbitrary user code we can't assume is idempotent). HAProxy's own "no server" 503 isn't matched by `retry-on <status>` regardless. - **`errorfile 503`** — added alongside the existing `errorfile 502` / `errorfile 504`. When `conn-failure` retries are exhausted (all slots unreachable), HAProxy emits its *own* 503; routing it through the 500.http file preserves Serve's contract that upstream-unavailable looks like a 500. (This only rewrites HAProxy-generated 503s, not a backend's deliberate backpressure 503, which still reaches the client as 503.) Streaming responses where a partial body has already been written are explicitly **not** retried — neither directive triggers on `junk-response` / `response-timeout`. ### Changes from the original #63308 form - `retry-on conn-failure empty-response` → `retry-on conn-failure`. Dropped `empty-response`: it replays a request that already reached the replica, which can double-execute a non-idempotent endpoint (Serve runs arbitrary user code, so we can't assume idempotency). `conn-failure` alone covers the churn case (slot briefly down → connect fails) and is safe for any method. - `option redispatch` → `option redispatch 1`. The bare form only redispatches the final retry; `1` redispatches every retry. - Added `errorfile 503` to normalize HAProxy's own no-server / connect-exhaustion 503 to 500. ([Cursor review feedback](#63622 (comment))). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…gle reload (#63623) ## Summary Carved out of #63308 so the coalescing change can land independently of the stderr-redirect (#63621) and redispatch (#63622) changes in that PR. Under autoscaling churn the controller's `target_groups` broadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via `-sf`. Both `target_groups` and `fallback_targets` are emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads. Both broadcasts are already change-gated on the controller (`broadcast_*_if_changed`), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (`fallback_targets` in particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.) This PR collapses adjacent broadcasts into a single reload: - `HAProxyManager` keeps two pieces of state: `_update_pending: bool` and `_coalesce_task: Optional[asyncio.Task]`. - Each incoming broadcast sets `_update_pending = True` and arms (or re-arms) a single sleeping coalesce task. - The task sleeps for `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 100 ms; set to 0 to disable) and then runs one `_update_haproxy_backends()` call against whatever the latest state is. - Broadcasts arriving inside the window are absorbed — they flip the dirty flag, but the task is already armed, so no second reload is scheduled. **Failure handling.** If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands. **Knob.** `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to `0` to opt out entirely. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…oid pipe-buffer deadlock (ray-project#63621) ## Summary Carved out of ray-project#63308 so the std-stream redirect fix can land independently of the broadcast-coalescing and redispatch changes in that PR. Today the proxy actor spawns HAProxy with `stdout=PIPE` / `stderr=PIPE` and never reads from them. Under load HAProxy's `Alert()` / `Warning()` calls (replicas flapping, backends going DOWN/UP, config reload events) can emit hundreds of stderr lines per second. The 64KB kernel pipe buffer fills in seconds, HAProxy blocks on its next `write(2)` — including from threads serving the admin socket — and runtime-API calls start timing out. From the outside it looks like a deadlock. This PR redirects both standard streams to files at spawn time: - Open `<socket_path>.<N>.stdout.log` and `<socket_path>.<N>.stderr.log` before `create_subprocess_exec` and pass them as `stdout=...` / `stderr=...`, so the kernel uses `dup2` to wire the child's fds 1 and 2 directly to the files. No pipes, no buffers to fill. - `N` is a monotonic spawn counter on `HAProxyApi`. Using a counter (rather than pid) keeps the path known pre-spawn and lets diagnostics correlate across graceful reloads. - Stash both paths on the `proc` object so the startup-crash error path can tail the files when reporting why HAProxy died. The tail logic is shared between the two streams via a small `_tail_file` helper. Trade-off: the files grow unbounded until the proc exits. Realistic volumes make this fine for now; a periodic size-cap task can be layered on later if needed. ## Design Q&A **Q: Why a separate pair of files per spawn instead of one shared pair?** Per-spawn files let us correlate output with a specific reload, keep the prior worker's logs intact while the new worker runs (workers overlap during graceful reload), and avoid two `dup2`'d fds appending into the same inode and interleaving lines. --------- Signed-off-by: harshit-anyscale <harshit@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re (ray-project#63622) ## Summary Carved out of ray-project#63308 so the retry/redispatch fix can land independently of the stderr-redirect (ray-project#63621) and broadcast-coalescing changes in that PR. Under autoscaling churn, the first connect attempt to a slot can hit a replica that's briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). HAProxy returns the resulting 5xx directly to the client even though a peer slot is up and ready. This PR adds three defaults-level directives so churn-class failures are retried transparently and normalized consistently: - **`option redispatch 1`** — every retry picks a *different* slot, not just the last one (HAProxy's default). With default `retries 3`, a bare `option redispatch` would send `A, A, A, B` against a dead slot; `redispatch 1` makes it `A, B, C, D` so a dead slot doesn't burn N attempts on itself. Previously redispatch was set only on the `*-via-ingress-request-router` backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients. - **`retry-on conn-failure`** — retries only TCP connect failures: nothing was sent to the replica, so the request is safe to replay for any method. We deliberately do **not** retry backend 503s (those are Serve's intentional `max_queued_requests` backpressure / `DeploymentUnavailable` responses — retrying them defeats backpressure and amplifies load), and do **not** add `empty-response` (the request already reached the replica, so replaying it could double-execute a non-idempotent endpoint, and Serve hosts arbitrary user code we can't assume is idempotent). HAProxy's own "no server" 503 isn't matched by `retry-on <status>` regardless. - **`errorfile 503`** — added alongside the existing `errorfile 502` / `errorfile 504`. When `conn-failure` retries are exhausted (all slots unreachable), HAProxy emits its *own* 503; routing it through the 500.http file preserves Serve's contract that upstream-unavailable looks like a 500. (This only rewrites HAProxy-generated 503s, not a backend's deliberate backpressure 503, which still reaches the client as 503.) Streaming responses where a partial body has already been written are explicitly **not** retried — neither directive triggers on `junk-response` / `response-timeout`. ### Changes from the original ray-project#63308 form - `retry-on conn-failure empty-response` → `retry-on conn-failure`. Dropped `empty-response`: it replays a request that already reached the replica, which can double-execute a non-idempotent endpoint (Serve runs arbitrary user code, so we can't assume idempotency). `conn-failure` alone covers the churn case (slot briefly down → connect fails) and is safe for any method. - `option redispatch` → `option redispatch 1`. The bare form only redispatches the final retry; `1` redispatches every retry. - Added `errorfile 503` to normalize HAProxy's own no-server / connect-exhaustion 503 to 500. ([Cursor review feedback](ray-project#63622 (comment))). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…gle reload (ray-project#63623) ## Summary Carved out of ray-project#63308 so the coalescing change can land independently of the stderr-redirect (ray-project#63621) and redispatch (ray-project#63622) changes in that PR. Under autoscaling churn the controller's `target_groups` broadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via `-sf`. Both `target_groups` and `fallback_targets` are emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads. Both broadcasts are already change-gated on the controller (`broadcast_*_if_changed`), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (`fallback_targets` in particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.) This PR collapses adjacent broadcasts into a single reload: - `HAProxyManager` keeps two pieces of state: `_update_pending: bool` and `_coalesce_task: Optional[asyncio.Task]`. - Each incoming broadcast sets `_update_pending = True` and arms (or re-arms) a single sleeping coalesce task. - The task sleeps for `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 100 ms; set to 0 to disable) and then runs one `_update_haproxy_backends()` call against whatever the latest state is. - Broadcasts arriving inside the window are absorbed — they flip the dirty flag, but the task is already armed, so no second reload is scheduled. **Failure handling.** If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands. **Knob.** `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to `0` to opt out entirely. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…oid pipe-buffer deadlock (ray-project#63621) ## Summary Carved out of ray-project#63308 so the std-stream redirect fix can land independently of the broadcast-coalescing and redispatch changes in that PR. Today the proxy actor spawns HAProxy with `stdout=PIPE` / `stderr=PIPE` and never reads from them. Under load HAProxy's `Alert()` / `Warning()` calls (replicas flapping, backends going DOWN/UP, config reload events) can emit hundreds of stderr lines per second. The 64KB kernel pipe buffer fills in seconds, HAProxy blocks on its next `write(2)` — including from threads serving the admin socket — and runtime-API calls start timing out. From the outside it looks like a deadlock. This PR redirects both standard streams to files at spawn time: - Open `<socket_path>.<N>.stdout.log` and `<socket_path>.<N>.stderr.log` before `create_subprocess_exec` and pass them as `stdout=...` / `stderr=...`, so the kernel uses `dup2` to wire the child's fds 1 and 2 directly to the files. No pipes, no buffers to fill. - `N` is a monotonic spawn counter on `HAProxyApi`. Using a counter (rather than pid) keeps the path known pre-spawn and lets diagnostics correlate across graceful reloads. - Stash both paths on the `proc` object so the startup-crash error path can tail the files when reporting why HAProxy died. The tail logic is shared between the two streams via a small `_tail_file` helper. Trade-off: the files grow unbounded until the proc exits. Realistic volumes make this fine for now; a periodic size-cap task can be layered on later if needed. ## Design Q&A **Q: Why a separate pair of files per spawn instead of one shared pair?** Per-spawn files let us correlate output with a specific reload, keep the prior worker's logs intact while the new worker runs (workers overlap during graceful reload), and avoid two `dup2`'d fds appending into the same inode and interleaving lines. --------- Signed-off-by: harshit-anyscale <harshit@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re (ray-project#63622) ## Summary Carved out of ray-project#63308 so the retry/redispatch fix can land independently of the stderr-redirect (ray-project#63621) and broadcast-coalescing changes in that PR. Under autoscaling churn, the first connect attempt to a slot can hit a replica that's briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). HAProxy returns the resulting 5xx directly to the client even though a peer slot is up and ready. This PR adds three defaults-level directives so churn-class failures are retried transparently and normalized consistently: - **`option redispatch 1`** — every retry picks a *different* slot, not just the last one (HAProxy's default). With default `retries 3`, a bare `option redispatch` would send `A, A, A, B` against a dead slot; `redispatch 1` makes it `A, B, C, D` so a dead slot doesn't burn N attempts on itself. Previously redispatch was set only on the `*-via-ingress-request-router` backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients. - **`retry-on conn-failure`** — retries only TCP connect failures: nothing was sent to the replica, so the request is safe to replay for any method. We deliberately do **not** retry backend 503s (those are Serve's intentional `max_queued_requests` backpressure / `DeploymentUnavailable` responses — retrying them defeats backpressure and amplifies load), and do **not** add `empty-response` (the request already reached the replica, so replaying it could double-execute a non-idempotent endpoint, and Serve hosts arbitrary user code we can't assume is idempotent). HAProxy's own "no server" 503 isn't matched by `retry-on <status>` regardless. - **`errorfile 503`** — added alongside the existing `errorfile 502` / `errorfile 504`. When `conn-failure` retries are exhausted (all slots unreachable), HAProxy emits its *own* 503; routing it through the 500.http file preserves Serve's contract that upstream-unavailable looks like a 500. (This only rewrites HAProxy-generated 503s, not a backend's deliberate backpressure 503, which still reaches the client as 503.) Streaming responses where a partial body has already been written are explicitly **not** retried — neither directive triggers on `junk-response` / `response-timeout`. ### Changes from the original ray-project#63308 form - `retry-on conn-failure empty-response` → `retry-on conn-failure`. Dropped `empty-response`: it replays a request that already reached the replica, which can double-execute a non-idempotent endpoint (Serve runs arbitrary user code, so we can't assume idempotency). `conn-failure` alone covers the churn case (slot briefly down → connect fails) and is safe for any method. - `option redispatch` → `option redispatch 1`. The bare form only redispatches the final retry; `1` redispatches every retry. - Added `errorfile 503` to normalize HAProxy's own no-server / connect-exhaustion 503 to 500. ([Cursor review feedback](ray-project#63622 (comment))). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…gle reload (ray-project#63623) ## Summary Carved out of ray-project#63308 so the coalescing change can land independently of the stderr-redirect (ray-project#63621) and redispatch (ray-project#63622) changes in that PR. Under autoscaling churn the controller's `target_groups` broadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via `-sf`. Both `target_groups` and `fallback_targets` are emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads. Both broadcasts are already change-gated on the controller (`broadcast_*_if_changed`), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (`fallback_targets` in particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.) This PR collapses adjacent broadcasts into a single reload: - `HAProxyManager` keeps two pieces of state: `_update_pending: bool` and `_coalesce_task: Optional[asyncio.Task]`. - Each incoming broadcast sets `_update_pending = True` and arms (or re-arms) a single sleeping coalesce task. - The task sleeps for `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 100 ms; set to 0 to disable) and then runs one `_update_haproxy_backends()` call against whatever the latest state is. - Broadcasts arriving inside the window are absorbed — they flip the dirty flag, but the task is already armed, so no second reload is scheduled. **Failure handling.** If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands. **Knob.** `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to `0` to opt out entirely. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>


Summary
Three targeted HAProxy fixes addressing the highest-impact failure modes observed under production load. Each is independently motivated.
1. Redirect HAProxy stderr to a file (avoids the pipe-buffer deadlock)
The problem. The proxy actor spawns HAProxy via
asyncio.create_subprocess_exec(..., stderr=PIPE). That creates an OS pipe with a fixed 64 KB kernel buffer:HAProxy's config uses
log local0 debugplusoption httplog— high-severity logging on every HTTP request and every health-check transition. In environments where syslog is unreachable (containers without/dev/log, no listener on127.0.0.1:syslog_port), HAProxy falls back to writing those log lines to stderr. At sustained load, this is thousands of lines per second.If the proxy actor never reads from
proc.stderr, the 64 KB buffer fills in seconds. Once full, HAProxy'swrite(2)on stderr blocks — including writes from threads serving the admin socket. From the outside, runtime-API commands appear to hang, healthz probes time out, and the actor cascades into reload/restart loops. This was the root cause behind the "60-second admin-socket deadlock" we kept hitting under load.The fix. Instead of piping stderr back to Python, redirect HAProxy's stderr fd directly to a file at spawn time:
The kernel uses
dup2()at fork to wire the child's fd 2 to the file. No pipe is created, so the 64 KB buffer that previously deadlocked admin-socket threads doesn't exist —write(2)to a regular file doesn't have the "consumer hasn't read me, block now" semantic that pipes do. The rename after spawn is safe because Linux file descriptors bind to inodes, not paths, so the child keeps writing to the same inode under the new pid-suffixed name.Once the pipe was eliminated, the admin-socket-deadlock cascades disappeared.
2. Coalesce controller broadcasts into a single reload
Under autoscaling churn the Serve controller fires
target_groupsandfallback_targetsbroadcasts independently, often only tens of ms apart. Without coalescing, each broadcast triggers its own config regeneration and graceful reload via-sf.HAProxyManagernow marks state dirty on each broadcast and arms a single sleeping coalesce task; updates arriving during the sleep window are absorbed into the same pending apply. Window defaults to 100 ms viaRAY_SERVE_HAPROXY_BROADCAST_COALESCE_S; set to0to disable. If an apply fails, the task re-arms and retries on the next tick (up to 3 consecutive failures, then waits for a new broadcast — avoids busy-spinning on a persistent error).3.
option redispatch+retry-on conn-failure empty-responsein defaultsUnder churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. The two directives — added to the
defaultsblock so every backend inherits them — make the retry transparent:option redispatchsends the retry to a different slot, not the same one.retry-on conn-failure empty-responserestricts retries to cases that cannot have leaked partial bytes to the client: TCP connect failure (no bytes sent) and empty response (slot died before sending a body). Streaming responses with a partial body are untouched.option redispatchwas previously set only on the*-via-ingress-request-routerbackend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients.Out of scope
The bigger PR (#63159) includes additional architectural changes (server-template slot pools, runtime-API server updates, plus other refinements). This PR is deliberately the minimal high-impact subset for an easier review and lower-risk merge.
Other knobs explored during the investigation (
option abortoncloseremoval,timeout server/timeout connectdefaults,hard-stop-afterbump) are intentionally not part of this PR — those are better addressed client-side for the load-test workload rather than as new HAProxy defaults.🤖 Generated with Claude Code