[5/N] [HAProxy stability] - Quarantine recently-released replica ports to close 404 routing race#63628
Merged
eicherseiji merged 2 commits intoJun 3, 2026
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces a port quarantine mechanism to Ray Serve, preventing race conditions where released ports are reassigned before downstream proxies update their routing tables. The implementation adds a configurable quarantine period, logic to hold and release ports, and corresponding unit tests. Review feedback highlights a potential port collision bug during recovery, recommends using time.monotonic() for reliable timing, and suggests using a safer environment variable parsing utility.
16dc215 to
1f9a451
Compare
akyang-anyscale
approved these changes
May 27, 2026
When a replica stops and its port is released, today the port goes straight back into the available pool — and the next replica to be spawned can grab it. If that happens before HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at host:port for the OLD deployment, but a NEW deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return 404. We observed this extensively in load tests. Close the race at its source: hold released ports in a quarantine set for RAY_SERVE_PORT_QUARANTINE_S (default 10s) before they re-enter the available pool. Quarantine is drained lazily on each allocate(), so there's no background timer. `block_port=True` (permanent block) bypasses quarantine because it's a strictly stronger guarantee. 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load. For deployments with long graceful_shutdown windows, the drain itself already provides most of the buffer, and the quarantine only matters for crash/force-kill recoveries. Tunable via env var. Tests: existing port-reuse tests are unaffected because the autouse fixture sets quarantine to 0; four new tests cover the quarantine behavior (held, expires, bypassed by block_port, disabled at 0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1f9a451 to
121afff
Compare
eicherseiji
reviewed
Jun 1, 2026
eicherseiji
approved these changes
Jun 1, 2026
update_port_if_missing recovers a port into _allocated_ports without popping it off _available_ports. If that replica is then released into quarantine, the port ends up in both the heap and _quarantined_ports, so allocate() could hand out a port that is still quarantined — reopening the 404 reuse race this feature prevents. Add `port not in self._quarantined_ports` to the allocate guard so a quarantined port in the heap is skipped until _drain_expired_quarantine returns it to the pool. (The duplicate-in-heap case from a recovered port is already covered by the _allocated_ports guard.) Addresses review feedback on ray-project#63628 (eicherseiji). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
rueian
pushed a commit
to rueian/ray
that referenced
this pull request
Jun 4, 2026
…s to close 404 routing race (ray-project#63628) ### Summary Carved out of ray-project#63507 to land independently. When a replica stops and its port is released, the port goes straight back into the available pool today. The next replica to be spawned can grab it. If that happens **before** HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at `host:port` for the **old** deployment, but a **new** deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return **404**. We observed this extensively in load tests with autoscaling churn. The fix closes the race at its source by holding released ports in a quarantine set on `NodePortManager` for a configurable window before they re-enter the available pool. - `RAY_SERVE_PORT_QUARANTINE_S` (env var, float seconds, default `10`). 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load. - Quarantine is drained **lazily** on each `allocate()` call — no background timer. - `block_port=True` (permanent block) bypasses quarantine because it's a strictly stronger guarantee. Why 10s and not larger: for deployments with long `graceful_shutdown` windows, the drain itself already provides most of the buffer, and the quarantine only really matters for crash/force-kill recoveries. High-churn environments can bump the env var; small clusters where port-pool pressure matters can lower it or set to `0` to disable entirely. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
limarkdcunha
pushed a commit
to limarkdcunha/ray
that referenced
this pull request
Jun 30, 2026
…s to close 404 routing race (ray-project#63628) ### Summary Carved out of ray-project#63507 to land independently. When a replica stops and its port is released, the port goes straight back into the available pool today. The next replica to be spawned can grab it. If that happens **before** HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at `host:port` for the **old** deployment, but a **new** deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return **404**. We observed this extensively in load tests with autoscaling churn. The fix closes the race at its source by holding released ports in a quarantine set on `NodePortManager` for a configurable window before they re-enter the available pool. - `RAY_SERVE_PORT_QUARANTINE_S` (env var, float seconds, default `10`). 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load. - Quarantine is drained **lazily** on each `allocate()` call — no background timer. - `block_port=True` (permanent block) bypasses quarantine because it's a strictly stronger guarantee. Why 10s and not larger: for deployments with long `graceful_shutdown` windows, the drain itself already provides most of the buffer, and the quarantine only really matters for crash/force-kill recoveries. High-churn environments can bump the env var; small clusters where port-pool pressure matters can lower it or set to `0` to disable entirely. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Carved out of #63507 to land independently.
When a replica stops and its port is released, the port goes straight back into the available pool today. The next replica to be spawned can grab it. If that happens before HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at
host:portfor the old deployment, but a new deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return 404. We observed this extensively in load tests with autoscaling churn.The fix closes the race at its source by holding released ports in a quarantine set on
NodePortManagerfor a configurable window before they re-enter the available pool.RAY_SERVE_PORT_QUARANTINE_S(env var, float seconds, default10). 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load.allocate()call — no background timer.block_port=True(permanent block) bypasses quarantine because it's a strictly stronger guarantee.Why 10s and not larger: for deployments with long
graceful_shutdownwindows, the drain itself already provides most of the buffer, and the quarantine only really matters for crash/force-kill recoveries. High-churn environments can bump the env var; small clusters where port-pool pressure matters can lower it or set to0to disable entirely.