[5/N] [HAProxy stability] - Quarantine recently-released replica ports to close 404 routing race by harshit-anyscale · Pull Request #63628 · ray-project/ray

harshit-anyscale · 2026-05-25T14:34:40Z

Summary

Carved out of #63507 to land independently.

When a replica stops and its port is released, the port goes straight back into the available pool today. The next replica to be spawned can grab it. If that happens before HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at host:port for the old deployment, but a new deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return 404. We observed this extensively in load tests with autoscaling churn.

The fix closes the race at its source by holding released ports in a quarantine set on NodePortManager for a configurable window before they re-enter the available pool.

- `RAY_SERVE_PORT_QUARANTINE_S` (env var, float seconds, default `10`). 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load.
- Quarantine is drained lazily on each `allocate()` call; there is no background timer.
- `block_port=True` (permanent block) bypasses quarantine because it's a strictly stronger guarantee.

Why 10s and not larger: for deployments with long graceful_shutdown windows, the drain itself already provides most of the buffer, and the quarantine only really matters for crash/force-kill recoveries. High-churn environments can bump the env var; small clusters where port-pool pressure matters can lower it or set to 0 to disable entirely.
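The mechanism described in the summary can be sketched as below. This is a minimal illustration, not the code in `node_port_manager.py`: the class name `QuarantiningPortPool`, its attributes, and the injectable `clock` parameter are assumptions made for the example. Only the behavior (hold released ports for a window, drain lazily on `allocate()`, skip quarantine for `block_port=True`, disable at `0`) comes from the PR description.

```python
import heapq
import time
from typing import Callable, Dict, Iterable, List, Set


class QuarantiningPortPool:
    """Hypothetical stand-in for a per-node port pool (not Ray's NodePortManager)."""

    def __init__(
        self,
        ports: Iterable[int],
        quarantine_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._available: List[int] = list(ports)
        heapq.heapify(self._available)
        self._allocated: Set[int] = set()
        self._blocked: Set[int] = set()
        # port -> clock value at which the port may re-enter the pool
        self._quarantined: Dict[int, float] = {}
        self._quarantine_s = quarantine_s
        self._clock = clock

    def _drain_expired_quarantine(self) -> None:
        # Lazy drain: only runs when someone asks for a port.
        now = self._clock()
        expired = [p for p, ready_at in self._quarantined.items() if ready_at <= now]
        for port in expired:
            del self._quarantined[port]
            heapq.heappush(self._available, port)

    def allocate(self) -> int:
        self._drain_expired_quarantine()
        if not self._available:
            raise RuntimeError("no ports available")
        port = heapq.heappop(self._available)
        self._allocated.add(port)
        return port

    def release(self, port: int, block_port: bool = False) -> None:
        self._allocated.discard(port)
        if block_port:
            # Permanent block: never reused, so quarantine is unnecessary.
            self._blocked.add(port)
        elif self._quarantine_s <= 0:
            # Quarantine disabled: port is immediately reusable.
            heapq.heappush(self._available, port)
        else:
            self._quarantined[port] = self._clock() + self._quarantine_s
```

With a fake clock, a released port stays out of the pool until the window has elapsed, which is the property that gives the proxy time to drop its stale slot.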

gemini-code-assist

Code Review

This pull request introduces a port quarantine mechanism to Ray Serve, preventing race conditions where released ports are reassigned before downstream proxies update their routing tables. The implementation adds a configurable quarantine period, logic to hold and release ports, and corresponding unit tests. Review feedback highlights a potential port collision bug during recovery, recommends using time.monotonic() for reliable timing, and suggests using a safer environment variable parsing utility.
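Two of the review points (monotonic timing and safer env var parsing) can be illustrated with a short sketch. The helper names `read_quarantine_seconds` and `quarantine_deadline` are hypothetical and are not Ray utilities; the review does not name the parsing utility it recommends, so this only shows the general idea under that assumption.

```python
import math
import os
import time
from typing import Optional


def read_quarantine_seconds(
    name: str = "RAY_SERVE_PORT_QUARANTINE_S", default: float = 10.0
) -> float:
    """Parse a non-negative float from the environment, failing loudly on bad input."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name} must be a number of seconds, got {raw!r}") from None
    if math.isnan(value) or value < 0:
        raise ValueError(f"{name} must be >= 0, got {raw!r}")
    return value


def quarantine_deadline(quarantine_s: float, now: Optional[float] = None) -> float:
    """Compute when a port may be reused.

    time.monotonic() never goes backwards, so a wall-clock adjustment
    (NTP step, manual change) cannot shorten or extend the window.
    """
    if now is None:
        now = time.monotonic()
    return now + quarantine_s
```

A bare `float(os.environ.get(...))` at import time would surface a typo in the variable as an opaque traceback, which is the kind of failure a dedicated parsing helper avoids.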

When a replica stops and its port is released, today the port goes straight back into the available pool, and the next replica to be spawned can grab it. If that happens before HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at host:port for the OLD deployment, but a NEW deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return 404. We observed this extensively in load tests.

Close the race at its source: hold released ports in a quarantine set for RAY_SERVE_PORT_QUARANTINE_S (default 10s) before they re-enter the available pool. Quarantine is drained lazily on each allocate(), so there's no background timer. `block_port=True` (permanent block) bypasses quarantine because it's a strictly stronger guarantee.

10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load. For deployments with long graceful_shutdown windows, the drain itself already provides most of the buffer, and the quarantine only matters for crash/force-kill recoveries. Tunable via env var.

Tests: existing port-reuse tests are unaffected because the autouse fixture sets quarantine to 0; four new tests cover the quarantine behavior (held, expires, bypassed by block_port, disabled at 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

update_port_if_missing recovers a port into _allocated_ports without popping it off _available_ports. If that replica is then released into quarantine, the port ends up in both the heap and _quarantined_ports, so allocate() could hand out a port that is still quarantined, reopening the 404 reuse race this feature prevents.

Add `port not in self._quarantined_ports` to the allocate guard so a quarantined port in the heap is skipped until _drain_expired_quarantine returns it to the pool. (The duplicate-in-heap case from a recovered port is already covered by the _allocated_ports guard.)

Addresses review feedback on ray-project#63628 (eicherseiji).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
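The guard described in this commit message can be sketched as a standalone function. The function name `allocate_port` and its argument names are hypothetical; the real method operates on `self._available_ports`, `self._allocated_ports`, and `self._quarantined_ports`. The sketch also assumes that a stale heap entry can simply be discarded, since the drain step pushes the port back once its window expires.

```python
import heapq
from typing import List, Set


def allocate_port(available: List[int], allocated: Set[int], quarantined: Set[int]) -> int:
    """Pop the smallest port that is neither allocated nor quarantined.

    `available` must already be a heap (see heapq.heapify).
    """
    while available:
        port = heapq.heappop(available)
        # A recovered port may still sit in the heap even though it is
        # allocated or quarantined; skip such stale entries.
        if port not in allocated and port not in quarantined:
            allocated.add(port)
            return port
    raise RuntimeError("no ports available")
```

For example, if port 8000 was recovered without being popped from the heap and then released into quarantine, the heap still lists it, but the guard skips it and hands out 8001 instead.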

eicherseiji

🛳️

…s to close 404 routing race (ray-project#63628)

### Summary

Carved out of ray-project#63507 to land independently.

When a replica stops and its port is released, the port goes straight back into the available pool today. The next replica to be spawned can grab it. If that happens **before** HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at `host:port` for the **old** deployment, but a **new** deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return **404**. We observed this extensively in load tests with autoscaling churn.

The fix closes the race at its source by holding released ports in a quarantine set on `NodePortManager` for a configurable window before they re-enter the available pool.

- `RAY_SERVE_PORT_QUARANTINE_S` (env var, float seconds, default `10`). 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load.
- Quarantine is drained **lazily** on each `allocate()` call; there is no background timer.
- `block_port=True` (permanent block) bypasses quarantine because it's a strictly stronger guarantee.

Why 10s and not larger: for deployments with long `graceful_shutdown` windows, the drain itself already provides most of the buffer, and the quarantine only really matters for crash/force-kill recoveries. High-churn environments can bump the env var; small clusters where port-pool pressure matters can lower it or set to `0` to disable entirely.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

harshit-anyscale requested a review from a team as a code owner May 25, 2026 14:34

gemini-code-assist Bot reviewed May 25, 2026

Comment thread python/ray/serve/_private/node_port_manager.py Outdated

Comment thread python/ray/serve/_private/constants.py Outdated

Comment thread python/ray/serve/_private/node_port_manager.py Outdated

Comment thread python/ray/serve/_private/node_port_manager.py Outdated

harshit-anyscale changed the title ~~[serve] Quarantine recently-released replica ports to close 404 routing race~~ May 25, 2026

harshit-anyscale self-assigned this May 25, 2026

harshit-anyscale added the serve (Ray Serve Related Issue) and go (add ONLY when ready to merge, run all tests) labels May 25, 2026

harshit-anyscale force-pushed the serve-port-quarantine branch 2 times, most recently from 16dc215 to 1f9a451 Compare May 25, 2026 14:46

akyang-anyscale approved these changes May 27, 2026

Comment thread python/ray/serve/_private/node_port_manager.py

harshit-anyscale force-pushed the serve-port-quarantine branch from 1f9a451 to 121afff Compare May 29, 2026 08:38

eicherseiji reviewed Jun 1, 2026

Comment thread python/ray/serve/_private/node_port_manager.py Outdated

eicherseiji approved these changes Jun 1, 2026

eicherseiji self-requested a review June 2, 2026 18:51

eicherseiji approved these changes Jun 3, 2026

eicherseiji enabled auto-merge (squash) June 3, 2026 06:16

eicherseiji merged commit a1b7130 into ray-project:master Jun 3, 2026
7 checks passed

eicherseiji merged 2 commits into ray-project:master from harshit-anyscale:serve-port-quarantine
