[5/N] [HAProxy stability] - Quarantine recently-released replica ports to close 404 routing race #63628

Merged
eicherseiji merged 2 commits into ray-project:master from harshit-anyscale:serve-port-quarantine on Jun 3, 2026

Conversation

@harshit-anyscale harshit-anyscale commented May 25, 2026

Contributor

Summary

Carved out of #63507 to land independently.

When a replica stops and its port is released, the port goes straight back into the available pool today. The next replica to be spawned can grab it. If that happens before HAProxy (or any proxy that mirrors deployment state) has propagated the previous replica's slot removal, the proxy still has a slot pointing at host:port for the old deployment, but a new deployment's replica is now serving that port. Requests routed via the stale slot hit the new replica, which doesn't serve that route, and return 404. We observed this extensively in load tests with autoscaling churn.

The fix closes the race at its source by holding released ports in a quarantine set on NodePortManager for a configurable window before they re-enter the available pool.

  • RAY_SERVE_PORT_QUARANTINE_S (env var, float seconds, default 10). 10s comfortably covers HAProxy's reconfigure latency (broadcast coalescing + reload) under load.
  • Quarantine is drained lazily on each allocate() call — no background timer.
  • block_port=True (permanent block) bypasses quarantine because it's a strictly stronger guarantee.
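
The three points above can be sketched as a small, self-contained pool. This is a minimal illustration, not the actual `NodePortManager` code: the class name `QuarantiningPortPool`, the `clock` parameter, and the placement of `block_port` on `release()` are assumptions made for the example.

```python
import heapq
import time
from typing import Callable, Dict, Iterable, List, Set


class QuarantiningPortPool:
    """Hypothetical stand-in for a port manager; names are illustrative only."""

    def __init__(
        self,
        ports: Iterable[int],
        quarantine_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._available: List[int] = list(ports)
        heapq.heapify(self._available)
        self._allocated: Set[int] = set()
        self._blocked: Set[int] = set()
        # port -> time at which it may re-enter the available pool
        self._quarantined: Dict[int, float] = {}
        self._quarantine_s = quarantine_s
        self._clock = clock

    def _drain_expired_quarantine(self) -> None:
        # Lazy drain: runs only when someone asks for a port, so no timer thread.
        now = self._clock()
        expired = [p for p, ready_at in self._quarantined.items() if ready_at <= now]
        for port in expired:
            del self._quarantined[port]
            heapq.heappush(self._available, port)

    def allocate(self) -> int:
        self._drain_expired_quarantine()
        if not self._available:
            raise RuntimeError("no ports available")
        port = heapq.heappop(self._available)
        self._allocated.add(port)
        return port

    def release(self, port: int, block_port: bool = False) -> None:
        self._allocated.discard(port)
        if block_port:
            # Permanent block: the port never returns, so quarantine is moot.
            self._blocked.add(port)
        elif self._quarantine_s > 0:
            self._quarantined[port] = self._clock() + self._quarantine_s
        else:
            # Quarantine disabled: the port is reusable immediately.
            heapq.heappush(self._available, port)
```

Passing a fake clock (for example `clock=lambda: now[0]`) makes the window testable without sleeping: a released port stays unavailable until the fake time reaches the release time plus `quarantine_s`.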

Why 10s and not larger: for deployments with long graceful_shutdown windows, the drain itself already provides most of the buffer, so the quarantine mainly matters for crash and force-kill recoveries. High-churn environments can raise the env var; small clusters where port-pool pressure matters can lower it, or set it to 0 to disable quarantine entirely.
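
As a rough illustration of the tuning knob, the sketch below reads `RAY_SERVE_PORT_QUARANTINE_S` as float seconds with a default of 10. The helper name `read_quarantine_seconds` and its fallback behavior for malformed values are assumptions for this example; Ray Serve's own parsing may differ.

```python
import math
import os


def read_quarantine_seconds(default: float = 10.0) -> float:
    """Hypothetical helper: parse the quarantine window from the environment."""
    raw = os.environ.get("RAY_SERVE_PORT_QUARANTINE_S")
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        # Assumption: a malformed value falls back to the default.
        return default
    if math.isnan(value) or value < 0:
        # Assumption: negative or NaN windows are treated as invalid.
        return default
    return value  # 0 disables quarantine entirely
```

With this reading, `RAY_SERVE_PORT_QUARANTINE_S=0` turns the feature off and a larger value such as `30` widens the window for high-churn clusters.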

@harshit-anyscale harshit-anyscale requested a review from a team as a code owner May 25, 2026 14:34

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a port quarantine mechanism to Ray Serve, preventing race conditions where released ports are reassigned before downstream proxies update their routing tables. The implementation adds a configurable quarantine period, logic to hold and release ports, and corresponding unit tests. Review feedback highlights a potential port collision bug during recovery, recommends using time.monotonic() for reliable timing, and suggests using a safer environment variable parsing utility.

Comment thread python/ray/serve/_private/node_port_manager.py Outdated
Comment thread python/ray/serve/_private/constants.py Outdated
Comment thread python/ray/serve/_private/node_port_manager.py Outdated
Comment thread python/ray/serve/_private/node_port_manager.py Outdated
@harshit-anyscale harshit-anyscale changed the title [serve] Quarantine recently-released replica ports to close 404 routing race May 25, 2026
@harshit-anyscale harshit-anyscale self-assigned this May 25, 2026
@harshit-anyscale harshit-anyscale added the serve (Ray Serve Related Issue) and go (add ONLY when ready to merge, run all tests) labels May 25, 2026
@harshit-anyscale harshit-anyscale force-pushed the serve-port-quarantine branch 2 times, most recently from 16dc215 to 1f9a451 Compare May 25, 2026 14:46
Comment thread python/ray/serve/_private/node_port_manager.py
When a replica stops and its port is released, today the port goes
straight back into the available pool — and the next replica to be
spawned can grab it. If that happens before HAProxy (or any proxy
that mirrors deployment state) has propagated the previous replica's
slot removal, the proxy still has a slot pointing at host:port for
the OLD deployment, but a NEW deployment's replica is now serving
that port. Requests routed via the stale slot hit the new replica,
which doesn't serve that route, and return 404. We observed this
extensively in load tests.

Close the race at its source: hold released ports in a quarantine
set for RAY_SERVE_PORT_QUARANTINE_S (default 10s) before they re-enter
the available pool. Quarantine is drained lazily on each allocate(),
so there's no background timer. `block_port=True` (permanent block)
bypasses quarantine because it's a strictly stronger guarantee.

10s comfortably covers HAProxy's reconfigure latency (broadcast
coalescing + reload) under load. For deployments with long
graceful_shutdown windows, the drain itself already provides most of
the buffer, and the quarantine only matters for crash/force-kill
recoveries. Tunable via env var.

Tests: existing port-reuse tests are unaffected because the autouse
fixture sets quarantine to 0; four new tests cover the quarantine
behavior (held, expires, bypassed by block_port, disabled at 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale force-pushed the serve-port-quarantine branch from 1f9a451 to 121afff Compare May 29, 2026 08:38
Comment thread python/ray/serve/_private/node_port_manager.py Outdated
@eicherseiji eicherseiji self-requested a review June 2, 2026 18:51
update_port_if_missing recovers a port into _allocated_ports without
popping it off _available_ports. If that replica is then released into
quarantine, the port ends up in both the heap and _quarantined_ports,
so allocate() could hand out a port that is still quarantined —
reopening the 404 reuse race this feature prevents.

Add `port not in self._quarantined_ports` to the allocate guard so a
quarantined port in the heap is skipped until _drain_expired_quarantine
returns it to the pool. (The duplicate-in-heap case from a recovered
port is already covered by the _allocated_ports guard.)
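
The guard described in this commit can be sketched in isolation. This is an illustrative helper under assumed data structures (a min-heap of available ports, a set of allocated ports, and a dict of quarantined ports); the function name `pop_usable_port` is hypothetical and the real `allocate()` may be structured differently.

```python
import heapq
from typing import Dict, List, Optional, Set


def pop_usable_port(
    available: List[int],
    allocated: Set[int],
    quarantined: Dict[int, float],
) -> Optional[int]:
    """Pop heap entries until one is neither allocated nor quarantined."""
    while available:
        port = heapq.heappop(available)
        if port not in allocated and port not in quarantined:
            allocated.add(port)
            return port
        # Stale heap entry: skip it. A quarantined port is assumed to be
        # pushed back onto the heap by the drain step once its window expires.
    return None


# A recovered port (8000) was never popped off the heap, then was released
# into quarantine, so it sits in both structures at once.
available_ports = [8000, 8001]
heapq.heapify(available_ports)
allocated_ports: Set[int] = set()
quarantined_ports = {8000: 12.5}

chosen = pop_usable_port(available_ports, allocated_ports, quarantined_ports)
```

Here `chosen` is 8001: the still-quarantined 8000 is skipped even though it is the smallest entry on the heap.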

Addresses review feedback on ray-project#63628 (eicherseiji).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@eicherseiji eicherseiji left a comment

Contributor

🛳️

@eicherseiji eicherseiji enabled auto-merge (squash) June 3, 2026 06:16
@eicherseiji eicherseiji merged commit a1b7130 into ray-project:master Jun 3, 2026
7 checks passed
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…s to close 404 routing race (ray-project#63628)

limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…s to close 404 routing race (ray-project#63628)


Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)

3 participants