Skip to content

[Serve] Avoid proxy readiness future timeout race#62194

Merged
abrarsheikh merged 3 commits into
ray-project:masterfrom
Ziy1-Tan:fix/issue-60651-proxy-readiness-race
Apr 12, 2026
Merged

[Serve] Avoid proxy readiness future timeout race#62194
abrarsheikh merged 3 commits into
ray-project:masterfrom
Ziy1-Tan:fix/issue-60651-proxy-readiness-race

Conversation

@Ziy1-Tan

Copy link
Copy Markdown
Contributor

Summary

This fixes the race in wrap_as_future() used by Serve proxy readiness/health/drain checks.

The old implementation combined:

  • asyncio.wrap_future(ref.future()), which relies on asyncio's internal _chain_future
  • a timeout callback that directly called set_exception() on that same wrapped future

If the Ray future completed just after the timeout path won, asyncio's _copy_future_state could still run and trip assert not dest.done().

This change keeps the timeout on an outer asyncio future and propagates the concurrent.futures.Future state into that outer future on the event loop. That removes the double-writer race while preserving the existing success/timeout semantics.

Testing

  • PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python python -m pytest /home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python/ray/serve/tests/test_proxy_actor_wrapper.py /home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python/ray/serve/tests/unit/test_proxy_state.py -q
  • uvx pre-commit run --files python/ray/serve/_private/proxy_state.py python/ray/serve/tests/test_proxy_actor_wrapper.py

Closes #60651.

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
@Ziy1-Tan Ziy1-Tan requested a review from a team as a code owner March 30, 2026 16:08

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the wrap_as_future function in proxy_state.py to improve timeout handling. By introducing an intermediate asyncio.Future and a propagation helper, the implementation ensures that late completion of the source future does not conflict with a previously triggered timeout. A corresponding test case has been added to verify this behavior. The review feedback suggests simplifying the _propagate_concurrent_future_state logic by centralizing the check for the destination future's completion status.

Comment on lines +993 to +1007
def _propagate_concurrent_future_state(
source_fut: concurrent.futures.Future, dest_fut: asyncio.Future
):
if source_fut.cancelled():
if not dest_fut.done():
dest_fut.cancel()
return

exception = source_fut.exception()
if exception is not None:
_try_set_exception(dest_fut, exception)
return

if not dest_fut.done():
dest_fut.set_result(source_fut.result())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic in _propagate_concurrent_future_state can be simplified by checking if dest_fut is already done at the beginning of the function. This avoids redundant checks and makes the control flow more linear, improving maintainability.

Suggested change
def _propagate_concurrent_future_state(
source_fut: concurrent.futures.Future, dest_fut: asyncio.Future
):
if source_fut.cancelled():
if not dest_fut.done():
dest_fut.cancel()
return
exception = source_fut.exception()
if exception is not None:
_try_set_exception(dest_fut, exception)
return
if not dest_fut.done():
dest_fut.set_result(source_fut.result())
def _propagate_concurrent_future_state(
source_fut: concurrent.futures.Future, dest_fut: asyncio.Future
):
if dest_fut.done():
return
if source_fut.cancelled():
dest_fut.cancel()
return
exception = source_fut.exception()
if exception is not None:
dest_fut.set_exception(exception)
return
dest_fut.set_result(source_fut.result())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. better for readability

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Mar 30, 2026

@abrarsheikh abrarsheikh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change looks good to me, thanks for the explanation. lets incorporate bots comments

Comment on lines +993 to +1007
def _propagate_concurrent_future_state(
source_fut: concurrent.futures.Future, dest_fut: asyncio.Future
):
if source_fut.cancelled():
if not dest_fut.done():
dest_fut.cancel()
return

exception = source_fut.exception()
if exception is not None:
_try_set_exception(dest_fut, exception)
return

if not dest_fut.done():
dest_fut.set_result(source_fut.result())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. better for readability

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Apr 11, 2026
@Ziy1-Tan Ziy1-Tan requested a review from abrarsheikh April 12, 2026 02:35
@abrarsheikh abrarsheikh merged commit 42ae181 into ray-project:master Apr 12, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
## Summary

This fixes the race in `wrap_as_future()` used by Serve proxy
readiness/health/drain checks.

The old implementation combined:
- `asyncio.wrap_future(ref.future())`, which relies on asyncio's
internal `_chain_future`
- a timeout callback that directly called `set_exception()` on that same
wrapped future

If the Ray future completed just after the timeout path won, asyncio's
`_copy_future_state` could still run and trip `assert not dest.done()`.

This change keeps the timeout on an outer asyncio future and propagates
the `concurrent.futures.Future` state into that outer future on the
event loop. That removes the double-writer race while preserving the
existing success/timeout semantics.

## Testing

-
`PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python
python -m pytest
/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python/ray/serve/tests/test_proxy_actor_wrapper.py
/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python/ray/serve/tests/unit/test_proxy_state.py
-q`
- `uvx pre-commit run --files python/ray/serve/_private/proxy_state.py
python/ray/serve/tests/test_proxy_actor_wrapper.py`

Closes ray-project#60651.

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Co-authored-by: abrar <abrar@anyscale.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
## Summary

This fixes the race in `wrap_as_future()` used by Serve proxy
readiness/health/drain checks.

The old implementation combined:
- `asyncio.wrap_future(ref.future())`, which relies on asyncio's
internal `_chain_future`
- a timeout callback that directly called `set_exception()` on that same
wrapped future

If the Ray future completed just after the timeout path won, asyncio's
`_copy_future_state` could still run and trip `assert not dest.done()`.

This change keeps the timeout on an outer asyncio future and propagates
the `concurrent.futures.Future` state into that outer future on the
event loop. That removes the double-writer race while preserving the
existing success/timeout semantics.

## Testing

-
`PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python
python -m pytest
/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python/ray/serve/tests/test_proxy_actor_wrapper.py
/home/simple/github/ray/.worktrees/fix-issue-60651-proxy-readiness-race/python/ray/serve/tests/unit/test_proxy_state.py
-q`
- `uvx pre-commit run --files python/ray/serve/_private/proxy_state.py
python/ray/serve/tests/test_proxy_actor_wrapper.py`

Closes ray-project#60651.

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Co-authored-by: abrar <abrar@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

2 participants