[Serve] Avoid self-cause on non-gRPC replica exceptions#62412
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors exception handling within the Replica class to prevent self-referential __cause__ chains, particularly for non-gRPC requests. It introduces a centralized _raise_user_exception helper method and adds unit tests to verify the fix. The feedback points out that the unit tests currently redefine the production logic within a fake class, which risks testing the test code itself rather than the actual implementation; it is recommended to reference the production method directly in the test setup.
| def _raise_user_exception(self, e, request_metadata): | ||
| wrapped_exception = self._maybe_wrap_grpc_exception(e, request_metadata) | ||
| if wrapped_exception is e: | ||
| raise e | ||
| raise wrapped_exception from e |
There was a problem hiding this comment.
The FakeReplica class redefines _raise_user_exception with the exact same logic as the production code in replica.py. This means the test is verifying the logic defined within the test file itself rather than the actual implementation in the codebase. To ensure the production code is correctly verified, you should remove this redundant definition and instead assign the production method to the fake class.
_raise_user_exception = Replica._raise_user_exception| @@ -0,0 +1,108 @@ | |||
| from contextlib import asynccontextmanager, contextmanager | |||
There was a problem hiding this comment.
i suggest dropping this unit test. instead use the repro from here #62358 and add a integration test in test_handle.py
There was a problem hiding this comment.
Ok. Let me take a look.
|
CI is failing. |
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
…hain Drop the FakeReplica unit test in favour of an integration test in test_handle.py that exercises the full replica stack. The integration test spins up a real deployment, registers a CauseWalker logging handler that walks the __cause__ chain (simulating Sentry / Bugsnag), triggers a user-code exception, and asserts no self-referential cycle is detected. Addresses review feedback from abrarsheikh and gemini-code-assist. Fixes ray-project#62358 Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
The CauseWalker logging handler runs inside a Ray Serve replica actor (a separate OS process). The previous implementation captured a plain Python list from the driver scope; Ray serialises it by value (deep copy) when the App class is sent to the actor, so appending to it inside the actor never propagates back to the driver. The assertion therefore always passed, providing zero regression coverage. Fix: replace the driver-side list with a named @ray.remote actor (_CycleDetector) that lives in the Ray cluster. The handler calls det.mark.remote() via Ray RPC, and the driver queries detector.detected.remote() after the request completes. Verified: - Ray 3.0.0.dev0 (fix present): PASS - Ray 2.54.0 (bug present): FAIL (correctly detects self-cause chain) Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
cafa48c to
9c52b0c
Compare
| await handle.fail.remote() | ||
|
|
||
| # Give the replica a moment to process the exception through logging. | ||
| await asyncio.sleep(1) |
There was a problem hiding this comment.
how can we make the test deterministic, relying on sleep can cause the test to flake in CI.
|
|
||
| @pytest.mark.asyncio | ||
| @pytest.mark.timeout(30) | ||
| async def test_non_grpc_exception_no_self_cause(serve_instance): |
There was a problem hiding this comment.
this test is a little hard to follow, is there a more simpler version we can write?
There was a problem hiding this comment.
I can reuse the issue repro as the basis, but I don’t think the script should be copied directly into CI because it relies on sleep and health-check timing.
I’ll either reduce it to a focused unit test around _raise_user_exception, or keep a thin integration test with an explicit synchronization signal instead of timing-based waiting.
There was a problem hiding this comment.
explicit synchronization signal sounds good to me.
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 09866b6. Configure here.
| wrapped_exception = self._maybe_wrap_grpc_exception(e, request_metadata) | ||
| if wrapped_exception is e: | ||
| raise e | ||
| raise wrapped_exception from e |
There was a problem hiding this comment.
Return type annotation hides unconditional raise from callers
Low Severity
_raise_user_exception always raises and never returns, but is annotated -> None instead of -> NoReturn. This means type checkers and readers at the three call sites (e.g., handle_request which expects Tuple[bytes, Any]) cannot tell the function is unconditionally raising. If someone later adds a conditional return path to this helper, callers would silently return None instead of propagating the exception, introducing a subtle bug.
Reviewed by Cursor Bugbot for commit 09866b6. Configure here.
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
…62412) ## Why Fixes ray-project#62358. For non-gRPC Serve requests, `Replica.handle_request*` currently does: ```python raise self._maybe_wrap_grpc_exception(e, request_metadata) from e ``` When `_maybe_wrap_grpc_exception()` returns the original exception unchanged, this degenerates into `raise e from e`, creating a self-referential `__cause__` chain (`e.__cause__ is e`). That breaks error-reporting tools that walk `__cause__` recursively and can also confuse debugging and logging paths for normal HTTP and handle-based requests. ## Root cause This regression came from `ray-project#60482` (`221a19395a`), which unified exception handling across the `handle_request*` paths so gRPC requests could preserve user-set status codes. That behavior is correct when a new gRPC wrapper exception is created. It is wrong for non-gRPC requests, because `_maybe_wrap_grpc_exception()` just returns `e`, so the code effectively executes `raise e from e`. ## Fix Introduce a small `_raise_user_exception()` helper in `replica.py`: - if `_maybe_wrap_grpc_exception()` returns the original exception, raise it directly - if it returns a wrapped gRPC exception, raise the wrapped exception `from e` This preserves the intended gRPC chaining behavior while eliminating the circular `__cause__` chain on non-gRPC paths. ## Testing ```bash source /home/simple/github/ray/.venv/bin/activate PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \ python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v ``` --------- Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
…62412) ## Why Fixes ray-project#62358. For non-gRPC Serve requests, `Replica.handle_request*` currently does: ```python raise self._maybe_wrap_grpc_exception(e, request_metadata) from e ``` When `_maybe_wrap_grpc_exception()` returns the original exception unchanged, this degenerates into `raise e from e`, creating a self-referential `__cause__` chain (`e.__cause__ is e`). That breaks error-reporting tools that walk `__cause__` recursively and can also confuse debugging and logging paths for normal HTTP and handle-based requests. ## Root cause This regression came from `ray-project#60482` (`221a19395a`), which unified exception handling across the `handle_request*` paths so gRPC requests could preserve user-set status codes. That behavior is correct when a new gRPC wrapper exception is created. It is wrong for non-gRPC requests, because `_maybe_wrap_grpc_exception()` just returns `e`, so the code effectively executes `raise e from e`. ## Fix Introduce a small `_raise_user_exception()` helper in `replica.py`: - if `_maybe_wrap_grpc_exception()` returns the original exception, raise it directly - if it returns a wrapped gRPC exception, raise the wrapped exception `from e` This preserves the intended gRPC chaining behavior while eliminating the circular `__cause__` chain on non-gRPC paths. ## Testing ```bash source /home/simple/github/ray/.venv/bin/activate PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \ python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v ``` --------- Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>


Why
Fixes #62358.
For non-gRPC Serve requests,
Replica.handle_request*currently does:When
_maybe_wrap_grpc_exception()returns the original exception unchanged, this degenerates intoraise e from e, creating a self-referential__cause__chain (e.__cause__ is e).That breaks error-reporting tools that walk
__cause__recursively and can also confuse debugging and logging paths for normal HTTP and handle-based requests.Root cause
This regression came from
#60482(221a19395a), which unified exception handling across thehandle_request*paths so gRPC requests could preserve user-set status codes.That behavior is correct when a new gRPC wrapper exception is created. It is wrong for non-gRPC requests, because
_maybe_wrap_grpc_exception()just returnse, so the code effectively executesraise e from e.Fix
Introduce a small
_raise_user_exception()helper inreplica.py:_maybe_wrap_grpc_exception()returns the original exception, raise it directlyfrom eThis preserves the intended gRPC chaining behavior while eliminating the circular
__cause__chain on non-gRPC paths.Testing
source /home/simple/github/ray/.venv/bin/activate PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \ python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v