Skip to content

[Serve] Avoid self-cause on non-gRPC replica exceptions#62412

Merged
abrarsheikh merged 7 commits into
ray-project:masterfrom
Ziy1-Tan:fix/issue-62358-replica-exception-chaining
Apr 17, 2026
Merged

[Serve] Avoid self-cause on non-gRPC replica exceptions#62412
abrarsheikh merged 7 commits into
ray-project:masterfrom
Ziy1-Tan:fix/issue-62358-replica-exception-chaining

Conversation

@Ziy1-Tan

@Ziy1-Tan Ziy1-Tan commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Why

Fixes #62358.

For non-gRPC Serve requests, Replica.handle_request* currently does:

raise self._maybe_wrap_grpc_exception(e, request_metadata) from e

When _maybe_wrap_grpc_exception() returns the original exception unchanged, this degenerates into raise e from e, creating a self-referential __cause__ chain (e.__cause__ is e).

That breaks error-reporting tools that walk __cause__ recursively and can also confuse debugging and logging paths for normal HTTP and handle-based requests.

Root cause

This regression came from #60482 (221a19395a), which unified exception handling across the handle_request* paths so gRPC requests could preserve user-set status codes.

That behavior is correct when a new gRPC wrapper exception is created. It is wrong for non-gRPC requests, because _maybe_wrap_grpc_exception() just returns e, so the code effectively executes raise e from e.

Fix

Introduce a small _raise_user_exception() helper in replica.py:

  • if _maybe_wrap_grpc_exception() returns the original exception, raise it directly
  • if it returns a wrapped gRPC exception, raise the wrapped exception from e

This preserves the intended gRPC chaining behavior while eliminating the circular __cause__ chain on non-gRPC paths.

Testing

source /home/simple/github/ray/.venv/bin/activate
PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \
python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v
@Ziy1-Tan Ziy1-Tan requested a review from a team as a code owner April 8, 2026 01:34

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors exception handling within the Replica class to prevent self-referential __cause__ chains, particularly for non-gRPC requests. It introduces a centralized _raise_user_exception helper method and adds unit tests to verify the fix. The feedback points out that the unit tests currently redefine the production logic within a fake class, which risks testing the test code itself rather than the actual implementation; it is recommended to reference the production method directly in the test setup.

Comment on lines +55 to +59
def _raise_user_exception(self, e, request_metadata):
wrapped_exception = self._maybe_wrap_grpc_exception(e, request_metadata)
if wrapped_exception is e:
raise e
raise wrapped_exception from e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The FakeReplica class redefines _raise_user_exception with the exact same logic as the production code in replica.py. This means the test is verifying the logic defined within the test file itself rather than the actual implementation in the codebase. To ensure the production code is correctly verified, you should remove this redundant definition and instead assign the production method to the fake class.

    _raise_user_exception = Replica._raise_user_exception
@@ -0,0 +1,108 @@
from contextlib import asynccontextmanager, contextmanager

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest dropping this unit test. instead use the repro from here #62358 and add a integration test in test_handle.py

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Let me take a look.

@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Apr 8, 2026
Comment thread python/ray/serve/tests/test_handle.py
@abrarsheikh

Copy link
Copy Markdown
Contributor

CI is failing.

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
…hain

Drop the FakeReplica unit test in favour of an integration test in
test_handle.py that exercises the full replica stack.

The integration test spins up a real deployment, registers a CauseWalker
logging handler that walks the __cause__ chain (simulating Sentry / Bugsnag),
triggers a user-code exception, and asserts no self-referential cycle is
detected.

Addresses review feedback from abrarsheikh and gemini-code-assist.

Fixes ray-project#62358

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
The CauseWalker logging handler runs inside a Ray Serve replica actor
(a separate OS process). The previous implementation captured a plain
Python list from the driver scope; Ray serialises it by value (deep copy)
when the App class is sent to the actor, so appending to it inside the
actor never propagates back to the driver. The assertion therefore always
passed, providing zero regression coverage.

Fix: replace the driver-side list with a named @ray.remote actor
(_CycleDetector) that lives in the Ray cluster. The handler calls
det.mark.remote() via Ray RPC, and the driver queries
detector.detected.remote() after the request completes.

Verified:
- Ray 3.0.0.dev0 (fix present): PASS
- Ray 2.54.0 (bug present): FAIL (correctly detects self-cause chain)

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
@Ziy1-Tan Ziy1-Tan force-pushed the fix/issue-62358-replica-exception-chaining branch from cafa48c to 9c52b0c Compare April 11, 2026 14:52
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Apr 11, 2026
@Ziy1-Tan Ziy1-Tan requested a review from abrarsheikh April 12, 2026 02:35
Comment thread python/ray/serve/tests/test_handle.py Outdated
await handle.fail.remote()

# Give the replica a moment to process the exception through logging.
await asyncio.sleep(1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we make the test deterministic, relying on sleep can cause the test to flake in CI.


@pytest.mark.asyncio
@pytest.mark.timeout(30)
async def test_non_grpc_exception_no_self_cause(serve_instance):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is a little hard to follow, is there a more simpler version we can write?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reuse the issue repro as the basis, but I don’t think the script should be copied directly into CI because it relies on sleep and health-check timing.

I’ll either reduce it to a focused unit test around _raise_user_exception, or keep a thin integration test with an explicit synchronization signal instead of timing-based waiting.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit synchronization signal sounds good to me.

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 09866b6. Configure here.

wrapped_exception = self._maybe_wrap_grpc_exception(e, request_metadata)
if wrapped_exception is e:
raise e
raise wrapped_exception from e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return type annotation hides unconditional raise from callers

Low Severity

_raise_user_exception always raises and never returns, but is annotated -> None instead of -> NoReturn. This means type checkers and readers at the three call sites (e.g., handle_request which expects Tuple[bytes, Any]) cannot tell the function is unconditionally raising. If someone later adds a conditional return path to this helper, callers would silently return None instead of propagating the exception, introducing a subtle bug.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 09866b6. Configure here.

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
@abrarsheikh abrarsheikh merged commit d5f4c8c into ray-project:master Apr 17, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…62412)

## Why

Fixes ray-project#62358.

For non-gRPC Serve requests, `Replica.handle_request*` currently does:

```python
raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
```

When `_maybe_wrap_grpc_exception()` returns the original exception
unchanged, this degenerates into `raise e from e`, creating a
self-referential `__cause__` chain (`e.__cause__ is e`).

That breaks error-reporting tools that walk `__cause__` recursively and
can also confuse debugging and logging paths for normal HTTP and
handle-based requests.

## Root cause

This regression came from `ray-project#60482` (`221a19395a`), which unified
exception handling across the `handle_request*` paths so gRPC requests
could preserve user-set status codes.

That behavior is correct when a new gRPC wrapper exception is created.
It is wrong for non-gRPC requests, because
`_maybe_wrap_grpc_exception()` just returns `e`, so the code effectively
executes `raise e from e`.

## Fix

Introduce a small `_raise_user_exception()` helper in `replica.py`:

- if `_maybe_wrap_grpc_exception()` returns the original exception,
raise it directly
- if it returns a wrapped gRPC exception, raise the wrapped exception
`from e`

This preserves the intended gRPC chaining behavior while eliminating the
circular `__cause__` chain on non-gRPC paths.

## Testing

```bash
source /home/simple/github/ray/.venv/bin/activate
PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \
python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v
```

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…62412)

## Why

Fixes ray-project#62358.

For non-gRPC Serve requests, `Replica.handle_request*` currently does:

```python
raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
```

When `_maybe_wrap_grpc_exception()` returns the original exception
unchanged, this degenerates into `raise e from e`, creating a
self-referential `__cause__` chain (`e.__cause__ is e`).

That breaks error-reporting tools that walk `__cause__` recursively and
can also confuse debugging and logging paths for normal HTTP and
handle-based requests.

## Root cause

This regression came from `ray-project#60482` (`221a19395a`), which unified
exception handling across the `handle_request*` paths so gRPC requests
could preserve user-set status codes.

That behavior is correct when a new gRPC wrapper exception is created.
It is wrong for non-gRPC requests, because
`_maybe_wrap_grpc_exception()` just returns `e`, so the code effectively
executes `raise e from e`.

## Fix

Introduce a small `_raise_user_exception()` helper in `replica.py`:

- if `_maybe_wrap_grpc_exception()` returns the original exception,
raise it directly
- if it returns a wrapped gRPC exception, raise the wrapped exception
`from e`

This preserves the intended gRPC chaining behavior while eliminating the
circular `__cause__` chain on non-gRPC paths.

## Testing

```bash
source /home/simple/github/ray/.venv/bin/activate
PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \
python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v
```

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

2 participants