Skip to content

[serve] Minor Serve test reliability and gRPC tracing improvements#63833

Merged
eicherseiji merged 8 commits into
ray-project:masterfrom
eicherseiji:seiji/test-metrics-haproxy-timeout
Jun 5, 2026
Merged

[serve] Minor Serve test reliability and gRPC tracing improvements#63833
eicherseiji merged 8 commits into
ray-project:masterfrom
eicherseiji:seiji/test-metrics-haproxy-timeout

Conversation

@eicherseiji

@eicherseiji eicherseiji commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Why are these changes needed?

A few small, independent Serve improvements bundled together:

  • gRPC error-path tracing (replica.py): the gRPC success path already records span attributes and exceptions via _wrap_request; route the direct-ingress unary error path through the same _handle_errors_and_metrics handling (by surfacing the status code and re-raising) so failed gRPC requests are traced with their status code and exception instead of silently dropping out of the trace.
  • Timeout bump for test_serve_metrics_for_successful_connection to reduce flakiness while waiting for metrics.

No production behavior changes beyond the added error-path tracing instrumentation.

Related issue number

N/A

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
…tion

Bump the wait_for_condition timeout in
test_serve_metrics_for_successful_connection from the default (10s) to
40s to reduce flakiness while waiting for HAProxy metrics to appear.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the test_metrics_haproxy.py test file by adding a timeout=40 parameter to the wait_for_condition function call. This change helps prevent test flakiness by allowing more time for the metrics verification condition to be met. There are no review comments to address, and I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@eicherseiji eicherseiji added the go add ONLY when ready to merge, run all tests label Jun 3, 2026
@eicherseiji eicherseiji self-assigned this Jun 3, 2026
The gRPC success path already records span attributes via
set_rpc_span_attributes and set_span_exception. Record the same on the
error path so failed gRPC requests are traced with their status code and
exception instead of dropping out of the trace silently.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Add a regression test covering the case where replacement replicas are
STARTING during node draining: the rank-consistency check should be
skipped rather than raising, matching the guard in DeploymentState.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Replace the racy 10-request burst plus wait_for_condition polling with a
single queued request and a fixed rejection loop, then assert the queued
request completes after the in-flight request is unblocked.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji changed the title [serve] Increase timeout for test_serve_metrics_for_successful_connection Jun 3, 2026
The test drains a node via gcs_client.drain_node and waits for STARTING
replacement replicas to appear, but the open-source controller's
DefaultClusterNodeInfoCache.get_draining_nodes() always returns an empty
dict, so the drain is never observed and no replicas migrate. The test
times out deterministically. Remove it along with the imports added
solely for it.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji marked this pull request as ready for review June 4, 2026 22:51
@eicherseiji eicherseiji requested a review from a team as a code owner June 4, 2026 22:51

@akyang-anyscale akyang-anyscale left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rank-consistency regression test (test_replica_ranks.py): adds coverage for the case where replacement replicas are STARTING during node draining, where the rank-consistency check should be skipped rather than raise (matching the existing guard in DeploymentState).

Didn't see the changes for this item

Comment thread python/ray/serve/tests/test_backpressure_grpc.py Outdated
Comment thread python/ray/serve/_private/replica.py Outdated
Instead of calling set_rpc_span_attributes/set_span_exception directly in
the direct-ingress unary error path, surface the status code via the
status_code_callback and re-raise so _handle_errors_and_metrics (inside
_wrap_request) records the span attributes, exception, and metrics. This
mirrors handle_request_with_rejection and avoids duplicating the
span-recording logic.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Keep the existing burst-based backpressure check rather than the
single-queued-request rewrite.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Jun 5, 2026
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji merged commit 03d6c1d into ray-project:master Jun 5, 2026
6 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…ay-project#63833)

## Why are these changes needed?

A few small, independent Serve improvements bundled together:

- **gRPC error-path tracing** (`replica.py`): the gRPC success path
already records span attributes and exceptions via `_wrap_request`;
route the direct-ingress unary error path through the same
`_handle_errors_and_metrics` handling (by surfacing the status code and
re-raising) so failed gRPC requests are traced with their status code
and exception instead of silently dropping out of the trace.
- **Timeout bump** for `test_serve_metrics_for_successful_connection` to
reduce flakiness while waiting for metrics.

No production behavior changes beyond the added error-path tracing
instrumentation.

## Related issue number

N/A

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing.
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling serve Ray Serve Related Issue

2 participants