[serve] Minor Serve test reliability and gRPC tracing improvements#63833
Conversation
…tion Bump the wait_for_condition timeout in test_serve_metrics_for_successful_connection from the default (10s) to 40s to reduce flakiness while waiting for HAProxy metrics to appear. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request updates the test_metrics_haproxy.py test file by adding a timeout=40 parameter to the wait_for_condition function call. This change helps prevent test flakiness by allowing more time for the metrics verification condition to be met. There are no review comments to address, and I have no additional feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
The gRPC success path already records span attributes via set_rpc_span_attributes and set_span_exception. Record the same on the error path so failed gRPC requests are traced with their status code and exception instead of dropping out of the trace silently. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Add a regression test covering the case where replacement replicas are STARTING during node draining: the rank-consistency check should be skipped rather than raising, matching the guard in DeploymentState. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Replace the racy 10-request burst plus wait_for_condition polling with a single queued request and a fixed rejection loop, then assert the queued request completes after the in-flight request is unblocked. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The test drains a node via gcs_client.drain_node and waits for STARTING replacement replicas to appear, but the open-source controller's DefaultClusterNodeInfoCache.get_draining_nodes() always returns an empty dict, so the drain is never observed and no replicas migrate. The test times out deterministically. Remove it along with the imports added solely for it. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
akyang-anyscale
left a comment
There was a problem hiding this comment.
Rank-consistency regression test (test_replica_ranks.py): adds coverage for the case where replacement replicas are STARTING during node draining, where the rank-consistency check should be skipped rather than raise (matching the existing guard in DeploymentState).
Didn't see the changes for this item
Instead of calling set_rpc_span_attributes/set_span_exception directly in the direct-ingress unary error path, surface the status code via the status_code_callback and re-raise so _handle_errors_and_metrics (inside _wrap_request) records the span attributes, exception, and metrics. This mirrors handle_request_with_rejection and avoids duplicating the span-recording logic. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Keep the existing burst-based backpressure check rather than the single-queued-request rewrite. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ay-project#63833) ## Why are these changes needed? A few small, independent Serve improvements bundled together: - **gRPC error-path tracing** (`replica.py`): the gRPC success path already records span attributes and exceptions via `_wrap_request`; route the direct-ingress unary error path through the same `_handle_errors_and_metrics` handling (by surfacing the status code and re-raising) so failed gRPC requests are traced with their status code and exception instead of silently dropping out of the trace. - **Timeout bump** for `test_serve_metrics_for_successful_connection` to reduce flakiness while waiting for metrics. No production behavior changes beyond the added error-path tracing instrumentation. ## Related issue number N/A ## Checks - [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Why are these changes needed?
A few small, independent Serve improvements bundled together:
replica.py): the gRPC success path already records span attributes and exceptions via_wrap_request; route the direct-ingress unary error path through the same_handle_errors_and_metricshandling (by surfacing the status code and re-raising) so failed gRPC requests are traced with their status code and exception instead of silently dropping out of the trace.test_serve_metrics_for_successful_connectionto reduce flakiness while waiting for metrics.No production behavior changes beyond the added error-path tracing instrumentation.
Related issue number
N/A
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.