[serve] Fix autoscaling for streaming deployments after inflight requests drain to 0#61920
Conversation
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
There was a problem hiding this comment.
Code Review
The pull request introduces a crucial fix for resource management in gRPC streaming requests. By explicitly calling result.cancel() when a request is rejected by a replica, it ensures that done callbacks are properly executed and prevents potential counter leaks, especially for same-loop gRPC streaming results. This is a valuable improvement for the stability and correctness of the system.
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
| return result | ||
|
|
||
| # Request was rejected by the replica. Cancel the result so that | ||
| # done callbacks run. Without this, same-loop gRPC |
There was a problem hiding this comment.
Why is same-loop a necessary condition here?
There was a problem hiding this comment.
I think, when we specify to run on separate loop (here), we create an additional background task (here). This task continuously fetches from the streaming grpc call. This way callbacks will actually be called when the request finishes even without the user explicitly consuming the response.
We don't have this mechanism when running same-loop variant. In this case: _use_queue=False and self._calling_from_same_loop=True here.
I think we take this path: get_async() -> return await self._get_internal() -> return await self._gen.__anext__(). The stream is only consumed lazily. For abandoned result, nobody ever calls __anext__() or cancel(), nothing ever drives that generator forward.
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
|
thanks for the clear explanations they make sense |
…lit into streaming/unary tests Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
… existing patterns Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
abrarsheikh
left a comment
There was a problem hiding this comment.
left nits, non-blocking okay-to-address in follow-up
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
… readability Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
All nits addressed. |
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
…ests drain to 0 (ray-project#61920) ## Summary Fixes ray-project#61551 - Serve autoscaling can stay pinned at 2 replicas for streaming deployments after real inflight requests drain to 0 **Root cause** When `RAY_SERVE_USE_GRPC_BY_DEFAULT=1` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0` (both set by `RAY_SERVE_THROUGHPUT_OPTIMIZED=1`), rejected requests leak `inc_num_running_requests_for_replica` increments. The rejected `gRPCReplicaResult` is discarded without being cancelled or iterated, so its done callback never executes. Each Ingress replica's `DeploymentHandle` maintains a running request counter for autoscaling. The leaked increments inflate this counter, which is periodically pushed to the controller as the primary autoscaling metric. The controller sees a non-zero `autoscaling_total_requests`, which resets the downscale delay timer, blocking downscaling indefinitely. **Fix** Call `result.cancel()` on rejected results in `router.py`, forcing the gRPC call into a terminal state so done callbacks fire and the counter stays balanced. ## Test plan - Added `test_unary_with_rejection` and `test_streaming_with_rejection`: deploys an app, sends a load profile, asserts scale-up to 2 replicas, then asserts scale-down back to 1 after drain. - Few test cases with minimal repro env vars and `RAY_SERVE_THROUGHPUT_OPTIMIZED=1`, with both streaming and non-streaming variants. --------- Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Signed-off-by: Frank Mancina <fmancina@haproxy.com>
…ests drain to 0 (ray-project#61920) ## Summary Fixes ray-project#61551 - Serve autoscaling can stay pinned at 2 replicas for streaming deployments after real inflight requests drain to 0 **Root cause** When `RAY_SERVE_USE_GRPC_BY_DEFAULT=1` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0` (both set by `RAY_SERVE_THROUGHPUT_OPTIMIZED=1`), rejected requests leak `inc_num_running_requests_for_replica` increments. The rejected `gRPCReplicaResult` is discarded without being cancelled or iterated, so its done callback never executes. Each Ingress replica's `DeploymentHandle` maintains a running request counter for autoscaling. The leaked increments inflate this counter, which is periodically pushed to the controller as the primary autoscaling metric. The controller sees a non-zero `autoscaling_total_requests`, which resets the downscale delay timer, blocking downscaling indefinitely. **Fix** Call `result.cancel()` on rejected results in `router.py`, forcing the gRPC call into a terminal state so done callbacks fire and the counter stays balanced. ## Test plan - Added `test_unary_with_rejection` and `test_streaming_with_rejection`: deploys an app, sends a load profile, asserts scale-up to 2 replicas, then asserts scale-down back to 1 after drain. - Few test cases with minimal repro env vars and `RAY_SERVE_THROUGHPUT_OPTIMIZED=1`, with both streaming and non-streaming variants. --------- Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

Summary
Fixes #61551 - Serve autoscaling can stay pinned at 2 replicas for streaming deployments after real inflight requests drain to 0
Root cause
When
RAY_SERVE_USE_GRPC_BY_DEFAULT=1andRAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0(both set byRAY_SERVE_THROUGHPUT_OPTIMIZED=1), rejected requests leakinc_num_running_requests_for_replicaincrements. The rejectedgRPCReplicaResultis discarded without being cancelled or iterated, so its done callback never executes. Each Ingress replica'sDeploymentHandlemaintains a running request counter for autoscaling. The leaked increments inflate this counter, which is periodically pushed to the controller as the primary autoscaling metric. The controller sees a non-zeroautoscaling_total_requests, which resets the downscale delay timer, blocking downscaling indefinitely.Fix
Call
result.cancel()on rejected results inrouter.py, forcing the gRPC call into a terminal state so done callbacks fire and the counter stays balanced.Test plan
test_unary_with_rejectionandtest_streaming_with_rejection: deploys an app, sends a load profile, asserts scale-up to 2 replicas, then asserts scale-down back to 1 after drain.RAY_SERVE_THROUGHPUT_OPTIMIZED=1, with both streaming and non-streaming variants.