[serve] Fix autoscaling for streaming deployments after inflight requests drain to 0 #61920

Merged
abrarsheikh merged 39 commits into master from kk/fix-61551 on Mar 28, 2026

Conversation

@kamil-kaczmarek

@kamil-kaczmarek kamil-kaczmarek commented Mar 20, 2026

Summary

Fixes #61551 - Serve autoscaling can stay pinned at 2 replicas for streaming deployments after real inflight requests drain to 0

Root cause

When RAY_SERVE_USE_GRPC_BY_DEFAULT=1 and RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0 (both set by RAY_SERVE_THROUGHPUT_OPTIMIZED=1), rejected requests leak inc_num_running_requests_for_replica increments. The rejected gRPCReplicaResult is discarded without being cancelled or iterated, so its done callback never executes. Each Ingress replica's DeploymentHandle maintains a running request counter for autoscaling. The leaked increments inflate this counter, which is periodically pushed to the controller as the primary autoscaling metric. The controller sees a non-zero autoscaling_total_requests, which resets the downscale delay timer, blocking downscaling indefinitely.
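The leak described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `FakeResult`, `RunningRequestCounter`, and `ticks_until_downscale` are hypothetical names invented for illustration, not Ray Serve classes, and the timer is simplified to "downscale after N consecutive zero-metric ticks".

```python
from typing import Callable, List


class FakeResult:
    """Hypothetical stand-in for a replica result that supports done callbacks."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[], None]] = []
        self._done = False

    def add_done_callback(self, callback: Callable[[], None]) -> None:
        self._callbacks.append(callback)

    def finish(self) -> None:
        # Reaching a terminal state is the only thing that fires callbacks.
        if self._done:
            return
        self._done = True
        for callback in self._callbacks:
            callback()


class RunningRequestCounter:
    """Hypothetical counter: incremented on send, decremented by a done callback."""

    def __init__(self) -> None:
        self.num_running = 0

    def track(self, result: FakeResult) -> None:
        self.num_running += 1
        result.add_done_callback(self._on_done)

    def _on_done(self) -> None:
        self.num_running -= 1


def ticks_until_downscale(metric_per_tick: List[int], delay_ticks: int) -> int:
    """Return the tick index at which downscaling fires, or -1 if it never does.

    Assumption: any non-zero metric resets the delay timer.
    """
    zero_streak = 0
    for tick, metric in enumerate(metric_per_tick):
        if metric > 0:
            zero_streak = 0
            continue
        zero_streak += 1
        if zero_streak >= delay_ticks:
            return tick
    return -1


counter = RunningRequestCounter()
accepted, rejected = FakeResult(), FakeResult()
counter.track(accepted)
counter.track(rejected)
accepted.finish()  # the accepted request completes normally
# `rejected` is dropped without finish(), so its decrement never runs.
leaked = counter.num_running
```

In this model a counter that stays at 1 forever means the timer never accumulates a zero streak, which matches the "pinned at 2 replicas" symptom.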

Fix

Call result.cancel() on rejected results in router.py, forcing the gRPC call into a terminal state so done callbacks fire and the counter stays balanced.
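The effect of cancelling a rejected result can be sketched with plain `asyncio` futures, which also run their done callbacks once cancelled. This is an analogy under assumptions, not the code in `router.py`: `Counter` and `route` are hypothetical names, and an `asyncio.Future` stands in for the gRPC result object.

```python
import asyncio
from typing import List


class Counter:
    """Hypothetical running-request counter driven by done callbacks."""

    def __init__(self) -> None:
        self.running = 0

    def track(self, fut: asyncio.Future) -> None:
        self.running += 1
        fut.add_done_callback(self._on_done)

    def _on_done(self, _fut: asyncio.Future) -> None:
        self.running -= 1


async def route(accept_pattern: List[bool], cancel_rejected: bool) -> int:
    """Try replicas in order until one accepts; return the final counter value."""
    loop = asyncio.get_running_loop()
    counter = Counter()
    for accepted in accept_pattern:
        fut = loop.create_future()
        counter.track(fut)
        if accepted:
            fut.set_result("ok")  # accepted request finishes normally
            break
        if cancel_rejected:
            # Force the rejected attempt into a terminal state.
            fut.cancel()
    # Done callbacks are scheduled on the loop, so yield once to let them run.
    await asyncio.sleep(0)
    return counter.running


leaky = asyncio.run(route([False, False, True], cancel_rejected=False))
fixed = asyncio.run(route([False, False, True], cancel_rejected=True))
```

With two rejections before an acceptance, the variant that skips `cancel()` ends with two leaked increments, while the cancelling variant returns to zero.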

Test plan

  • Added test_unary_with_rejection and test_streaming_with_rejection. Each test deploys an app, sends a load profile, asserts scale-up to 2 replicas, then asserts scale-down back to 1 after the requests drain.
  • A few test cases run with the minimal repro env vars and with RAY_SERVE_THROUGHPUT_OPTIMIZED=1, in both streaming and non-streaming variants.
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
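The shape of the test plan above can be sketched as a parameter matrix plus a check on the observed replica counts. This is a guess at the structure, not the actual test code: `ENV_CONFIGS`, `build_cases`, and `check_scale_cycle` are hypothetical names, and the exact parametrization used in the PR is assumed.

```python
import itertools
from typing import Dict, List, Tuple

# Environment variable sets named in the PR description.
ENV_CONFIGS: Dict[str, Dict[str, str]] = {
    "minimal_repro": {
        "RAY_SERVE_USE_GRPC_BY_DEFAULT": "1",
        "RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP": "0",
    },
    "throughput_optimized": {"RAY_SERVE_THROUGHPUT_OPTIMIZED": "1"},
}
MODES = ("unary", "streaming")


def build_cases() -> List[Tuple[str, str, Dict[str, str]]]:
    """Cross every env config with every request mode."""
    return [
        (env_name, mode, ENV_CONFIGS[env_name])
        for env_name, mode in itertools.product(ENV_CONFIGS, MODES)
    ]


def check_scale_cycle(replica_history: List[int], peak: int = 2, floor: int = 1) -> bool:
    """True if the history reached `peak` and finally settled at `floor`."""
    return peak in replica_history and replica_history[-1] == floor


cases = build_cases()
healthy = check_scale_cycle([1, 2, 2, 1])  # scaled up, then back down
pinned = check_scale_cycle([1, 2, 2, 2])  # never scaled back down
```

A real test would collect `replica_history` by polling the deployment status rather than hard-coding it.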
@kamil-kaczmarek kamil-kaczmarek self-assigned this Mar 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request introduces a crucial fix for resource management in gRPC streaming requests. By explicitly calling result.cancel() when a request is rejected by a replica, it ensures that done callbacks are properly executed and prevents potential counter leaks, especially for same-loop gRPC streaming results. This is a valuable improvement for the stability and correctness of the system.

Comment thread python/ray/serve/_private/router.py
@kamil-kaczmarek kamil-kaczmarek added the serve (Ray Serve Related Issue) and go (add ONLY when ready to merge, run all tests) labels and removed the go (add ONLY when ready to merge, run all tests) label Mar 21, 2026
@kamil-kaczmarek kamil-kaczmarek added the go add ONLY when ready to merge, run all tests label Mar 23, 2026
@kamil-kaczmarek kamil-kaczmarek changed the title [serve][fix][WIP] Mar 23, 2026
@kamil-kaczmarek kamil-kaczmarek marked this pull request as ready for review March 23, 2026 05:53
@kamil-kaczmarek kamil-kaczmarek requested a review from a team as a code owner March 23, 2026 05:53
Comment thread python/ray/serve/tests/test_autoscaling_with_streaming.py Outdated
@kamil-kaczmarek kamil-kaczmarek changed the title [serve][fix] Fix autoscaling for streaming deployments after inflight requests drain to 0 Mar 24, 2026
Comment thread python/ray/serve/_private/router.py Outdated
return result

# Request was rejected by the replica. Cancel the result so that
# done callbacks run. Without this, same-loop gRPC

Why is same-loop a necessary condition here?

I think that when we run the router on a separate loop (here), we create an additional background task (here). This task continuously fetches from the streaming gRPC call, so the callbacks are called when the request finishes even if the user never explicitly consumes the response.

We don't have this mechanism in the same-loop variant. In that case, _use_queue=False and self._calling_from_same_loop=True (here).
I think we take this path: get_async() -> return await self._get_internal() -> return await self._gen.__anext__(). The stream is only consumed lazily. For an abandoned result, nobody ever calls __anext__() or cancel(), so nothing ever drives that generator forward.
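The difference between the two variants can be shown with a plain async generator. This is a minimal sketch, assuming a `finally` block stands in for the done callback; `stream`, `lazy_same_loop`, and `drained_by_background_task` are hypothetical names, not Ray Serve internals.

```python
import asyncio
from typing import AsyncIterator, Dict


async def stream(events: Dict[str, bool]) -> AsyncIterator[int]:
    """Toy response stream; the finally block plays the role of a done callback."""
    try:
        for i in range(3):
            yield i
    finally:
        events["done"] = True


async def lazy_same_loop() -> bool:
    """Same-loop analogue: the generator is created but never iterated."""
    events = {"done": False}
    stream(events)  # abandoned: nobody calls __anext__() on it
    await asyncio.sleep(0)  # give the loop a chance to run anything pending
    return events["done"]


async def drained_by_background_task() -> bool:
    """Separate-loop analogue: a background task pumps items into a queue."""
    events = {"done": False}
    queue: asyncio.Queue = asyncio.Queue()

    async def pump() -> None:
        async for item in stream(events):
            await queue.put(item)

    await asyncio.create_task(pump())
    return events["done"]


lazy_done = asyncio.run(lazy_same_loop())
background_done = asyncio.run(drained_by_background_task())
```

The abandoned generator never starts executing, so its cleanup never runs, while the pumped generator reaches its end without any consumer reading the queue.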

Comment thread python/ray/serve/_private/router.py
Comment thread python/ray/serve/tests/test_autoscaling_with_streaming.py Outdated
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
@abrarsheikh

Thanks for the clear explanations, they make sense.

…lit into streaming/unary tests

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
… existing patterns

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/serve/tests/test_autoscaling_policy.py Outdated
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

@abrarsheikh abrarsheikh left a comment

left nits, non-blocking okay-to-address in follow-up

Comment thread python/ray/serve/tests/test_autoscaling_policy.py Outdated
Comment thread python/ray/serve/tests/test_autoscaling_policy.py Outdated
Comment thread python/ray/serve/tests/test_autoscaling_policy.py Outdated
Comment thread python/ray/serve/tests/test_autoscaling_policy.py Outdated
Comment thread python/ray/serve/tests/test_autoscaling_policy.py Outdated
Comment thread python/ray/serve/tests/test_autoscaling_policy.py
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
… readability

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
@kamil-kaczmarek

left nits, non-blocking okay-to-address in follow-up

All nits addressed.

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
@abrarsheikh abrarsheikh merged commit bf44dfc into master Mar 28, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the kk/fix-61551 branch March 28, 2026 02:34
mancfactor pushed a commit to mancfactor/ray that referenced this pull request Apr 2, 2026
…ests drain to 0 (ray-project#61920)


Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ests drain to 0 (ray-project#61920)


Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)

3 participants