[serve] Fix autoscaling for streaming deployments after inflight requests drain to 0#61920

Merged

abrarsheikh merged 39 commits into

Mar 28, 2026

kamil-kaczmarek commented Mar 20, 2026 •


Contributor

Summary

Fixes #61551 - Serve autoscaling can stay pinned at 2 replicas for streaming deployments after real inflight requests drain to 0

Root cause

When RAY_SERVE_USE_GRPC_BY_DEFAULT=1 and RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0 (both set by RAY_SERVE_THROUGHPUT_OPTIMIZED=1), rejected requests leak inc_num_running_requests_for_replica increments. The rejected gRPCReplicaResult is discarded without being cancelled or iterated, so its done callback never executes. Each Ingress replica's DeploymentHandle maintains a running request counter for autoscaling. The leaked increments inflate this counter, which is periodically pushed to the controller as the primary autoscaling metric. The controller sees a non-zero autoscaling_total_requests, which resets the downscale delay timer, blocking downscaling indefinitely.
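As a rough illustration of the leak described above, here is a minimal sketch in plain Python. `RunningRequestCounter`, `FakeResult`, `dispatch`, and `should_downscale` are hypothetical names made up for this example, not Ray Serve APIs, and the downscale check is a simplified model of a delay timer that restarts whenever the metric is non-zero.

```python
class RunningRequestCounter:
    """Hypothetical stand-in for the per-handle running request counter."""

    def __init__(self):
        self.num_running = 0

    def inc(self):
        self.num_running += 1

    def dec(self):
        self.num_running -= 1


class FakeResult:
    """Hypothetical result whose done callbacks run only on a terminal state."""

    def __init__(self):
        self._callbacks = []
        self._done = False

    def add_done_callback(self, callback):
        self._callbacks.append(callback)

    def finish(self):
        if not self._done:
            self._done = True
            for callback in self._callbacks:
                callback(self)


def dispatch(counter, rejected):
    """Increment on send; the decrement only happens in the done callback."""
    counter.inc()
    result = FakeResult()
    result.add_done_callback(lambda _result: counter.dec())
    if rejected:
        # Buggy path: the result is dropped without reaching a terminal
        # state, so the decrement never runs.
        return None
    result.finish()
    return result


def should_downscale(samples, delay_ticks):
    """True only if the metric was zero for the last `delay_ticks` samples."""
    quiet = 0
    for total in samples:
        quiet = quiet + 1 if total == 0 else 0  # any non-zero sample resets
    return quiet >= delay_ticks


counter = RunningRequestCounter()
for rejected in (False, True, False, True, False):
    dispatch(counter, rejected)
leaked = counter.num_running
```

With three accepted and two rejected requests, the counter in this sketch settles at 2 instead of 0, so a controller that keeps sampling it never observes the quiet period it needs to scale down.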

Fix

Call result.cancel() on rejected results in router.py, forcing the gRPC call into a terminal state so done callbacks fire and the counter stays balanced.
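The effect of cancelling can be sketched with a plain `asyncio.Future`, used here only as a stand-in for the rejected result (it is not `gRPCReplicaResult`, and `leaked_vs_cancelled` and `track` are hypothetical helpers). A future that is simply dropped never runs its done callbacks, while a cancelled one does.

```python
import asyncio


async def leaked_vs_cancelled() -> tuple:
    loop = asyncio.get_running_loop()
    running = {"leaky": 0, "fixed": 0}

    def track(key: str) -> asyncio.Future:
        # Increment on dispatch; decrement only when the future is done.
        fut = loop.create_future()
        running[key] += 1

        def on_done(_fut: asyncio.Future) -> None:
            running[key] -= 1

        fut.add_done_callback(on_done)
        return fut

    leaky = track("leaky")  # rejected and left pending: callback never runs
    fixed = track("fixed")
    fixed.cancel()  # rejected and cancelled: reaches a terminal state

    # Done callbacks are scheduled on the loop, so yield once to let them run.
    await asyncio.sleep(0)
    return running["leaky"], running["fixed"]


counts = asyncio.run(leaked_vs_cancelled())
```

In this sketch the dropped future leaves its counter at 1, while the cancelled one returns to 0.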

Test plan

Added test_unary_with_rejection and test_streaming_with_rejection. Each test deploys an app, sends a load profile, asserts scale-up to 2 replicas, then asserts scale-down back to 1 after the load drains.
Added a few test cases that run with the minimal repro env vars and with RAY_SERVE_THROUGHPUT_OPTIMIZED=1, covering both streaming and non-streaming variants.
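The scale-up then scale-down assertion pattern could look roughly like the sketch below. It runs against a toy in-memory model rather than a real Serve deployment; `wait_until` and `ToyDeployment` are hypothetical names for this example and are not the helpers used in the actual tests.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float = 5.0,
               interval_s: float = 0.01) -> None:
    """Poll `predicate` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise TimeoutError("condition not met in time")


class ToyDeployment:
    """Hypothetical model: replica count follows reported running requests."""

    def __init__(self, min_replicas: int = 1, max_replicas: int = 2,
                 target_per_replica: int = 1):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.target_per_replica = target_per_replica
        self.running_requests = 0

    @property
    def num_replicas(self) -> int:
        # Ceiling division, then clamp to [min_replicas, max_replicas].
        desired = -(-self.running_requests // self.target_per_replica)
        return max(self.min_replicas, min(self.max_replicas, desired))


deployment = ToyDeployment()

deployment.running_requests = 4  # load profile: traffic arrives
wait_until(lambda: deployment.num_replicas == 2)
scaled_up = deployment.num_replicas

deployment.running_requests = 0  # load drains; a leaked count would stay > 0
wait_until(lambda: deployment.num_replicas == 1)
scaled_down = deployment.num_replicas
```

If the reported request count stayed inflated after the drain, the second `wait_until` would time out, which is the failure mode the regression tests are meant to catch.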


          cancel rejected request

39f842f

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

kamil-kaczmarek self-assigned this

gemini-code-assist Bot reviewed


gemini-code-assist Bot left a comment

Contributor

Code Review

The pull request introduces a crucial fix for resource management in gRPC streaming requests. By explicitly calling result.cancel() when a request is rejected by a replica, it ensures that done callbacks are properly executed and prevents potential counter leaks, especially for same-loop gRPC streaming results. This is a valuable improvement for the stability and correctness of the system.

python/ray/serve/_private/router.py

kamil-kaczmarek added serve go and removed go labels

kamil-kaczmarek added 4 commits

March 20, 2026 19:31


          Merge branch 'master' into kk/fix-61551

42e907d


          Merge branch 'master' into kk/fix-61551

10b74eb


          Add regression test for autoscaling bug

68755dc

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Add test to BUILD.bazel

15fc001

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

kamil-kaczmarek added the go label

kamil-kaczmarek changed the title ~~[serve][fix][WIP]~~

kamil-kaczmarek marked this pull request as ready for review

March 23, 2026 05:53

kamil-kaczmarek requested a review from a team as a code owner

March 23, 2026 05:53

kamil-kaczmarek added 6 commits

March 23, 2026 18:23


          add test_autoscaling_with_streaming to HAProxy tests

72f5a14

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Merge branch 'master' into kk/fix-61551

cc8af42


          lint

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          drop HAProxy and direct ingress from the tests

b930a51

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Merge branch 'master' into kk/fix-61551

7ccd245


          Merge branch 'master' into kk/fix-61551

5564cf2

harshit-anyscale reviewed


python/ray/serve/tests/test_autoscaling_with_streaming.py Outdated

kamil-kaczmarek changed the title ~~[serve][fix] Fix autoscaling for streaming deployments after inflight requests drain to 0~~


          Merge branch 'master' into kk/fix-61551

6732be1

harshit-anyscale approved these changes


abrarsheikh reviewed


python/ray/serve/_private/router.py Outdated

                               return result
+                          # Request was rejected by the replica. Cancel the result so that
+                          # done callbacks run. Without this, same-loop gRPC

abrarsheikh Mar 25, 2026

Contributor

Why is same-loop a necessary condition here?

kamil-kaczmarek Mar 27, 2026

Contributor Author

I think that when we run on a separate loop, we create an additional background task. This task continuously fetches from the streaming gRPC call, so the callbacks are actually called when the request finishes, even if the user never explicitly consumes the response.

We don't have this mechanism in the same-loop variant. In that case, _use_queue=False and self._calling_from_same_loop=True.
I think we take this path: get_async() -> return await self._get_internal() -> return await self._gen.__anext__(). The stream is only consumed lazily. For an abandoned result, nobody ever calls __anext__() or cancel(), so nothing ever drives that generator forward.
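The difference between the two variants can be sketched with plain async generators, used here as stand-ins for the streaming gRPC call. `stream`, `drain`, and `demo` are hypothetical names for this example; the `finally` block plays the role of the done callback.

```python
import asyncio


async def stream(events: list, name: str):
    """Toy stream: the cleanup in `finally` only runs if someone drives it."""
    try:
        yield "chunk"
    finally:
        events.append(f"{name} finished")


async def drain(gen) -> None:
    """Background consumer, similar in spirit to the separate-loop task."""
    async for _ in gen:
        pass


async def demo() -> list:
    events = []
    # Same-loop analogue: created but never iterated or cancelled, so the
    # generator body never even starts.
    lazy = stream(events, "same-loop")
    # Separate-loop analogue: a background task drives it to completion.
    eager = stream(events, "separate-loop")
    await asyncio.create_task(drain(eager))
    return events


finished = asyncio.run(demo())
```

Only the stream with a background consumer records that it finished; the abandoned one stays silent, which matches the behaviour described in the comment above.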

python/ray/serve/_private/router.py

python/ray/serve/tests/test_autoscaling_with_streaming.py Outdated

kamil-kaczmarek added 5 commits

March 26, 2026 13:42


          Merge branch 'master' into kk/fix-61551

1bc27e6


          dataclass for test configuration

f849a08

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          remove test_autoscaling_with_streaming

7ea5e6b

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          remove test_autoscaling_with_streaming

b0979c7

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          migrate test_autoscaling_with_streaming to test_autoscaling_policy.py

6eb7078

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Merge branch 'master' into kk/fix-61551

c02bb41

abrarsheikh commented Mar 27, 2026

Contributor

Thanks for the clear explanations; they make sense.

kamil-kaczmarek added 7 commits

March 26, 2026 22:14


          Merge branch 'master' into kk/fix-61551

975961f


          refactor streaming autoscaling test: move env vars to BUILD.bazel, sp…

9bb669a

…lit into streaming/unary tests

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          remove config and simplify parameters

4e42ed7

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          rename

7ca973e

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          tweaks

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          refactor TestAutoscalingWithStreaming to make it more consistent with…

c59f67d

… existing patterns

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          lint

cbe6196

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

cursor Bot reviewed


cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


python/ray/serve/tests/test_autoscaling_policy.py Outdated

kamil-kaczmarek added 4 commits

March 27, 2026 08:23


          remove unnecessary code comments

ffa1077

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

nit

c17ab5e

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Merge branch 'master' into kk/fix-61551

156b98d


          nit (2)

ead163f

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

kamil-kaczmarek requested a review from abrarsheikh

March 27, 2026 09:04

abrarsheikh approved these changes


abrarsheikh left a comment

Contributor

left nits, non-blocking okay-to-address in follow-up

python/ray/serve/tests/test_autoscaling_policy.py Outdated

python/ray/serve/tests/test_autoscaling_policy.py Outdated

python/ray/serve/tests/test_autoscaling_policy.py Outdated

python/ray/serve/tests/test_autoscaling_policy.py Outdated

python/ray/serve/tests/test_autoscaling_policy.py Outdated

python/ray/serve/tests/test_autoscaling_policy.py

kamil-kaczmarek added 6 commits

March 27, 2026 19:40


          limit to only run TestAutoscalingWithRejection

d7a435f

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Added more detailed comment to the test

124ede7

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          lint

f3b5688

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Set min/max replicas and load profile values directly for better code…

3444a54

… readability

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          replace ray_instance with serve_instance, assert total running requests

2b9003b

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Merge branch 'master' into kk/fix-61551

26e7b77

kamil-kaczmarek commented Mar 27, 2026

Contributor Author

left nits, non-blocking okay-to-address in follow-up

All nits addressed.

kamil-kaczmarek added 2 commits

March 27, 2026 22:50


          switch to large size

fa8596c

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>


          Merge branch 'master' into kk/fix-61551

7fbab8d

abrarsheikh approved these changes


abrarsheikh merged commit bf44dfc into master

6 checks passed

abrarsheikh deleted the kk/fix-61551 branch

March 28, 2026 02:34

mancfactor pushed a commit to mancfactor/ray that referenced this pull request


          [serve] Fix autoscaling for streaming deployments after inflight requ…

a570dc8

…ests drain to 0 (ray-project#61920)

## Summary

Fixes ray-project#61551 - Serve autoscaling can stay pinned at 2 replicas for
streaming deployments after real inflight requests drain to 0

**Root cause**

When `RAY_SERVE_USE_GRPC_BY_DEFAULT=1` and
`RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0` (both set by
`RAY_SERVE_THROUGHPUT_OPTIMIZED=1`), rejected requests leak
`inc_num_running_requests_for_replica` increments. The rejected
`gRPCReplicaResult` is discarded without being cancelled or iterated, so
its done callback never executes. Each Ingress replica's
`DeploymentHandle` maintains a running request counter for autoscaling.
The leaked increments inflate this counter, which is periodically pushed
to the controller as the primary autoscaling metric. The controller sees
a non-zero `autoscaling_total_requests`, which resets the downscale
delay timer, blocking downscaling indefinitely.

**Fix**

Call `result.cancel()` on rejected results in `router.py`, forcing the
gRPC call into a terminal state so done callbacks fire and the counter
stays balanced.

## Test plan
- Added `test_unary_with_rejection` and `test_streaming_with_rejection`:
deploys an app, sends a load profile, asserts scale-up to 2 replicas,
then asserts scale-down back to 1 after drain.
- Few test cases with minimal repro env vars and
`RAY_SERVE_THROUGHPUT_OPTIMIZED=1`, with both streaming and
non-streaming variants.

---------

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>

Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request


          [serve] Fix autoscaling for streaming deployments after inflight requ…

51dbf19

…ests drain to 0 (ray-project#61920)


Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
