[Serve] Make batching work with multiplexing by abrarsheikh · Pull Request #59334 · ray-project/ray

abrarsheikh · 2025-12-10T04:26:56Z

Add documentation
Update `get_multiplexed_model_id` to first check whether it is called from a batch context
Update logic
Add tests
Does not introduce any backward incompatibility: previously the system gave no guarantee about the contents of a batch, and now it adds a constraint guaranteeing that each batch contains requests for the same model
Execute sub-batches concurrently
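
To illustrate the last two points, here is a minimal sketch (not the code in `python/ray/serve/batching.py`) of splitting one mixed batch by model ID and running the resulting sub-batches concurrently. The names `run_model`, `process_batch`, and `demo` are hypothetical, and `run_model` is only a stand-in for the user's batched handler.

```python
import asyncio
from collections import defaultdict


async def run_model(model_id, inputs):
    # Hypothetical stand-in for a batched handler serving a single model.
    await asyncio.sleep(0.01)
    return [f"{model_id}:{x}" for x in inputs]


async def process_batch(requests):
    """requests is a list of (model_id, payload) pairs in arrival order."""
    # Group request positions by model ID so every sub-batch is homogeneous.
    groups = defaultdict(list)
    for index, (model_id, _) in enumerate(requests):
        groups[model_id].append(index)

    async def run_group(model_id, indices):
        outputs = await run_model(model_id, [requests[i][1] for i in indices])
        return indices, outputs

    # Run all sub-batches concurrently instead of one after another.
    group_results = await asyncio.gather(
        *(run_group(model_id, indices) for model_id, indices in groups.items())
    )

    # Put each output back at the position of the request that produced it.
    results = [None] * len(requests)
    for indices, outputs in group_results:
        for i, out in zip(indices, outputs):
            results[i] = out
    return results


def demo():
    requests = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    return asyncio.run(process_batch(requests))
```

Calling `demo()` returns one result per request in the original arrival order, even though the requests for models `a`, `b`, and `c` ran as separate sub-batches.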

What I dislike about this implementation is that it does not fill the batch when the replica is responsible for > 2 models and incoming traffic is equally distributed between those models, because the current implementation fills the batch first and then divides it by model.

| Metric | Baseline (42905 reqs) | Master (27526 reqs) | Δ Change (Master − Baseline) |
| --- | --- | --- | --- |
| Requests | 42,905 | 27,526 | −15,379 |
| Fails | 0 | 0 | 0 |
| Median (ms) | 290 | 300 | +10 ms |
| 95%ile (ms) | 560 | 570 | +10 ms |
| 99%ile (ms) | 620 | 640 | +20 ms |
| Average (ms) | 327.41 | 332.96 | +5.55 ms |
| Min (ms) | 61 | 80 | +19 ms |
| Max (ms) | 764 | 802 | +38 ms |
| Avg Size (bytes) | 13 | 13 | 0 |
| Current RPS | 299 | 293 | −6 |
| Current Failures/s | 0 | 0 | 0 |

Signed-off-by: abrar <abrar@anyscale.com>


harshit-anyscale

lgtm, except model_1.pt file is added but has no changes


akyang-anyscale · 2025-12-18T19:47:00Z

+
+## Using model multiplexing with batching
+
+You can combine model multiplexing with the `@serve.batch` decorator for efficient batched inference. When you use both features together, Ray Serve automatically splits batches by model ID to ensure each batch contains only requests for the same model. This prevents issues where a single batch would contain requests targeting different models.


The way I understand this description is that Serve will treat each model's batch independently, i.e., wait to reach `max_batch_size` or the timeout before firing for each model. In reality, it waits for `max_batch_size` or the timeout across all models. For example, if `max_batch_size=8`, Serve will process sub-batches of sizes [1, 4, 3] instead of waiting for each model to have 8 requests.

you are right.
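
The behavior discussed in this thread can be sketched in plain Python. This is only an illustration of the fill-then-split order described above, not Serve's implementation; `fill_then_split`, `sub_batch_sizes`, and the model IDs `m1`, `m2`, `m3` are hypothetical names chosen to reproduce the [1, 4, 3] example.

```python
from collections import defaultdict


def fill_then_split(queue, max_batch_size):
    """Take up to max_batch_size requests across all models, then split by model ID."""
    # Step 1: the batch is filled from the shared queue, regardless of model.
    batch = queue[:max_batch_size]
    # Step 2: only afterwards is the batch divided into per-model sub-batches.
    sub_batches = defaultdict(list)
    for model_id, payload in batch:
        sub_batches[model_id].append(payload)
    return dict(sub_batches)


def sub_batch_sizes(queue, max_batch_size):
    return [len(items) for items in fill_then_split(queue, max_batch_size).values()]


# Eight queued requests spread over three models (hypothetical traffic).
incoming = [
    ("m1", 0), ("m2", 1), ("m2", 2), ("m3", 3),
    ("m2", 4), ("m3", 5), ("m2", 6), ("m3", 7),
]
print(sub_batch_sizes(incoming, max_batch_size=8))  # [1, 4, 3]
```

With `max_batch_size=8`, no single model receives a full batch of 8; the batch limit applies to the combined queue, which is the concern raised in the PR description.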

[Serve] Make batching work with multiplexing

4a1ec3b

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh requested review from a team as code owners December 10, 2025 04:26

abrarsheikh added the go add ONLY when ready to merge, run all tests label Dec 10, 2025

abrarsheikh mentioned this pull request Dec 10, 2025

[Serve] model multiplexing and batching does not work together #56633

Closed

abrarsheikh requested a review from harshit-anyscale December 10, 2025 04:32

fix doc test

28ad89a

Signed-off-by: abrar <abrar@anyscale.com>

harshit-anyscale approved these changes Dec 10, 2025

ray-gardener Bot added the serve Ray Serve Related Issue label Dec 10, 2025

abrarsheikh added 2 commits December 11, 2025 05:27

Merge branch 'master' of github.com:ray-project/ray into 56633-abrar-…

56960d3

…batch

process sub batches concurrently

957ee85

Signed-off-by: abrar <abrar@anyscale.com>

cursor Bot reviewed Dec 11, 2025

Comment thread python/ray/serve/batching.py Outdated

abrarsheikh added 2 commits December 11, 2025 06:14

capture right context

670fbf9

Signed-off-by: abrar <abrar@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into 56633-abrar-…

b899ee4

…batch

abrarsheikh requested a review from akyang-anyscale December 16, 2025 04:48

harshit-anyscale approved these changes Dec 16, 2025

Comment thread python/ray/serve/batching.py Outdated

remove extra code

7db4db7

Signed-off-by: abrar <abrar@anyscale.com>

harshit-anyscale approved these changes Dec 18, 2025

akyang-anyscale approved these changes Dec 18, 2025

abrarsheikh merged commit 1599fb7 into master Dec 18, 2025
6 checks passed

abrarsheikh deleted the 56633-abrar-batch branch December 18, 2025 21:33

abrarsheikh merged 7 commits into master from 56633-abrar-batch
