[Serve] Make batching work with multiplexing #59334

Merged
abrarsheikh merged 7 commits into master from 56633-abrar-batch
Dec 18, 2025

@abrarsheikh abrarsheikh commented Dec 10, 2025

fixes #56633

  • Add documentation
  • Update `get_multiplexed_model_id` to check first whether we are in a batch context
  • Update logic
  • Add tests
  • Does not introduce any backward incompatibility: previously the system did not provide any guarantee about the contents of a batch, and now we add a constraint that guarantees each batch contains requests for the same model.
  • Execute sub-batches concurrently

The thing I dislike about this implementation is that it does not fill the batch in the case where the replica is responsible for > 2 models and incoming traffic is equally distributed among those models, because the current implementation fills the batch first and then divides it by model.
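
The "fill first, then divide" behavior described above can be illustrated with a minimal sketch. This is not the code in `python/ray/serve/batching.py`; the names `split_by_model`, `run_sub_batch`, and `process_batch` are hypothetical, and model inference is replaced by a stand-in coroutine. It only shows how an already-filled batch might be grouped by model ID and how the resulting sub-batches could run concurrently.

```python
import asyncio
from collections import defaultdict


def split_by_model(batch):
    """Group (model_id, payload) pairs into per-model sub-batches, keeping arrival order."""
    sub_batches = defaultdict(list)
    for model_id, payload in batch:
        sub_batches[model_id].append(payload)
    return dict(sub_batches)


async def run_sub_batch(model_id, payloads):
    # Hypothetical stand-in for batched inference on a single model.
    await asyncio.sleep(0)
    return [f"{model_id}:{payload}" for payload in payloads]


async def process_batch(batch):
    """Split one filled batch by model ID and run the sub-batches concurrently."""
    sub_batches = split_by_model(batch)
    results = await asyncio.gather(
        *(run_sub_batch(model_id, payloads) for model_id, payloads in sub_batches.items())
    )
    return dict(zip(sub_batches, results))


# Three models with evenly distributed traffic: a full batch of 6
# is divided into three sub-batches of 2 requests each.
filled_batch = [("a", 0), ("b", 1), ("c", 2), ("a", 3), ("b", 4), ("c", 5)]
sub_batch_sizes = {m: len(p) for m, p in split_by_model(filled_batch).items()}
outputs = asyncio.run(process_batch(filled_batch))
```

In this sketch every model receives a sub-batch well below the overall batch size, which is the under-filling concern raised above.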

| Metric | Baseline (42905 reqs) | Master (27526 reqs) | Δ Change (Master − Baseline) |
| --- | --- | --- | --- |
| Requests | 42,905 | 27,526 | −15,379 |
| Fails | 0 | 0 | 0 |
| Median (ms) | 290 | 300 | +10 ms |
| 95%ile (ms) | 560 | 570 | +10 ms |
| 99%ile (ms) | 620 | 640 | +20 ms |
| Average (ms) | 327.41 | 332.96 | +5.55 ms |
| Min (ms) | 61 | 80 | +19 ms |
| Max (ms) | 764 | 802 | +38 ms |
| Avg Size (bytes) | 13 | 13 | 0 |
| Current RPS | 299 | 293 | −6 |
| Current Failures/s | 0 | 0 | 0 |

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested review from a team as code owners December 10, 2025 04:26
@abrarsheikh abrarsheikh added the `go` label (add ONLY when ready to merge, run all tests) Dec 10, 2025
Signed-off-by: abrar <abrar@anyscale.com>

@harshit-anyscale harshit-anyscale left a comment

lgtm, except model_1.pt file is added but has no changes

@ray-gardener ray-gardener Bot added the `serve` label (Ray Serve Related Issue) Dec 10, 2025
Comment thread python/ray/serve/batching.py Outdated
Comment thread python/ray/serve/batching.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>

## Using model multiplexing with batching

You can combine model multiplexing with the `@serve.batch` decorator for efficient batched inference. When you use both features together, Ray Serve automatically splits batches by model ID to ensure each batch contains only requests for the same model. This prevents issues where a single batch would contain requests targeting different models.

Contributor

The way I understand this description is that Serve will treat each model's batch independently, i.e., wait to reach `max_batch_size` or the timeout before firing for each model. In reality, it waits for `max_batch_size` or the timeout across all models. For example, if `max_batch_size=8`, Serve will process sub-batches of size [1, 4, 3] instead of waiting for each model to have 8 requests.
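
The waiting behavior described in this comment can be sketched as follows. This is a simplified illustration, not Ray Serve's internals: `collect_batch` and `sub_batch_sizes` are hypothetical helpers, and requests are represented only by their model IDs. The collector applies one size limit and one timeout to the shared queue, regardless of which model each request targets, and the split by model happens afterwards.

```python
import asyncio


async def collect_batch(queue, max_batch_size, timeout_s):
    """Gather requests until the size limit or timeout is hit, across all models."""
    batch = [await queue.get()]  # block until the first request arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


def sub_batch_sizes(batch):
    """Count requests per model ID, in order of first appearance."""
    sizes = {}
    for model_id in batch:
        sizes[model_id] = sizes.get(model_id, 0) + 1
    return list(sizes.values())


async def demo():
    queue = asyncio.Queue()
    # Eight queued requests spread unevenly over three models.
    for model_id in ["m1", "m2", "m2", "m3", "m2", "m3", "m2", "m3"]:
        queue.put_nowait(model_id)
    batch = await collect_batch(queue, max_batch_size=8, timeout_s=0.05)
    return len(batch), sub_batch_sizes(batch)


batch_len, sizes = asyncio.run(demo())
# batch_len == 8 and sizes == [1, 4, 3]
```

The collector stops as soon as 8 requests have arrived in total, so no single model needs 8 requests of its own before its sub-batch is processed.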

Contributor Author

you are right.

@abrarsheikh abrarsheikh merged commit 1599fb7 into master Dec 18, 2025
6 checks passed
@abrarsheikh abrarsheikh deleted the 56633-abrar-batch branch December 18, 2025 21:33
Labels

`go` (add ONLY when ready to merge, run all tests), `serve` (Ray Serve Related Issue)

3 participants