[Docs] Document max_tasks_in_flight_per_actor vs max_concurrent_batches #60477
Code Review
This pull request adds valuable documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor in vLLM batch inference, which will be very helpful for users. The explanation of the parameters, how they work together, and the troubleshooting guide for the autoscaling warning are clear and well-structured. I've added a couple of minor suggestions to improve consistency within the documentation. Overall, this is a great addition to the Ray Data documentation.
**max_concurrent_batches** (default: 8)
The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 32``.
The recommendation for batch_size here is _batch_size >= 32_, but the docstring for vLLMEngineProcessorConfig in python/ray/data/llm.py suggests _batch_size >= 64_ for max_concurrent_batches to be effective. To maintain consistency between the documentation and the code's docstrings, it would be better to align this recommendation.
- **max_concurrent_batches** (default: 8)
- The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 32``.
+ **max_concurrent_batches** (default: 8)
+ The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.
Updated to batch_size >= 64 to match the public API docstring in llm.py:127.
.. code-block:: python

    config = vLLMEngineProcessorConfig(
        model_source="meta-llama/Llama-3.1-8B-Instruct",
For consistency with other examples in this document, could you please use unsloth/Llama-3.1-8B-Instruct as the model_source? The literalinclude just above this snippet also uses the unsloth model, so this change would make the new section more consistent.
- model_source="meta-llama/Llama-3.1-8B-Instruct",
+ model_source="unsloth/Llama-3.1-8B-Instruct",
Changed to unsloth/Llama-3.1-8B-Instruct for consistency with the rest of the document.
c2c32ac to 939967c
Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference.

The new section includes:
- Parameter descriptions with verified default values
- Explanation of how they work together
- Troubleshooting guidance for the autoscaling warning

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
Issue: ray-project#60421
e9c30b1 to d8d07bd
Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
8541dc7 to 7e00a7f
jeffreywang88 left a comment
Thanks for the contribution!
The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.

``max_tasks_in_flight_per_actor``, experimental, default: 4
How many tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.
- How many tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.
+ The number of tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``max_concurrent_batches``, default: 8
The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.
- The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.
+ The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency.
The optimal batch size depends on the workload.
If ``max_tasks_in_flight_per_actor``, which defaults to 4, is less than ``max_concurrent_batches``, which defaults to 8, the actor can't reach full concurrency because there aren't enough queued tasks to fill all concurrent slots.

To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.
- If ``max_tasks_in_flight_per_actor``, which defaults to 4, is less than ``max_concurrent_batches``, which defaults to 8, the actor can't reach full concurrency because there aren't enough queued tasks to fill all concurrent slots.
- To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.
+ With ``max_tasks_in_flight_per_actor < max_concurrent_batches``, Ray Data actors are undersaturated. To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.
Troubleshooting the autoscaling warning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You may see this warning:

.. code-block:: text

    Actor Pool configuration will not allow it to scale up:
    configured utilization threshold (175%) couldn't be reached with
    configured max_concurrency=8 and max_tasks_in_flight_per_actor=4
    (max utilization will be 50%)

This appears when ``max_tasks_in_flight_per_actor / max_concurrent_batches`` is below Ray Data's utilization threshold. With the defaults, the ratio is 4 to 8, or 50%, so you can't reach the threshold.

To silence this warning, set ``max_tasks_in_flight_per_actor`` high enough to exceed the 175% threshold:

.. code-block:: python

    config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",
        max_concurrent_batches=8,
        # 16/8 = 200%, which exceeds the 175% threshold
        experimental={"max_tasks_in_flight_per_actor": 16},
    )

.. note::
    This warning is informational and doesn't prevent execution. For most vLLM workloads, setting ``max_tasks_in_flight_per_actor`` equal to ``max_concurrent_batches`` is sufficient to achieve full throughput, even if the warning still appears. For example, set both to 8.
Great insight! Here's the full warning message:
2026-01-28 10:04:06,208 WARNING default_actor_autoscaler.py:241 -- ⚠️ Actor Pool configuration of the ActorPoolMapOperator[MapBatches(vLLMEngineStageUDF)] will not allow it to scale up: configured utilization threshold (175.0%) couldn't be reached with configured max_concurrency=8 and max_tasks_in_flight_per_actor=4 (max utilization will be max_tasks_in_flight_per_actor / max_concurrency = 50%)
However, this doesn't matter for fixed-size pools, which vLLMEngineStageUDF uses. Autoscaling never kicks in because min_size == max_size == initial_size, but it's still good practice to set a higher max_tasks_in_flight_per_actor. We should adjust its default.
ray/python/ray/data/_internal/compute.py
Lines 144 to 146 in 221a193
We shouldn't raise this warning when Ray Data uses a fixed-size pool. I'm working on a fix at the moment.
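
The check discussed in this thread can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Ray Data's actual implementation: the function name `autoscaling_warning`, its parameters, and the exact comparison are hypothetical. Only the ratio `max_tasks_in_flight_per_actor / max_concurrency`, the 175% threshold, and the idea of skipping fixed-size pools (`min_size == max_size`) come from the comments above.

```python
from typing import Optional


def autoscaling_warning(
    max_concurrency: int,
    max_tasks_in_flight_per_actor: int,
    min_size: int,
    max_size: int,
    utilization_threshold: float = 1.75,  # the 175% threshold quoted in the warning
) -> Optional[str]:
    """Hypothetical helper: return a warning string, or None if none applies."""
    # A fixed-size pool never scales up, so the threshold is irrelevant for it.
    if min_size == max_size:
        return None

    # Highest utilization the pool can report with these settings.
    max_utilization = max_tasks_in_flight_per_actor / max_concurrency

    # Assumed comparison: warn when the threshold can never be reached.
    if max_utilization < utilization_threshold:
        return (
            f"configured utilization threshold ({utilization_threshold:.1%}) "
            f"couldn't be reached with configured max_concurrency={max_concurrency} "
            f"and max_tasks_in_flight_per_actor={max_tasks_in_flight_per_actor} "
            f"(max utilization will be {max_utilization:.0%})"
        )
    return None


# Autoscaling pool with the old defaults: the threshold is unreachable.
print(autoscaling_warning(8, 4, min_size=1, max_size=4))
# Fixed-size pool, as used by the vLLM stage: no warning.
print(autoscaling_warning(8, 4, min_size=2, max_size=2))
# Autoscaling pool with 16 tasks in flight: 16 / 8 = 200%, no warning.
print(autoscaling_warning(8, 16, min_size=1, max_size=4))
```

Under this sketch, raising `max_tasks_in_flight_per_actor` to 16 and exempting fixed-size pools both make the warning go away, for different reasons.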
Let's do these:
- Adjust the default value from 4 to 16.
- Remove this block of comment.
- Adjust the default values in the "Understanding the parameters" section above.
Could you please fix microcheck as well?
…60569)

## Description
The autoscaling validation warning was incorrectly raised for fixed-size actor pools (`min_size == max_size`). These pools don't scale up, so the warning doesn't apply.

## Related issues
Context: #60477 (comment)

## Additional information
After this change, when we run `python -m pytest -v -s test_vllm_engine_proc.py::test_generation_model`, we no longer observe autoscaling warnings in the log.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Updated DEFAULT_MAX_TASKS_IN_FLIGHT to 16 in the source code to improve out-of-the-box throughput. Refined documentation and example code to reflect the new default, applied Google Style Guide fixes (active voice, backticks), and removed the obsolete troubleshooting section. Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
jeffreywang88 left a comment
LGTM, thank you again for the contribution! Kicking off release tests.
Thank you very much @jeffreywang-anyscale

@jeffreywang-anyscale Just checking whether there’s anything else I can contribute to this PR, or whether it’s ready to be merged.

Retrying premerge tests
…rent_batches (ray-project#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
…rent_batches (ray-project#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
…rent_batches (#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…rent_batches (#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
…ay-project#60569) ## Description The autoscaling validation warning was incorrectly raised for fixed-size actor pools (`min_size == max_size`). These pools don't scale up, so the warning doesn't apply. ## Related issues Context: ray-project#60477 (comment) ## Additional information After this change, when we run `python -m pytest -v -s test_vllm_engine_proc.py::test_generation_model`, we no longer observe autoscaling warnings in the log. Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…rent_batches (ray-project#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. This addresses user confusion from a Slack thread where setting max_concurrent_batches=8 only resulted in 4 tasks running (because max_tasks_in_flight_per_actor defaults to 4).
The new section includes:
- Parameter descriptions with verified default values from source code
- Explanation of how they work together
- Troubleshooting guidance for the autoscaling warning
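
As a rough mental model of the Slack-thread behavior described above (setting `max_concurrent_batches=8` but seeing only 4 tasks running), the number of batches an actor can run at once is capped by whichever of the two limits is smaller. The sketch below is an assumption made for illustration, not Ray Data's scheduler code, and the helper name `effective_concurrency` is hypothetical.

```python
def effective_concurrency(
    max_concurrent_batches: int,
    max_tasks_in_flight_per_actor: int,
) -> int:
    """Hypothetical helper: batches one actor can actually run concurrently."""
    # The actor can't run more batches than Ray Data has submitted to it,
    # and it can't run more than its own concurrency limit allows.
    return min(max_concurrent_batches, max_tasks_in_flight_per_actor)


# Old default of 4 tasks in flight: only 4 of the 8 slots are ever filled.
print(effective_concurrency(8, 4))
# With 16 tasks in flight, all 8 concurrent slots can stay busy.
print(effective_concurrency(8, 16))
```

This is why the PR treats `max_tasks_in_flight_per_actor >= max_concurrent_batches` as the condition for keeping the actor saturated.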
Related Issue 60421