[Docs] Document max_tasks_in_flight_per_actor vs max_concurrent_batches #60477

Merged
kouroshHakha merged 7 commits into
ray-project:master from
Partth101:docs/batch-concurrency-tuning
Feb 6, 2026
@Partth101

Contributor

Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. This addresses user confusion from a Slack thread where setting max_concurrent_batches=8 only resulted in 4 tasks running (because max_tasks_in_flight_per_actor defaults to 4).

The new section includes:

- Parameter descriptions with verified default values from source code
- Explanation of how they work together
- Troubleshooting guidance for the autoscaling warning

Related issue: #60421
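
The behavior described above (8 concurrent batches configured, only 4 running) can be reproduced with a toy model. The sketch below is not Ray Data's scheduler. It assumes a dispatcher that keeps at most `max_tasks_in_flight_per_actor` tasks outstanding and an actor that runs at most `max_concurrent_batches` of them at once. The function name `peak_concurrency` and the `num_tasks` parameter are hypothetical.

```python
import asyncio


async def peak_concurrency(
    max_concurrent_batches: int,
    max_tasks_in_flight_per_actor: int,
    num_tasks: int = 32,
) -> int:
    """Return the highest number of batches that ran at the same time."""
    # Dispatcher-side cap: how many tasks may be outstanding on the actor.
    in_flight = asyncio.Semaphore(max_tasks_in_flight_per_actor)
    # Actor-side cap: how many batches the actor executes concurrently.
    slots = asyncio.Semaphore(max_concurrent_batches)
    running = 0
    peak = 0

    async def run_batch() -> None:
        nonlocal running, peak
        async with slots:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for batch inference
            running -= 1
        in_flight.release()  # the result came back, so a new task may be sent

    tasks = []
    for _ in range(num_tasks):
        await in_flight.acquire()  # wait until the actor's queue has room
        tasks.append(asyncio.create_task(run_batch()))
    await asyncio.gather(*tasks)
    return peak


print(asyncio.run(peak_concurrency(8, 4)))   # 4: the queue limit caps concurrency
print(asyncio.run(peak_concurrency(8, 16)))  # 8: all concurrent slots are filled
```

In this model the peak is always the smaller of the two limits, which matches the Slack report that motivated the PR.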

@Partth101 Partth101 requested a review from a team as a code owner January 24, 2026 17:22

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds valuable documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor in vLLM batch inference, which will be very helpful for users. The explanation of the parameters, how they work together, and the troubleshooting guide for the autoscaling warning are clear and well-structured. I've added a couple of minor suggestions to improve consistency within the documentation. Overall, this is a great addition to the Ray Data documentation.

Comment thread doc/source/data/working-with-llms.rst Outdated
Comment on lines +346 to +347
**max_concurrent_batches** (default: 8)
The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 32``.

Contributor

medium

The recommendation for batch_size here is `batch_size >= 32`, but the docstring for `vLLMEngineProcessorConfig` in python/ray/data/llm.py suggests `batch_size >= 64` for max_concurrent_batches to be effective. To maintain consistency between the documentation and the code's docstrings, it would be better to align this recommendation.

Suggested change
- **max_concurrent_batches** (default: 8)
- The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 32``.
+ **max_concurrent_batches** (default: 8)
+ The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.

Contributor Author

Updated to batch_size >= 64 to match the public API docstring in llm.py:127.

Comment thread doc/source/data/working-with-llms.rst Outdated
.. code-block:: python

    config = vLLMEngineProcessorConfig(
        model_source="meta-llama/Llama-3.1-8B-Instruct",

Contributor

medium

For consistency with other examples in this document, could you please use unsloth/Llama-3.1-8B-Instruct as the model_source? The literalinclude just above this snippet also uses the unsloth model, so this change would make the new section more consistent.

Suggested change
- model_source="meta-llama/Llama-3.1-8B-Instruct",
+ model_source="unsloth/Llama-3.1-8B-Instruct",

Contributor Author

Changed to unsloth/Llama-3.1-8B-Instruct for consistency with the rest of the document.

@Partth101 Partth101 force-pushed the docs/batch-concurrency-tuning branch from c2c32ac to 939967c Compare January 24, 2026 17:30
Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference.

The new section includes:
Parameter descriptions with verified default values
Explanation of how they work together
Troubleshooting guidance for the autoscaling warning

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>

Issue: ray-project#60421
@Partth101 Partth101 force-pushed the docs/batch-concurrency-tuning branch from e9c30b1 to d8d07bd Compare January 24, 2026 17:52
@ray-gardener ray-gardener Bot added the serve, docs, core, and community-contribution labels Jan 24, 2026
Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference.

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>

@jeffreywang88 jeffreywang88 left a comment

Contributor

Thanks for the contribution!

Comment thread doc/source/data/working-with-llms.rst Outdated
The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.

``max_tasks_in_flight_per_actor``, experimental, default: 4
How many tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.

Contributor

Suggested change
- How many tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.
+ The number of tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.
Comment thread doc/source/data/working-with-llms.rst Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``max_concurrent_batches``, default: 8
The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.

Contributor

Suggested change
- The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.
+ The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency.

The optimal batch size depends specifically on the workload.

Comment thread doc/source/data/working-with-llms.rst Outdated
Comment on lines +360 to +362
If ``max_tasks_in_flight_per_actor``, which defaults to 4, is less than ``max_concurrent_batches``, which defaults to 8, the actor can't reach full concurrency because there aren't enough queued tasks to fill all concurrent slots.

To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.

Contributor

Suggested change
- If ``max_tasks_in_flight_per_actor``, which defaults to 4, is less than ``max_concurrent_batches``, which defaults to 8, the actor can't reach full concurrency because there aren't enough queued tasks to fill all concurrent slots.
- To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.
+ With ``max_tasks_in_flight_per_actor < max_concurrent_batches``, Ray Data actors are undersaturated. To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.
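
One way to keep the two settings in step is to derive the queue depth from the concurrency. The helper below is a hypothetical sketch, not part of Ray. It only builds keyword arguments using the parameter names that appear in this PR, and `queue_factor` is an assumed tuning knob.

```python
def concurrency_kwargs(max_concurrent_batches: int, queue_factor: int = 2) -> dict:
    """Build config kwargs whose queue depth is never below the concurrency.

    queue_factor is a hypothetical multiplier: 1 queues exactly as many tasks
    as there are concurrent slots, 2 keeps one extra task waiting per slot.
    """
    if max_concurrent_batches < 1 or queue_factor < 1:
        raise ValueError("max_concurrent_batches and queue_factor must be >= 1")
    return {
        "max_concurrent_batches": max_concurrent_batches,
        # Per this PR, the setting is passed through the ``experimental`` dict.
        "experimental": {
            "max_tasks_in_flight_per_actor": max_concurrent_batches * queue_factor
        },
    }


print(concurrency_kwargs(8))
```

The resulting dictionary could then be unpacked into `vLLMEngineProcessorConfig(model_source=..., **concurrency_kwargs(8))`, which mirrors the 8 and 16 pairing used in the PR's own example.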
Comment thread doc/source/data/working-with-llms.rst Outdated
Comment on lines +369 to +396
Troubleshooting the autoscaling warning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You may see this warning:

.. code-block:: text

    Actor Pool configuration will not allow it to scale up:
    configured utilization threshold (175%) couldn't be reached with
    configured max_concurrency=8 and max_tasks_in_flight_per_actor=4
    (max utilization will be 50%)

This appears when ``max_tasks_in_flight_per_actor / max_concurrent_batches`` is below Ray Data's utilization threshold. With the defaults, the ratio is 4 to 8, or 50%, so you can't reach the threshold.
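
The ratio in the warning can be checked by hand. The sketch below only reproduces the arithmetic quoted in the warning text. The function name is hypothetical, and the assumption that reaching the threshold means a ratio of at least 1.75 is inferred from the message rather than taken from Ray's source.

```python
UTILIZATION_THRESHOLD = 1.75  # 175%, as quoted in the warning text


def max_utilization(max_tasks_in_flight_per_actor: int, max_concurrency: int) -> float:
    """Highest utilization an actor can report: queued tasks per concurrent slot."""
    return max_tasks_in_flight_per_actor / max_concurrency


for in_flight in (4, 8, 16):
    ratio = max_utilization(in_flight, 8)
    reaches = ratio >= UTILIZATION_THRESHOLD  # assumed comparison
    print(f"{in_flight}/8 = {ratio:.0%}, reaches threshold: {reaches}")
```

With `max_concurrency=8`, values of 4 and 8 give 50% and 100%, both short of 175%, while 16 gives 200%.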

To silence this warning, set ``max_tasks_in_flight_per_actor`` high enough to exceed the 175% threshold:

.. code-block:: python

    config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",
        max_concurrent_batches=8,
        # 16/8 = 200%, which exceeds the 175% threshold
        experimental={"max_tasks_in_flight_per_actor": 16},
    )

.. note::
    This warning is informational and doesn't prevent execution. For most vLLM workloads, setting ``max_tasks_in_flight_per_actor`` equal to ``max_concurrent_batches`` is sufficient to achieve full throughput, even if the warning still appears. For example, set both to 8.

Contributor

Great insight! Here's the full warning message:

2026-01-28 10:04:06,208 WARNING default_actor_autoscaler.py:241 -- ⚠️  Actor Pool configuration of the ActorPoolMapOperator[MapBatches(vLLMEngineStageUDF)] will not allow it to scale up: configured utilization threshold (175.0%) couldn't be reached with configured max_concurrency=8 and max_tasks_in_flight_per_actor=4 (max utilization will be max_tasks_in_flight_per_actor / max_concurrency = 50%)

However, this doesn't matter for fixed size pools which vLLMEngineStageUDF uses. Autoscaling will never kick in because we have min_size == max_size == initial_size, but it's still a good practice to have higher max_tasks_in_flight_per_actor. We should adjust its default.

**config.get_concurrency(autoscaling_enabled=False),

return {"size": self.concurrency}

min_size = size
max_size = size
initial_size = size

We shouldn't raise this warning if fixed size pools are used in Ray Data. I'm working on a fix at the moment.
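
The intended behavior can be sketched as a guard that skips the check for fixed-size pools. This is a hypothetical illustration of the logic described in this comment, not the code in `default_actor_autoscaler.py` or in the follow-up fix. The function name, its signature, and the `<` comparison are assumptions.

```python
def should_warn_cannot_scale_up(
    min_size: int,
    max_size: int,
    max_tasks_in_flight_per_actor: int,
    max_concurrency: int,
    utilization_threshold: float = 1.75,
) -> bool:
    """Decide whether the 'will not allow it to scale up' warning applies."""
    # A fixed-size pool never scales up, so the warning is irrelevant for it.
    if min_size == max_size:
        return False
    # Otherwise warn when the best achievable utilization stays under the threshold.
    return max_tasks_in_flight_per_actor / max_concurrency < utilization_threshold


print(should_warn_cannot_scale_up(1, 1, 4, 8))   # fixed-size pool
print(should_warn_cannot_scale_up(1, 4, 4, 8))   # autoscaling pool, 50% utilization
print(should_warn_cannot_scale_up(1, 4, 16, 8))  # autoscaling pool, 200% utilization
```

Only the second call returns `True`: the pool can grow, but its utilization tops out at 50%.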

Contributor

Let's do these:

  1. Adjust the default value of `DEFAULT_MAX_TASKS_IN_FLIGHT` from 4 to 16.
  2. Remove this block of comment.
  3. Adjust the default values in the "Understanding the parameters" section above.

Contributor

Here's the fix to bypass the warnings: #60569.

@jeffreywang88

Contributor

Could you please fix microcheck as well?

bveeramani pushed a commit that referenced this pull request Jan 28, 2026
…60569)

## Description
The autoscaling validation warning was incorrectly raised for fixed-size
actor pools (`min_size == max_size`). These pools don't scale up, so the
warning doesn't apply.

## Related issues
Context:
#60477 (comment)

## Additional information
After this change, when we run `python -m pytest -v -s
test_vllm_engine_proc.py::test_generation_model`, we no longer observe
autoscaling warnings in the log.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Partth101 and others added 2 commits January 28, 2026 23:09
Updated DEFAULT_MAX_TASKS_IN_FLIGHT to 16 in the source code to improve out-of-the-box throughput.

Refined documentation and example code to reflect the new default, applied Google Style Guide fixes (active voice, backticks), and removed the obsolete troubleshooting section.

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
@jeffreywang88 jeffreywang88 added the go label and removed the serve label Jan 29, 2026

@jeffreywang88 jeffreywang88 left a comment

Contributor

LGTM, thank you again for the contribution! Kicking off release tests.

@Partth101

Contributor Author

Thank you very much @jeffreywang-anyscale

@kouroshHakha kouroshHakha enabled auto-merge (squash) January 29, 2026 20:26
@github-actions github-actions Bot disabled auto-merge February 1, 2026 16:12
@Partth101

Contributor Author

@jeffreywang-anyscale Just checking whether there's anything else I can contribute to this PR, or whether it's ready to be merged?

@jeffreywang88

Contributor

Retrying premerge tests

@kouroshHakha kouroshHakha enabled auto-merge (squash) February 4, 2026 02:17
@github-actions github-actions Bot disabled auto-merge February 5, 2026 18:26

@bveeramani bveeramani left a comment

Member

Stamp

@kouroshHakha kouroshHakha merged commit cdb8706 into ray-project:master Feb 6, 2026
6 checks passed
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
…rent_batches (ray-project#60477)


Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
…rent_batches (#60477)

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…ay-project#60569)
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…rent_batches (ray-project#60477)

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
…rent_batches (ray-project#60477)

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Labels

community-contribution, core, docs, go

4 participants