[Docs] Document max_tasks_in_flight_per_actor vs max_concurrent_batches by Partth101 · Pull Request #60477 · ray-project/ray

Partth101 · 2026-01-24T17:22:30Z

Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. This addresses user confusion from a Slack thread where setting max_concurrent_batches=8 only resulted in 4 tasks running (because max_tasks_in_flight_per_actor defaults to 4).
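The "8 configured, 4 running" behavior can be reproduced with a toy model. The sketch below is not Ray Data code. It assumes that the dispatcher holds one in-flight permit per task from submission until the result returns, and that the actor runs at most `max_concurrent_batches` tasks at once. The function name `peak_concurrency` and the semaphore-based model are illustrative assumptions.

```python
import asyncio


async def peak_concurrency(
    max_concurrent_batches: int,
    max_tasks_in_flight: int,
    num_tasks: int = 32,
) -> int:
    """Return the highest number of tasks observed running at the same time."""
    # Actor-side limit: how many batches may execute concurrently.
    slots = asyncio.Semaphore(max_concurrent_batches)
    # Dispatcher-side limit: how many tasks may be outstanding per actor.
    in_flight = asyncio.Semaphore(max_tasks_in_flight)
    running = 0
    peak = 0

    async def run_task() -> None:
        nonlocal running, peak
        async with slots:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for processing one batch
            running -= 1

    async def submit() -> None:
        # A task counts as in flight until its result comes back.
        async with in_flight:
            await run_task()

    await asyncio.gather(*(submit() for _ in range(num_tasks)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(peak_concurrency(8, 4)))   # in-flight cap is the bottleneck
    print(asyncio.run(peak_concurrency(8, 16)))  # actor concurrency is the bottleneck
```

In this model the observed concurrency is the smaller of the two limits, which matches the Slack report: with 8 concurrent batches allowed but only 4 tasks in flight, no more than 4 run at once.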

The new section includes:

Parameter descriptions with verified default values from source code
Explanation of how they work together
Troubleshooting guidance for the autoscaling warning

Related issue: #60421

gemini-code-assist

Code Review

This pull request adds valuable documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor in vLLM batch inference, which will be very helpful for users. The explanation of the parameters, how they work together, and the troubleshooting guide for the autoscaling warning are clear and well-structured. I've added a couple of minor suggestions to improve consistency within the documentation. Overall, this is a great addition to the Ray Data documentation.

gemini-code-assist · 2026-01-24T17:23:53Z

+**max_concurrent_batches** (default: 8)
+    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 32``.


The recommendation for batch_size here is _batch_size >= 32_, but the docstring for vLLMEngineProcessorConfig in python/ray/data/llm.py suggests _batch_size >= 64_ for max_concurrent_batches to be effective. To maintain consistency between the documentation and the code's docstrings, it would be better to align this recommendation.

Suggested change

-**max_concurrent_batches** (default: 8)
-    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 32``.
+**max_concurrent_batches** (default: 8)
+    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.

Updated to batch_size >= 64 to match the public API docstring in llm.py:127.

gemini-code-assist · 2026-01-24T17:23:53Z

+.. code-block:: python
+
+    config = vLLMEngineProcessorConfig(
+        model_source="meta-llama/Llama-3.1-8B-Instruct",


For consistency with other examples in this document, could you please use unsloth/Llama-3.1-8B-Instruct as the model_source? The literalinclude just above this snippet also uses the unsloth model, so this change would make the new section more consistent.

Suggested change

-        model_source="meta-llama/Llama-3.1-8B-Instruct",
+        model_source="unsloth/Llama-3.1-8B-Instruct",

Changed to unsloth/Llama-3.1-8B-Instruct for consistency with the rest of the document.

Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. The new section includes: parameter descriptions with verified default values; an explanation of how they work together; troubleshooting guidance for the autoscaling warning. Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Issue: ray-project#60421

Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>

jeffreywang88

Thanks for the contribution!

jeffreywang88 · 2026-01-27T16:54:06Z

+    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.
+
+``max_tasks_in_flight_per_actor``, experimental, default: 4
+    How many tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.


Suggested change

-    How many tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.
+    The number of tasks Ray Data can queue per actor before waiting for results. This enables task prefetching so there are always tasks ready when the actor finishes one. Access through the ``experimental`` dict.

jeffreywang88 · 2026-01-27T16:54:45Z

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``max_concurrent_batches``, default: 8
+    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.


Suggested change

-    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency. Works well for ``batch_size >= 64``.
+    The number of batches that can execute concurrently within a single vLLM engine actor. This overlaps batch processing to hide tail latency.

The optimal batch size depends specifically on the workload.

jeffreywang88 · 2026-01-28T17:15:00Z

+If ``max_tasks_in_flight_per_actor``, which defaults to 4, is less than ``max_concurrent_batches``, which defaults to 8, the actor can't reach full concurrency because there aren't enough queued tasks to fill all concurrent slots.
+
+To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.


Suggested change

-If ``max_tasks_in_flight_per_actor``, which defaults to 4, is less than ``max_concurrent_batches``, which defaults to 8, the actor can't reach full concurrency because there aren't enough queued tasks to fill all concurrent slots.
-
-To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.
+With ``max_tasks_in_flight_per_actor < max_concurrent_batches``, Ray Data actors are undersaturated. To maximize throughput, increase ``max_tasks_in_flight_per_actor`` to keep the actor's task queue saturated.

jeffreywang88 · 2026-01-28T18:24:29Z

+Troubleshooting the autoscaling warning
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You may see this warning:
+
+.. code-block:: text
+
+    Actor Pool configuration will not allow it to scale up:
+    configured utilization threshold (175%) couldn't be reached with
+    configured max_concurrency=8 and max_tasks_in_flight_per_actor=4
+    (max utilization will be 50%)
+
+This appears when ``max_tasks_in_flight_per_actor / max_concurrent_batches`` is below Ray Data's utilization threshold. With the defaults, the ratio is 4 to 8, or 50%, so you can't reach the threshold.
+
+To silence this warning, set ``max_tasks_in_flight_per_actor`` high enough to exceed the 175% threshold:
+
+.. code-block:: python
+
+    config = vLLMEngineProcessorConfig(
+        model_source="unsloth/Llama-3.1-8B-Instruct",
+        max_concurrent_batches=8,
+        # 16/8 = 200%, which exceeds the 175% threshold
+        experimental={"max_tasks_in_flight_per_actor": 16},
+    )
+
+.. note::
+    This warning is informational and doesn't prevent execution. For most vLLM workloads, setting ``max_tasks_in_flight_per_actor`` equal to ``max_concurrent_batches`` is sufficient to achieve full throughput, even if the warning still appears. For example, set both to 8.
+


Great insight! Here's the full warning message:

2026-01-28 10:04:06,208 WARNING default_actor_autoscaler.py:241 -- ⚠️ Actor Pool configuration of the ActorPoolMapOperator[MapBatches(vLLMEngineStageUDF)] will not allow it to scale up: configured utilization threshold (175.0%) couldn't be reached with configured max_concurrency=8 and max_tasks_in_flight_per_actor=4 (max utilization will be max_tasks_in_flight_per_actor / max_concurrency = 50%)
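The percentages in that message follow directly from the ratio it prints. As a rough worked example (the helper names below are hypothetical, and whether Ray Data compares with "reach" or "strictly exceed" is an assumption taken from the warning's wording):

```python
import math


def max_utilization(max_tasks_in_flight_per_actor: int, max_concurrency: int) -> float:
    """Ratio quoted by the warning: in-flight cap divided by actor concurrency."""
    return max_tasks_in_flight_per_actor / max_concurrency


def min_tasks_in_flight_to_reach(threshold: float, max_concurrency: int) -> int:
    """Smallest in-flight cap whose ratio is at least `threshold`."""
    return math.ceil(threshold * max_concurrency)


if __name__ == "__main__":
    print(max_utilization(4, 8))                  # 0.5, the "50%" in the warning
    print(max_utilization(16, 8))                 # 2.0, above the 175% threshold
    print(min_tasks_in_flight_to_reach(1.75, 8))  # 14
```

With `max_concurrency=8`, a cap of 4 tops out at 50%, while 16 gives 200%; under this reading, 14 would be the smallest value that reaches 175%.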

However, this doesn't matter for fixed-size pools, which vLLMEngineStageUDF uses. Autoscaling never kicks in because min_size == max_size == initial_size, but a higher max_tasks_in_flight_per_actor is still good practice. We should adjust its default.

ray/python/ray/llm/_internal/batch/processor/vllm_engine_proc.py, line 287 in 221a193:

    **config.get_concurrency(autoscaling_enabled=False),

ray/python/ray/llm/_internal/batch/processor/base.py, line 127 in 221a193:

    return {"size": self.concurrency}

ray/python/ray/data/_internal/compute.py, lines 144 to 146 in 221a193:

    min_size = size
    max_size = size
    initial_size = size

We shouldn't raise this warning if fixed size pools are used in Ray Data. I'm working on a fix at the moment.

Let's do these:

Adjust the default value in ray/python/ray/llm/_internal/batch/processor/base.py (line 22 in 221a193) from 4 to 16:

    DEFAULT_MAX_TASKS_IN_FLIGHT = 4

Remove this block of comment.

Adjust the default values in the "Understanding the parameters" section above.

Here's the fix to bypass the warnings: #60569.

jeffreywang88 · 2026-01-28T18:49:52Z

Could you please fix microcheck as well?

…60569) ## Description The autoscaling validation warning was incorrectly raised for fixed-size actor pools (`min_size == max_size`). These pools don't scale up, so the warning doesn't apply. ## Related issues Context: #60477 (comment) ## Additional information After this change, when we run `python -m pytest -v -s test_vllm_engine_proc.py::test_generation_model`, we no longer observe autoscaling warnings in the log. Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
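The behavior described in that fix can be sketched as a small guard. This is an illustration only, not the code in default_actor_autoscaler.py: the function name `scale_up_warning`, its parameters, and the 1.75 default are assumptions drawn from the warning text quoted earlier in the thread.

```python
from typing import Optional


def scale_up_warning(
    min_size: int,
    max_size: int,
    max_concurrency: int,
    max_tasks_in_flight_per_actor: int,
    utilization_threshold: float = 1.75,
) -> Optional[str]:
    """Return a warning message, or None when no warning applies."""
    # A fixed-size pool never scales up, so the check is irrelevant.
    if min_size == max_size:
        return None

    max_util = max_tasks_in_flight_per_actor / max_concurrency
    if max_util < utilization_threshold:
        return (
            f"utilization threshold ({utilization_threshold:.1%}) can't be "
            f"reached: max utilization will be {max_util:.0%}"
        )
    return None


if __name__ == "__main__":
    print(scale_up_warning(1, 1, 8, 4))   # fixed-size pool: no warning
    print(scale_up_warning(1, 4, 8, 4))   # autoscaling pool: warning text
    print(scale_up_warning(1, 4, 8, 16))  # ratio above threshold: no warning
```

The early return is the point of the sketch: a pool with `min_size == max_size` is skipped before the utilization ratio is ever evaluated.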

Updated DEFAULT_MAX_TASKS_IN_FLIGHT to 16 in the source code to improve out-of-the-box throughput. Refined documentation and example code to reflect the new default, applied Google Style Guide fixes (active voice, backticks), and removed the obsolete troubleshooting section. Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>

jeffreywang88

LGTM, thank you again for the contribution! Kicking off release tests.

Partth101 · 2026-01-29T17:02:54Z

Thank you very much @jeffreywang-anyscale

Partth101 · 2026-02-01T16:14:04Z

@jeffreywang-anyscale Just checking if there’s anything else I can contribute to this PR, or if it’s ready to be merged?

jeffreywang88 · 2026-02-02T03:15:53Z

Retrying premerge tests

bveeramani

Stamp

…rent_batches (ray-project#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>


Partth101 requested a review from a team as a code owner January 24, 2026 17:22

gemini-code-assist Bot reviewed Jan 24, 2026


Partth101 force-pushed the docs/batch-concurrency-tuning branch from c2c32ac to 939967c Compare January 24, 2026 17:30

Partth101 force-pushed the docs/batch-concurrency-tuning branch from e9c30b1 to d8d07bd Compare January 24, 2026 17:52

ray-gardener Bot added the serve, docs, core, and community-contribution labels Jan 24, 2026

docs(data): tune vLLM concurrent batches and fix style

7e00a7f

Adds documentation explaining the interaction between max_concurrent_batches and max_tasks_in_flight_per_actor parameters for vLLM batch inference. Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>

Partth101 force-pushed the docs/batch-concurrency-tuning branch from 8541dc7 to 7e00a7f Compare January 24, 2026 21:26

Partth101 mentioned this pull request Jan 25, 2026

[data][llm] Explain the behavior of max_tasks_in_flight_per_actor vs. max_concurrent_batches and how to tune them in the documentation #60421

Closed

abrarsheikh requested a review from a team January 27, 2026 07:05

jeffreywang88 requested changes Jan 28, 2026


jeffreywang88 mentioned this pull request Jan 28, 2026

[data] Skip upscaling validation warning for fixed-size actor pools #60569

Merged

Partth101 and others added 2 commits January 28, 2026 23:09

Merge branch 'master' into docs/batch-concurrency-tuning

4ce9dce

jeffreywang88 added the go label (add ONLY when ready to merge, run all tests) and removed the serve label Jan 29, 2026

jeffreywang88 approved these changes Jan 29, 2026


kouroshHakha approved these changes Jan 29, 2026


kouroshHakha enabled auto-merge (squash) January 29, 2026 20:26

Merge branch 'master' into docs/batch-concurrency-tuning

c06e2ae

github-actions Bot disabled auto-merge February 1, 2026 16:12

Merge branch 'master' into docs/batch-concurrency-tuning

892a350

kouroshHakha enabled auto-merge (squash) February 4, 2026 02:17

Merge branch 'master' into docs/batch-concurrency-tuning

08078a6

github-actions Bot disabled auto-merge February 5, 2026 18:26

bveeramani approved these changes Feb 5, 2026


kouroshHakha merged commit cdb8706 into ray-project:master Feb 6, 2026
6 checks passed

elliot-barn pushed a commit that referenced this pull request Feb 9, 2026

[Docs][data.llm] Document max_tasks_in_flight_per_actor vs max_concur…

afcaeb0

…rent_batches (#60477) Signed-off-by: Parth Ghayal <parthmghayal@gmail.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>

kouroshHakha merged 7 commits into ray-project:master from Partth101:docs/batch-concurrency-tuning
