[data][llm] Enable autoscaling GPU stages #61130

Merged
kouroshHakha merged 2 commits into ray-project:master from jeffreywang88:autoscaling-ray-data-llm
Feb 19, 2026
Conversation

@jeffreywang88

Contributor

Description

To support preemption, the entire Ray Data LLM pipeline must be auto-scalable. Currently, GPU stages rely on fixed actor pools, which limits elasticity.

In Ray Data, resources are allocated in two phases to prevent starvation:

  • Phase 1: Reservation: Each operator gets reservation_ratio * global_limits / num_ops (reservation_ratio defaults to 0.5), clamped to min_resource_usage (from min_size) and max_resource_usage (from max_size). Upstream operators are prioritized if resources are insufficient, because downstream ops can wait for upstream ops to complete and release resources.

    default_reserved = limits.scale(self._reservation_ratio / (len(eligible_ops)))
    for index, op in enumerate(eligible_ops):
        # Reserve at least half of the default reserved resources for the outputs.
        # This makes sure that we will have enough budget to pull blocks from the
        # op.
        reserved_for_outputs = ExecutionResources(
            0, 0, max(default_reserved.object_store_memory / 2, 1)
        )
        reserved_for_tasks = default_reserved.subtract(reserved_for_outputs)
        min_resource_usage, max_resource_usage = op.min_max_resource_requirements()
        if min_resource_usage is not None:
            reserved_for_tasks = reserved_for_tasks.max(min_resource_usage)
        if max_resource_usage is not None:
            reserved_for_tasks = reserved_for_tasks.min(max_resource_usage)
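As a rough illustration of the Phase 1 arithmetic, the sketch below applies the same formula to a single scalar resource (for example, a GPU count) instead of an ExecutionResources object. The function name reserve_per_op and its parameters are hypothetical and not part of Ray Data; the output-reservation split for object store memory is left out.

```python
def reserve_per_op(total, num_ops, reservation_ratio=0.5,
                   min_usage=None, max_usage=None):
    """Scalar stand-in for the Phase 1 reservation of one resource.

    Hypothetical helper for illustration only; Ray Data operates on
    ExecutionResources objects rather than plain numbers.
    """
    # Default share: reservation_ratio * global_limits / num_ops.
    reserved = total * reservation_ratio / num_ops
    # Raise the reservation to the operator's minimum, if it has one.
    if min_usage is not None:
        reserved = max(reserved, min_usage)
    # Cap the reservation at the operator's maximum, if it has one.
    if max_usage is not None:
        reserved = min(reserved, max_usage)
    return reserved


# With 8 GPUs and 2 operators, the default reservation is 8 * 0.5 / 2 = 2.
default_share = reserve_per_op(8, 2)
# An operator that needs at least 4 GPUs is raised to 4.
raised_share = reserve_per_op(8, 2, min_usage=4)
# An operator that can use at most 1 GPU is capped at 1.
capped_share = reserve_per_op(8, 2, max_usage=1)
```

Applying the minimum before the maximum mirrors the order of the max and min calls in the excerpt above.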

  • Phase 2: Shared Allocation: Remaining resources are allocated in reverse topological order (downstream first). Each operator receives remaining_shared / (num_ops - i), with borrowing allowed for operators below min_scheduling_resources. Total allocation is capped at max_resource_usage.

    for i, op in enumerate(reversed(eligible_ops)):
        # By default, divide the remaining shared resources equally.
        op_shared = remaining_shared.scale(1.0 / (len(eligible_ops) - i))
        # But if the op's budget is less than `min_scheduling_resources`,
        # it will be useless. So we'll let the downstream operator
        # borrow some resources from the upstream operator, if remaining_shared
        # is still enough.
        to_borrow = (
            op.min_scheduling_resources()
            .subtract(self._op_budgets[op].add(op_shared))
            .max(ExecutionResources.zero())
        )
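The Phase 2 loop can be sketched with plain numbers to show how the downstream-first split and borrowing interact. This is a simplified model under stated assumptions: allocate_shared and its arguments are hypothetical names, each operator has a single scalar resource, the borrow is applied only when the remaining shared pool can cover it (as the comment in the excerpt suggests), and the max_resource_usage cap is omitted for brevity.

```python
def allocate_shared(budgets, min_scheduling, remaining_shared):
    """Distribute shared resources downstream-first over scalar budgets.

    budgets and min_scheduling are lists in topological order
    (upstream first). Hypothetical helper, not Ray Data's API.
    """
    num_ops = len(budgets)
    new_budgets = list(budgets)
    # Walk the operators in reverse topological order (downstream first).
    for i, idx in enumerate(reversed(range(num_ops))):
        # Equal split of what is left among the ops not yet visited.
        op_shared = remaining_shared / (num_ops - i)
        # Shortfall against the minimum needed to schedule anything.
        to_borrow = max(
            min_scheduling[idx] - (new_budgets[idx] + op_shared), 0.0
        )
        # Borrow only if the shared pool can still cover it.
        if to_borrow > 0 and op_shared + to_borrow <= remaining_shared:
            op_shared += to_borrow
        remaining_shared -= op_shared
        new_budgets[idx] += op_shared
    return new_budgets


# Two ops with 1 unit reserved each and 4 shared units. The downstream op
# needs 4 units to schedule, so it borrows 1 unit beyond its equal split.
with_borrow = allocate_shared([1.0, 1.0], [1.0, 4.0], 4.0)
# With no minimum scheduling requirement, the split is even.
even_split = allocate_shared([1.0, 1.0], [0.0, 0.0], 4.0)
```

In the first call the downstream operator ends with 4 units and the upstream operator with 2; in the second both end with 3.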

All operators are guaranteed at least min(reservation_minimum, min_replicas) resources, and downstream operators get priority in shared allocation. Tune via RAY_DATA_OP_RESERVATION_RATIO (default: 0.5). Starvation is not a concern in chained processors.

os.environ.get("RAY_DATA_OP_RESERVATION_RATIO", "0.5")
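A minimal sketch of reading this setting as a number, assuming the ratio is expected to lie in [0, 1]. The helper name get_reservation_ratio and the range check are illustrative assumptions, not Ray Data's own code; in practice the variable probably needs to be set before Ray Data reads its configuration.

```python
import os


def get_reservation_ratio(default="0.5"):
    """Parse RAY_DATA_OP_RESERVATION_RATIO into a float.

    Hypothetical helper; the [0, 1] bound is an assumption made for
    this sketch.
    """
    ratio = float(os.environ.get("RAY_DATA_OP_RESERVATION_RATIO", default))
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"reservation ratio must be in [0, 1], got {ratio}")
    return ratio


# Example: reserve a quarter of the resources and leave the rest shared.
os.environ["RAY_DATA_OP_RESERVATION_RATIO"] = "0.25"
ratio = get_reservation_ratio()
```

A lower ratio leaves more resources in the shared pool, which is allocated downstream-first.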

Additional information

Screenshot 2026-02-17 at 6 30 18 PM
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 requested a review from a team as a code owner February 18, 2026 02:31
@jeffreywang88 jeffreywang88 added the go add ONLY when ready to merge, run all tests label Feb 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

The pull request enables autoscaling for GPU stages in the Ray Data LLM pipeline, which is crucial for supporting preemption and improving elasticity. The changes involve updating documentation and code comments to reflect this new behavior, as well as modifying the get_concurrency method call in vllm_engine_proc.py and sglang_engine_proc.py to enable autoscaling. A new test case test_vllm_autoscaling_no_starvation has also been added to verify that chained vLLMEngineProcessor instances with autoscaling concurrency can run without starving each other. The changes are well-aligned with the objective of enabling autoscaling for GPU stages and improving the overall elasticity of the LLM pipeline.
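To make the difference between a fixed pool and an autoscaling pool concrete, the sketch below normalizes a concurrency setting into a (min, max) actor range. The function normalize_concurrency is hypothetical, and the mapping of an integer n to (1, n) when autoscaling is enabled is an assumption; the actual get_concurrency implementation is not shown in this thread.

```python
from typing import Tuple, Union


def normalize_concurrency(
    concurrency: Union[int, Tuple[int, int]],
    autoscaling_enabled: bool = True,
) -> Tuple[int, int]:
    """Return a (min, max) actor range for a stage.

    Hypothetical helper for illustration; the semantics are assumed,
    not taken from the Ray source.
    """
    if isinstance(concurrency, int):
        if concurrency < 1:
            raise ValueError("concurrency must be at least 1")
        # Fixed pool: min == max. Autoscaling pool: scale from 1 up to n.
        if autoscaling_enabled:
            return 1, concurrency
        return concurrency, concurrency
    low, high = concurrency
    if not 1 <= low <= high:
        raise ValueError("expected 1 <= min <= max")
    return low, high


fixed_pool = normalize_concurrency(4, autoscaling_enabled=False)
autoscaling_pool = normalize_concurrency(4)
explicit_range = normalize_concurrency((2, 8))
```

Under these assumptions, a fixed pool holds all 4 actors for the lifetime of the stage, while an autoscaling pool can shrink to 1 and release GPUs for other stages.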

@ray-gardener ray-gardener Bot added the community-contribution Contributed by the community label Feb 18, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 force-pushed the autoscaling-ray-data-llm branch from e5d1d2e to dee5c51 Compare February 18, 2026 07:45
@kouroshHakha kouroshHakha merged commit 66b2c8b into ray-project:master Feb 19, 2026
6 checks passed
Labels

claude-code-assisted community-contribution Contributed by the community go add ONLY when ready to merge, run all tests

2 participants