[data][llm] Enable autoscaling GPU stages by jeffreywang88 · Pull Request #61130 · ray-project/ray

jeffreywang88 · 2026-02-18T02:31:08Z

Description

To support preemption, the entire Ray Data LLM pipeline must be auto-scalable. Currently, GPU stages rely on fixed-size actor pools, which limits elasticity.

In Ray Data, resources are allocated in two phases to prevent starvation:

Phase 1: Reservation. Each operator gets reservation_ratio * global_limits / num_ops (reservation_ratio defaults to 0.5), raised to at least min_resource_usage (derived from min_size) and capped at max_resource_usage (derived from max_size). If resources are insufficient, upstream operators are prioritized, because downstream operators can wait for upstream operators to complete and release resources.

ray/python/ray/data/_internal/execution/resource_manager.py

Lines 803 to 819 in f1a1039

    
    default_reserved = limits.scale(self._reservation_ratio / (len(eligible_ops)))
    for index, op in enumerate(eligible_ops):
        # Reserve at least half of the default reserved resources for the outputs.
        # This makes sure that we will have enough budget to pull blocks from the
        # op.
        reserved_for_outputs = ExecutionResources(
            0, 0, max(default_reserved.object_store_memory / 2, 1)
        )
        reserved_for_tasks = default_reserved.subtract(reserved_for_outputs)
        min_resource_usage, max_resource_usage = op.min_max_resource_requirements()
        if min_resource_usage is not None:
            reserved_for_tasks = reserved_for_tasks.max(min_resource_usage)
        if max_resource_usage is not None:
            reserved_for_tasks = reserved_for_tasks.min(max_resource_usage)
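The clamping in Phase 1 can be illustrated with a minimal sketch. It is not the Ray implementation: it assumes a single scalar resource (for example, GPUs), ignores the separate reservation for outputs, and the function name reserve_for_op and its parameters are hypothetical.

```python
from typing import Optional


def reserve_for_op(
    global_limit: float,
    num_ops: int,
    reservation_ratio: float = 0.5,
    min_usage: Optional[float] = None,
    max_usage: Optional[float] = None,
) -> float:
    """Hypothetical scalar model of the Phase 1 reservation for one operator."""
    # Default share: the reserved fraction of the cluster, split evenly.
    reserved = global_limit * reservation_ratio / num_ops
    # Raise to the operator's minimum requirement, if it declares one.
    if min_usage is not None:
        reserved = max(reserved, min_usage)
    # Cap at the operator's maximum requirement, if it declares one.
    if max_usage is not None:
        reserved = min(reserved, max_usage)
    return reserved


# 8 GPUs, 2 operators, default ratio 0.5: each operator reserves 2 GPUs.
default_share = reserve_for_op(8, 2)
# An operator whose minimum pool needs 4 GPUs is raised to 4.
raised_share = reserve_for_op(8, 2, min_usage=4)
# An operator whose maximum pool needs 1 GPU is capped at 1.
capped_share = reserve_for_op(8, 2, max_usage=1)
```

In this model, a GPU stage with an autoscaling pool contributes its minimum pool size as min_usage and its maximum pool size as max_usage.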

Phase 2: Shared Allocation. The remaining resources are allocated in reverse topological order (downstream first). Each operator receives remaining_shared / (num_ops - i), where i is its position in that reversed order. An operator whose budget is still below min_scheduling_resources may borrow more, provided enough shared resources remain. Total allocation is capped at max_resource_usage.

ray/python/ray/data/_internal/execution/resource_manager.py

Lines 967 to 978 in f1a1039

    
    for i, op in enumerate(reversed(eligible_ops)):
        # By default, divide the remaining shared resources equally.
        op_shared = remaining_shared.scale(1.0 / (len(eligible_ops) - i))
        # But if the op's budget is less than `min_scheduling_resources`,
        # it will be useless. So we'll let the downstream operator
        # borrow some resources from the upstream operator, if remaining_shared
        # is still enough.
        to_borrow = (
            op.min_scheduling_resources()
            .subtract(self._op_budgets[op].add(op_shared))
            .max(ExecutionResources.zero())
        )
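The Phase 2 loop can be sketched with scalars to show how borrowing plays out. This is a simplified model under stated assumptions, not the Ray code: the excerpt above stops at the to_borrow computation, so how the borrowed amount is applied and how the cap is enforced are assumptions here, and allocate_shared and its parameters are hypothetical names.

```python
from typing import Dict, List, Optional


def allocate_shared(
    ops: List[str],
    budgets: Dict[str, float],
    remaining_shared: float,
    min_scheduling: Dict[str, float],
    max_usage: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Hypothetical scalar model of Phase 2; `ops` is in topological order."""
    max_usage = max_usage or {}
    result = dict(budgets)
    for i, op in enumerate(reversed(ops)):
        # By default, divide what is left equally among the ops not yet served.
        op_shared = remaining_shared / (len(ops) - i)
        # Shortfall against the smallest budget that lets the op schedule work.
        to_borrow = max(min_scheduling[op] - (result[op] + op_shared), 0.0)
        # Assumption: borrow only if the shared pool can still cover it.
        if to_borrow > 0 and op_shared + to_borrow <= remaining_shared:
            op_shared += to_borrow
        # Assumption: never let the total budget exceed the op's maximum.
        cap = max_usage.get(op)
        if cap is not None:
            op_shared = max(min(op_shared, cap - result[op]), 0.0)
        result[op] += op_shared
        remaining_shared -= op_shared
    return result


# Two chained ops; "infer" is downstream and needs 3 units to schedule at all.
final_budgets = allocate_shared(
    ops=["read", "infer"],
    budgets={"read": 1.0, "infer": 0.0},
    remaining_shared=4.0,
    min_scheduling={"read": 0.0, "infer": 3.0},
)
# "infer" gets its equal share of 2 plus 1 borrowed unit; "read" gets the last 1.
```

Here the downstream operator is served first and borrows one unit beyond its equal share, which leaves the upstream operator with a smaller but still nonzero shared allocation.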

Every operator is guaranteed at least min(reservation_minimum, min_replicas) resources, and downstream operators get priority in the shared allocation, so starvation is not a concern in chained processors. The reservation ratio can be tuned via the RAY_DATA_OP_RESERVATION_RATIO environment variable (default: 0.5).

ray/python/ray/data/context.py

Line 210 in f1a1039

os.environ.get("RAY_DATA_OP_RESERVATION_RATIO", "0.5")
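As a rough illustration of how the override behaves, the sketch below reads the variable the same way the line above does and converts it to a float. The helper name read_reservation_ratio and the range check are assumptions for illustration, not part of Ray. Because the lookup happens in context.py, the variable most likely needs to be set before Ray Data is imported.

```python
import os


def read_reservation_ratio(default: str = "0.5") -> float:
    """Hypothetical helper: parse RAY_DATA_OP_RESERVATION_RATIO as a float."""
    ratio = float(os.environ.get("RAY_DATA_OP_RESERVATION_RATIO", default))
    # Assumption: a ratio only makes sense as a fraction of the global limits.
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"reservation ratio must be in [0, 1], got {ratio}")
    return ratio


# Reserve 30% of cluster resources up front instead of the default 50%.
os.environ["RAY_DATA_OP_RESERVATION_RATIO"] = "0.3"
ratio = read_reservation_ratio()
```

A lower ratio leaves more resources in the shared pool that Phase 2 distributes downstream-first.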

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

gemini-code-assist

Code Review

The pull request enables autoscaling for GPU stages in the Ray Data LLM pipeline, which is crucial for supporting preemption and improving elasticity. The changes involve updating documentation and code comments to reflect this new behavior, as well as modifying the get_concurrency method call in vllm_engine_proc.py and sglang_engine_proc.py to enable autoscaling. A new test case test_vllm_autoscaling_no_starvation has also been added to verify that chained vLLMEngineProcessor instances with autoscaling concurrency can run without starving each other. The changes are well-aligned with the objective of enabling autoscaling for GPU stages and improving the overall elasticity of the LLM pipeline.

Enable autoscaling GPU stages

daee1fb

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 requested a review from a team as a code owner February 18, 2026 02:31

jeffreywang88 added the go label (add ONLY when ready to merge, run all tests) Feb 18, 2026

gemini-code-assist Bot reviewed Feb 18, 2026

ray-gardener Bot added the community-contribution label (Contributed by the community) Feb 18, 2026

Fix tests

dee5c51

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the autoscaling-ray-data-llm branch from e5d1d2e to dee5c51 Compare February 18, 2026 07:45

kouroshHakha approved these changes Feb 19, 2026

kouroshHakha merged commit 66b2c8b into ray-project:master Feb 19, 2026
6 checks passed

claude Bot added the claude-code-assisted label Feb 20, 2026

kouroshHakha merged 2 commits into ray-project:master from jeffreywang88:autoscaling-ray-data-llm
