[Serve][LLM] Add `topology` field to LLMConfig to support multi-host with TPUs by ryanaoleary · Pull Request #61906 · ray-project/ray

ryanaoleary · 2026-03-20T10:37:28Z

Description

This PR enables robust multi-host TPU deployments for vLLM on Ray Serve. Previously, deploying on TPU slices required users to manually calculate host counts and replicate bundle dictionaries, with no guarantee that in multi-slice environments the deployment would run atomically on a single, co-located slice.

We now utilize Ray Core's native SlicePlacementGroup utility to ensure the created PG is atomically pinned to a co-located TPU slice, mirroring the behavior in Ray Train.

Related issues

#57137

Additional information

Related PR: vllm-project/tpu-inference#1461


gemini-code-assist

Code Review

This pull request adds support for multi-slice TPU deployments with vLLM by leveraging SlicePlacementGroup. The changes are well-structured and include necessary updates to configuration validation, placement group creation logic, and comprehensive tests. I've identified a few areas for improvement to enhance code quality, maintainability, and correctness. My main feedback is to utilize a newly added property to avoid logic duplication and inconsistency. I've also included suggestions for refactoring and improving type hints.


ryanaoleary · 2026-03-20T22:16:13Z

cc: @abrarsheikh @eicherseiji @lorriexingfang

This is a follow-up to vllm-project/tpu-inference#1461, which was merged to support Ray with vllm-tpu. Currently, if users run their Ray deployment in a multi-slice environment (i.e. the Ray cluster contains multiple multi-host slices), the placement group created for a vLLM deployment is constructed using only the TPU resource request, so it can be spread across multiple slices. The workload then times out, because the framework requires all hosts in a physical slice to run the same code. In a single-slice environment the status quo works, which we verified when testing the linked PR.

This PR solves the multi-slice issue by using the ray.io/tpu-slice-name label, a unique identifier that's injected when the Pod is created, to construct a placement group that spans only the co-located TPUs on a slice. This is very similar to the support we already added to Ray Train (ref). If users specify an accelerator type and a topology in their placement group config, we call the TPU utility to create a SlicePlacementGroup and pass that to the vLLM engine instead.
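To illustrate the co-location idea described above, here is a minimal sketch in plain Python. It does not call any Ray API: the node dictionaries, `group_hosts_by_slice`, and `pick_slice` are hypothetical stand-ins, and only the `ray.io/tpu-slice-name` label key comes from this PR. It shows why grouping hosts by that label keeps a deployment from straddling two slices.

```python
from collections import defaultdict

# Label key named in this PR; everything else below is illustrative.
SLICE_LABEL = "ray.io/tpu-slice-name"


def group_hosts_by_slice(nodes):
    """Map slice name -> node IDs that carry that slice label."""
    slices = defaultdict(list)
    for node in nodes:
        slice_name = node.get("labels", {}).get(SLICE_LABEL)
        if slice_name is not None:  # skip CPU-only / unlabeled nodes
            slices[slice_name].append(node["node_id"])
    return dict(slices)


def pick_slice(nodes, hosts_needed):
    """Return (slice_name, node_ids) for one slice with enough hosts, else None."""
    for slice_name, node_ids in sorted(group_hosts_by_slice(nodes).items()):
        if len(node_ids) >= hosts_needed:
            return slice_name, node_ids[:hosts_needed]
    return None


# Hypothetical cluster: two 2-host slices plus one CPU node.
nodes = [
    {"node_id": "a1", "labels": {SLICE_LABEL: "slice-a"}},
    {"node_id": "b1", "labels": {SLICE_LABEL: "slice-b"}},
    {"node_id": "a2", "labels": {SLICE_LABEL: "slice-a"}},
    {"node_id": "b2", "labels": {SLICE_LABEL: "slice-b"}},
    {"node_id": "cpu", "labels": {}},
]
chosen = pick_slice(nodes, hosts_needed=2)
```

A request based only on TPU resources could land on `a1` and `b1`; selecting by slice label returns hosts from a single slice or nothing at all.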


ryanaoleary · 2026-03-21T02:40:49Z

Previously, to get this working on a single multi-host TPU slice, we had to do something like:

llm_config = LLMConfig(
    accelerator_type="TPU-V6E",
    model_loading_config={
        "model_id": config["model_id"],
        "model_source": config["model_source"],
    },
    engine_kwargs=tpu_engine_config,
    placement_group_config=dict(
        bundles=[{"TPU": config["tpu_chips"], "GPU": 0}] * config["num_hosts"],
        strategy="SPREAD",
    ),
    runtime_env={
        "env_vars": {
            "VLLM_USE_V1": "1",
            "JAX_PLATFORMS": "",
            "TPU_BACKEND_TYPE": "jax",
            "TPU_MULTIHOST_BACKEND": "ray",
            "RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO": "0",
            "HUGGING_FACE_HUB_TOKEN": os.environ.get("HF_TOKEN", ""),
        }
    },
)

The above only works in single-slice clusters; with this PR we also support multi-slice clusters. The new API will look like:

llm_config = LLMConfig(
    accelerator_type="TPU-V6E",
    topology="4x4",
    model_loading_config={
        "model_id": config["model_id"],
        "model_source": config["model_source"],
    },
    engine_kwargs=tpu_engine_config,
    runtime_env={
        "env_vars": {
            "VLLM_USE_V1": "1",
            "JAX_PLATFORMS": "",
            "TPU_BACKEND_TYPE": "jax",
            "TPU_MULTIHOST_BACKEND": "ray",
            "RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO": "0",
            "HUGGING_FACE_HUB_TOKEN": os.environ.get("HF_TOKEN", ""),
        }
    },
)

Typically we request 1 TPU chip per bundle. To have each bundle consume a whole host instead, we'd add:

placement_group_config=dict(
    bundles=[{"TPU": 4}],
    strategy="SPREAD",
),

The number of bundles is multiplied automatically to fill the whole topology. All TPU bundles have to request the same number of resources.
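As a rough illustration of that multiplication, the sketch below derives the bundle list from a topology string. `chips_in_topology` and `expand_bundles` are hypothetical helpers, not functions from this PR, and the sketch assumes the chip count of a slice is the product of the topology dimensions.

```python
import math


def chips_in_topology(topology: str) -> int:
    """Assumed rule: total chips = product of the dimensions, e.g. "4x4" -> 16."""
    return math.prod(int(dim) for dim in topology.lower().split("x"))


def expand_bundles(topology: str, bundle: dict) -> list:
    """Repeat one uniform bundle until the whole topology is covered."""
    chips_per_bundle = bundle["TPU"]
    total_chips = chips_in_topology(topology)
    if chips_per_bundle <= 0 or total_chips % chips_per_bundle != 0:
        raise ValueError(
            f"{chips_per_bundle} chips per bundle does not evenly divide "
            f"the {total_chips} chips in topology {topology}"
        )
    # Every bundle is an identical copy, matching the uniform-bundle constraint.
    return [dict(bundle) for _ in range(total_chips // chips_per_bundle)]


per_host = expand_bundles("4x4", {"TPU": 4})  # one bundle per 4-chip host
per_chip = expand_bundles("4x4", {"TPU": 1})  # one bundle per chip
```

With `topology="4x4"`, a `{"TPU": 4}` bundle yields 4 bundles and a `{"TPU": 1}` bundle yields 16.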

ryanaoleary · 2026-03-21T02:45:42Z

cc: @andrewsykim


kouroshHakha

Hey @ryanaoleary — thanks for the PR, the feature itself makes sense and the SlicePlacementGroup approach is the right one. Before we get into the detailed code feedback (posted as inline comments below), I have two architectural questions I'd like to resolve first:

1. Hardware abstraction pattern

This PR adds TPU-specific branches into 6+ locations across vllm_models.py (use_tpu, placement_bundles, get_initialization_kwargs, get_or_create_pg, _tpu_topology, _create_tpu_placement_group, _tpu_slice_pg_wrapper). The file is already ~20% hardware-specific conditionals for GPU/CPU, and this grows it further. Every new accelerator type would need to touch all the same places.

Before we merge more hardware-specific paths, I'd like to agree on a pattern — something like an AcceleratorBackend strategy where each hardware type owns its own bundle generation, PG creation, and executor backend selection, rather than growing the if/elif chains. This could land as a preceding refactor PR or be incorporated here. What are your thoughts?
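For discussion, one possible shape of such a strategy is sketched below. This is only an assumption about what the pattern could look like, not the interface that was eventually merged: the class names, `resource_name`, `default_bundles`, and `get_backend` are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class AcceleratorBackend(ABC):
    """Hypothetical interface: each hardware family owns its own bundle logic."""

    def __init__(self, topology: Optional[str] = None):
        self.topology = topology

    @property
    @abstractmethod
    def resource_name(self) -> str:
        """Ray resource key requested in each bundle."""

    def default_bundles(self, num_devices: int) -> List[Dict[str, int]]:
        """One device per bundle by default; subclasses may override."""
        return [{self.resource_name: 1} for _ in range(num_devices)]


class GPUBackend(AcceleratorBackend):
    @property
    def resource_name(self) -> str:
        return "GPU"


class TPUBackend(AcceleratorBackend):
    @property
    def resource_name(self) -> str:
        return "TPU"


def get_backend(
    accelerator_type: Optional[str], topology: Optional[str] = None
) -> AcceleratorBackend:
    """Single dispatch point replacing scattered if/elif hardware checks."""
    if accelerator_type and accelerator_type.upper().startswith("TPU"):
        return TPUBackend(topology)
    return GPUBackend(topology)


tpu_bundles = get_backend("TPU-V6E", "4x4").default_bundles(2)
```

Under this layout, supporting a new accelerator would mean adding one subclass and one dispatch branch rather than touching every call site.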

2. CI plan for TPU tests

The new tests are tagged "gpu" in BUILD.bazel, but there's no TPU CI step in llm.rayci.yml and no tpu instance type in Buildkite. As-is these will either get picked up by GPU CI and fail (no TPU hardware), or never run.

- Are you running these somewhere in Google-owned CI today?
- Is there a plan to add TPU test infrastructure to Ray's Buildkite?
- How do we ensure we don't regress on TPU support going forward?

At minimum the BUILD tag needs to change from "gpu" to something that won't break existing GPU CI.

Note

This review was co-written with AI assistance (Claude Code).


ryanaoleary · 2026-04-25T01:04:57Z

@kouroshHakha thanks for all the reviews. I believe all outstanding comments have been resolved at this point.


kouroshHakha

LGTM, thanks for your contribution @ryanaoleary

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8b4f017.

cursor · 2026-04-25T01:32:02Z

            )

            logger.info(f"Using new placement group {pg}. {placement_group_table(pg)}")


SlicePlacementGroup is unreachable in Serve deployments

Medium Severity

In Ray Serve deployments, LLMServer.get_deployment_options uses engine_config.placement_bundles to set placement_group_bundles for the deployment. Serve creates a regular placement group from those bundles, and the replica runs inside it. When get_or_create_pg is called from within the replica, get_current_placement_group() finds the Serve-managed PG and returns it immediately — so TPUAccelerator.create_placement_group (which calls slice_placement_group) is never reached. This means the SlicePlacementGroup co-location guarantee, the primary goal of this PR for multi-slice environments, is bypassed in Serve deployments. The placement_bundles property also doesn't account for topology when generating default bundles (it uses num_devices from TP×PP), creating a mismatch with what slice_placement_group would compute from the topology.
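The short-circuit described in this comment can be reproduced with a simplified stand-in. `get_or_create_pg_sketch` below is hypothetical and takes the current placement group as an argument instead of calling Ray; it only mirrors the control flow, not the real `get_or_create_pg` signature.

```python
def get_or_create_pg_sketch(current_pg, create_slice_pg):
    """Return the PG the worker already runs in; otherwise create a new one."""
    if current_pg is not None:
        # Inside a Serve replica this branch always wins.
        return current_pg
    return create_slice_pg()


calls = []


def create_slice_pg():
    """Stand-in for the backend-specific slice reservation."""
    calls.append("slice")
    return "slice-pg"


inside_replica = get_or_create_pg_sketch("serve-managed-pg", create_slice_pg)
standalone = get_or_create_pg_sketch(None, create_slice_pg)
```

Only the standalone call reaches `create_slice_pg`; the replica call returns the Serve-managed group without any slice reservation.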

Additional Locations (1)

python/ray/llm/_internal/serve/engines/vllm/vllm_models.py#L249-L281


I think this is valid actually and I should address it in a follow-up. This PR enables creating the configs directly with the new API, but did not update the Serve deployment code to utilize the placement group created by the backend. Since the replica creates its own PG before allowing the config to do so, we skip the SlicePlacementGroup code.

We just need to modify LLMServer.get_deployment_options to pop placement_group_bundles and placement_group_strategy when it's a TPU deployment with topology set because the PG lifecycle is managed by the config, not the deployment.
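A minimal sketch of that change, assuming the deployment options are a plain dict: `strip_pg_options` is a hypothetical helper, not the code in the follow-up PR, and the TPU check by string prefix is an assumption for illustration.

```python
def strip_pg_options(deployment_options: dict, accelerator_type, topology) -> dict:
    """Drop Serve-level PG settings when the config will create the PG itself."""
    options = dict(deployment_options)  # copy so the caller's dict is untouched
    is_tpu = bool(accelerator_type) and accelerator_type.upper().startswith("TPU")
    if is_tpu and topology:
        options.pop("placement_group_bundles", None)
        options.pop("placement_group_strategy", None)
    return options


base = {
    "name": "llm",
    "placement_group_bundles": [{"TPU": 4}],
    "placement_group_strategy": "SPREAD",
}
tpu_options = strip_pg_options(base, "TPU-V6E", "4x4")
gpu_options = strip_pg_options(base, None, None)
```

With the PG keys removed, Serve no longer creates a regular placement group for the replica, so the slice reservation path is no longer bypassed.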

Created follow-up PR here: #62941

…with TPUs (ray-project#61906) Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>

## Description

This PR addresses a TODO comment from #61906 to replace `GPUType` across files with `AcceleratorType`. We currently alias the type and made the switch to the latter since multiple accelerator types (GPU, TPU, etc.) are now supported with Ray LLM through extensible `AcceleratorBackend`s.

## Related issues

## Additional information

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

…ckend (#62941)

## Description

This PR enables the `SlicePlacementGroup` reservation logic for TPU accelerator backends to be called in Ray Serve when creating a Serve deployment. This is a follow-up to #61906 which added the new accelerator config logic and `topology` API to Serve `LLMConfig` and `VLLMEngine`.

Specific Changes:

- Added `requires_deferred_placement_group` property to AcceleratorBackend interface, which returns True for TPU backend when `topology` is set. This is used to check whether to set `pg_bundles` in `LLMServer` which are used to create a standard Ray PG. This results in the `LLMConfig` calling our backend-specific reservation logic when the Serve deployment starts.
- Added unit tests and end-to-end mocked deployment tests to verify the two-step PG reservation works properly for Serve deployments

## Related issues

#57137

## Additional information

This comment describes the rationale for this PR in more detail: #61906 (comment)

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>


ryanaoleary and others added 2 commits March 20, 2026 10:35

[Serve] Scheduling using SlicePlacementGroup for vLLM deployments wit…

7881e46

…h topology Signed-off-by: ryanaoleary <ryanaoleary@google.com>

Merge branch 'master' into ray-serve-slice-pg

7c3a7d8

gemini-code-assist Bot reviewed Mar 20, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

ryanaoleary and others added 4 commits March 20, 2026 03:44

Update python/ray/llm/_internal/serve/engines/vllm/vllm_models.py

417e298

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>

Update python/ray/llm/_internal/serve/engines/vllm/vllm_models.py

12fad91

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>

Update python/ray/llm/_internal/serve/engines/vllm/vllm_models.py

2e478af

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>

lint

ff43ead

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

ryanaoleary changed the title ~~[Serve] Scheduling using SlicePlacementGroup for vLLM deployments wit…~~ Mar 20, 2026

ryanaoleary marked this pull request as ready for review March 20, 2026 22:08

ryanaoleary requested a review from a team as a code owner March 20, 2026 22:08

cursor Bot reviewed Mar 20, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py

ray-gardener Bot added serve Ray Serve Related Issue llm community-contribution Contributed by the community labels Mar 21, 2026

Move topology to top-level field

19193b1

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

ryanaoleary changed the title ~~[Serve] Scheduling using SlicePlacementGroup for vLLM deployments when topology is set for multi-host groups~~ Mar 21, 2026

Add topology to LLMConfig

b4d0685

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

cursor Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

Small fix to get requested topology correctly

373ed06

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

cursor Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated

ryanaoleary added 2 commits March 21, 2026 07:39

Save reference to Slice PG so we can garbage collect

746c5da

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

fix GPU legacy default behavior for TPUs

4c8fbe5

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

cursor Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py

Ensure backend set for TPU mode is ray

df09350

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

cursor Bot reviewed Mar 22, 2026

View reviewed changes

Comment thread python/ray/llm/tests/BUILD.bazel

kouroshHakha reviewed Mar 23, 2026

View reviewed changes

ryanaoleary added 5 commits April 24, 2026 23:36

change remote init to abstract method

ae7e844

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

remove unnecessary getattr and make more readable

1f58bdd

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

Move accelerator type formatting logic all to accelerators.py

8e5f92b

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

Add conftest file for TPU tests

6b9f67f

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

update validation tests

5e661e0

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

cursor Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py

move test, renmae, and fix key error in default_bundles

f3efc20

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

Rename function

de13bf2

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

cursor Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py

ensure old slice pg is shutdown before new one created

8b4f017

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

kouroshHakha approved these changes Apr 25, 2026

View reviewed changes

kouroshHakha changed the title ~~[Serve] Add topology field to LLMConfig to support multi-host with TPUs~~ Apr 25, 2026

kouroshHakha enabled auto-merge (squash) April 25, 2026 01:31

cursor Bot reviewed Apr 25, 2026

View reviewed changes

kouroshHakha merged commit 2674b71 into ray-project:master Apr 25, 2026
7 checks passed

ryanaoleary mentioned this pull request Apr 25, 2026

[Serve] Defer PG creation for TPU Serve deployments to accelerator backend #62941

Merged

ryanaoleary mentioned this pull request Apr 27, 2026

[LLM] Replace GPUType alias with AcceleratorType #62978

Merged

kouroshHakha merged 60 commits into ray-project:master from ryanaoleary:ray-serve-slice-pg

4 participants
