Skip to content

[Serve][LLM] Add topology field to LLMConfig to support multi-host with TPUs#61906

Merged
kouroshHakha merged 60 commits into
ray-project:masterfrom
ryanaoleary:ray-serve-slice-pg
Apr 25, 2026
Merged

[Serve][LLM] Add topology field to LLMConfig to support multi-host with TPUs#61906
kouroshHakha merged 60 commits into
ray-project:masterfrom
ryanaoleary:ray-serve-slice-pg

Conversation

@ryanaoleary

@ryanaoleary ryanaoleary commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Description

This PR enables robust multi-host TPU deployments for vLLM on Ray Serve. Previously, deploying on TPU slices required users to manually calculate host counts and replicate bundle dictionaries - with no guarantee that in multi-slice environments their deployment would run atomically on one, co-located slice..

We now utilize Ray Core's native SlicePlacementGroup utility to ensure the created PG is atomically pinned to a co-located TPU slice, mirroring the behavior in Ray Train.

Related issues

#57137

Additional information

Related PR: vllm-project/tpu-inference#1461

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for multi-slice TPU deployments with vLLM by leveraging SlicePlacementGroup. The changes are well-structured and include necessary updates to configuration validation, placement group creation logic, and comprehensive tests. I've identified a few areas for improvement to enhance code quality, maintainability, and correctness. My main feedback is to utilize a newly added property to avoid logic duplication and inconsistency. I've also included suggestions for refactoring and improving type hints.

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
ryanaoleary and others added 4 commits March 20, 2026 03:44
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary changed the title [Serve] Scheduling using SlicePlacementGroup for vLLM deployments wit… Mar 20, 2026
@ryanaoleary ryanaoleary marked this pull request as ready for review March 20, 2026 22:08
@ryanaoleary ryanaoleary requested a review from a team as a code owner March 20, 2026 22:08
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py
@ryanaoleary

Copy link
Copy Markdown
Contributor Author

cc: @abrarsheikh @eicherseiji @lorriexingfang

This is a follow-up to vllm-project/tpu-inference#1461 which was merged to support Ray with vllm-tpu. Currently, if users run their Ray deployment in a multi-slice environment (i.e. they might have multiple multi-host slices in their Ray cluster), the placement group that is created for a vLLM deployment is only constructed using the TPU resource request, which can be spread across multiple slices. This will result in the workload timing out since it's required by the framework that all the hosts in the physical slice run the same code. In a single-slice environment the status-quo works, which we verified when testing the linked PR.

This PR solves the multi-slice issue by utilizing the ray.io/tpu-slice-name a unique identifier that's injected when the Pod is created, to construct a placement group that spans only the co-located TPUs on a slice. This is very similar to support we already added to Ray Train (ref). If users specify an accelerator type and a topology in their placement group config, we call the TPU utility to create a SlicePlacementGroup and pass that to the vLLM engine instead.

@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue llm community-contribution Contributed by the community labels Mar 21, 2026
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary

ryanaoleary commented Mar 21, 2026

Copy link
Copy Markdown
Contributor Author

Previously to get it to work with a single-slice, multi-host TPU slice we'd have to do something like:

llm_config = LLMConfig(
    accelerator_type="TPU-V6E",
    model_loading_config={
        "model_id": config["model_id"],
        "model_source": config["model_source"],
    },
    engine_kwargs=tpu_engine_config,
    placement_group_config=dict(
        bundles=[{"TPU": config["tpu_chips"], "GPU": 0}] * config["num_hosts"],
        strategy="SPREAD", 
    ),
    runtime_env={
        "env_vars": {
            "VLLM_USE_V1": "1",
            "JAX_PLATFORMS": "",
            "TPU_BACKEND_TYPE": "jax",
            "TPU_MULTIHOST_BACKEND": "ray",
            "RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO": "0",
            "HUGGING_FACE_HUB_TOKEN": os.environ.get("HF_TOKEN", ""),
        }
    },
)

The above only supports single-slice, but with this PR we support multi-slice. The new API will look like:

llm_config = LLMConfig(
    accelerator_type="TPU-V6E",
    topology="4x4",
    model_loading_config={
        "model_id": config["model_id"],
        "model_source": config["model_source"],
    },
    engine_kwargs=tpu_engine_config,
    runtime_env={
        "env_vars": {
            "VLLM_USE_V1": "1",
            "JAX_PLATFORMS": "",
            "TPU_BACKEND_TYPE": "jax",
            "TPU_MULTIHOST_BACKEND": "ray",
            "RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO": "0",
            "HUGGING_FACE_HUB_TOKEN": os.environ.get("HF_TOKEN", ""),
        }
    },
)

Typically we request 1 TPU chip per bundle, to specify consuming the whole host per bundle we'd add:

placement_group_config=dict(
        bundles=[{"TPU": 4}],
        strategy="SPREAD", 
    ),

The number of bundles is multiplied to fill the whole topology automatically. TPU bundles have to request the same # of resources per bundle.

@ryanaoleary ryanaoleary changed the title [Serve] Scheduling using SlicePlacementGroup for vLLM deployments when topology is set for multi-host groups Mar 21, 2026
@ryanaoleary

Copy link
Copy Markdown
Contributor Author
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Comment thread python/ray/llm/tests/BUILD.bazel

@kouroshHakha kouroshHakha left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ryanaoleary — thanks for the PR, the feature itself makes sense and the SlicePlacementGroup approach is the right one. Before we get into the detailed code feedback (posted as inline comments below), I have two architectural questions I'd like to resolve first:

1. Hardware abstraction pattern

This PR adds TPU-specific branches into 6+ locations across vllm_models.py (use_tpu, placement_bundles, get_initialization_kwargs, get_or_create_pg, _tpu_topology, _create_tpu_placement_group, _tpu_slice_pg_wrapper). The file is already ~20% hardware-specific conditionals for GPU/CPU, and this grows it further. Every new accelerator type would need to touch all the same places.

Before we merge more hardware-specific paths, I'd like to agree on a pattern — something like an AcceleratorBackend strategy where each hardware type owns its own bundle generation, PG creation, and executor backend selection, rather than growing the if/elif chains. This could land as a preceding refactor PR or be incorporated here. What are your thoughts?

2. CI plan for TPU tests

The new tests are tagged "gpu" in BUILD.bazel, but there's no TPU CI step in llm.rayci.yml and no tpu instance type in Buildkite. As-is these will either get picked up by GPU CI and fail (no TPU hardware), or never run.

  • Are you running these somewhere in Google-owned CI today?
  • Is there a plan to add TPU test infrastructure to Ray's Buildkite?
  • How do we ensure we don't regress on TPU support going forward?

At minimum the BUILD tag needs to change from "gpu" to something that won't break existing GPU CI.


Note

This review was co-written with AI assistance (Claude Code).

Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py
Comment thread python/ray/llm/_internal/serve/engines/vllm/vllm_models.py Outdated
Comment thread python/ray/llm/tests/BUILD.bazel
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary

Copy link
Copy Markdown
Contributor Author

@kouroshHakha thanks for all the reviews, I believe all outstanding comments have been resolved at this point

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py
Signed-off-by: ryanaoleary <ryanaoleary@google.com>

@kouroshHakha kouroshHakha left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your contribution @ryanaoleary

@kouroshHakha kouroshHakha changed the title [Serve] Add topology field to LLMConfig to support multi-host with TPUs Apr 25, 2026
@kouroshHakha kouroshHakha enabled auto-merge (squash) April 25, 2026 01:31

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 8b4f017. Configure here.

)

logger.info(f"Using new placement group {pg}. {placement_group_table(pg)}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SlicePlacementGroup is unreachable in Serve deployments

Medium Severity

In Ray Serve deployments, LLMServer.get_deployment_options uses engine_config.placement_bundles to set placement_group_bundles for the deployment. Serve creates a regular placement group from those bundles, and the replica runs inside it. When get_or_create_pg is called from within the replica, get_current_placement_group() finds the Serve-managed PG and returns it immediately — so TPUAccelerator.create_placement_group (which calls slice_placement_group) is never reached. This means the SlicePlacementGroup co-location guarantee, the primary goal of this PR for multi-slice environments, is bypassed in Serve deployments. The placement_bundles property also doesn't account for topology when generating default bundles (it uses num_devices from TP×PP), creating a mismatch with what slice_placement_group would compute from the topology.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8b4f017. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is valid actually and I should address it in a follow-up. This PR enables creating the configs directly with the new API, but did not update the Serve deployment code to utilize the placement group created by the backend. Since the replica creates its own PG before allowing the config to do so, we skip the SlicePlacementGroup code.

We just need to modify LLMServer.get_deployment_options to pop placement_group_bundles and placement_group_strategy when it's a TPU deployment with topology set because the PG lifecycle is managed by the config, not the deployment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created follow-up PR here: #62941

@kouroshHakha kouroshHakha merged commit 2674b71 into ray-project:master Apr 25, 2026
7 checks passed
pushpavanthar pushed a commit to pushpavanthar/ray that referenced this pull request Apr 25, 2026
…with TPUs (ray-project#61906)

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
ryanaoleary added a commit that referenced this pull request Apr 28, 2026
## Description
This PR addresses a TODO comment from
#61906 to replace `GPUType`
across files with `AcceleratorType`. We currently alias the type and
made the switch to the latter since multiple accelerator types (GPU,
TPU, etc.) are now supported with Ray LLM through extensible
`AcceleratorBackend`s.

## Related issues

## Additional information

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
ryanaoleary added a commit that referenced this pull request Apr 28, 2026
…ckend (#62941)

## Description
This PR enables the `SlicePlacementGroup` reservation logic for TPU
accelerator backends to be called in Ray Serve when creating a Serve
deployment. This is a follow-up to
#61906 which added the new
accelerator config logic and `topology` API to Serve `LLMConfig` and
`VLLMEngine`.

Specific Changes:
- Added `requires_deferred_placement_group` property to
AcceleratorBackend interface, which returns True for TPU backend when
`topology` is set. This is used to check whether to set `pg_bundles` in
`LLMServer` which are used to create a standard Ray PG. This results in
the `LLMConfig` calling our backend-specific reservation logic when the
Serve deployment starts.
- Added unit tests and end-to-end mocked deployment tests to verify the
two-step PG reservation works properly for Serve deployments

## Related issues
#57137

## Additional information
This comment describes the rationale for this PR in more detail:
#61906 (comment)

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…with TPUs (ray-project#61906)

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
## Description
This PR addresses a TODO comment from
ray-project#61906 to replace `GPUType`
across files with `AcceleratorType`. We currently alias the type and
made the switch to the latter since multiple accelerator types (GPU,
TPU, etc.) are now supported with Ray LLM through extensible
`AcceleratorBackend`s.

## Related issues

## Additional information

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ckend (ray-project#62941)

## Description
This PR enables the `SlicePlacementGroup` reservation logic for TPU
accelerator backends to be called in Ray Serve when creating a Serve
deployment. This is a follow-up to
ray-project#61906 which added the new
accelerator config logic and `topology` API to Serve `LLMConfig` and
`VLLMEngine`.

Specific Changes:
- Added `requires_deferred_placement_group` property to
AcceleratorBackend interface, which returns True for TPU backend when
`topology` is set. This is used to check whether to set `pg_bundles` in
`LLMServer` which are used to create a standard Ray PG. This results in
the `LLMConfig` calling our backend-specific reservation logic when the
Serve deployment starts.
- Added unit tests and end-to-end mocked deployment tests to verify the
two-step PG reservation works properly for Serve deployments

## Related issues
ray-project#57137

## Additional information
This comment describes the rationale for this PR in more detail:
ray-project#61906 (comment)

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests llm serve Ray Serve Related Issue

4 participants