[Serve][LLM] Add topology field to LLMConfig to support multi-host with TPUs#61906
Conversation
…h topology Signed-off-by: ryanaoleary <ryanaoleary@google.com>
There was a problem hiding this comment.
Code Review
This pull request adds support for multi-slice TPU deployments with vLLM by leveraging SlicePlacementGroup. The changes are well-structured and include necessary updates to configuration validation, placement group creation logic, and comprehensive tests. I've identified a few areas for improvement to enhance code quality, maintainability, and correctness. My main feedback is to utilize a newly added property to avoid logic duplication and inconsistency. I've also included suggestions for refactoring and improving type hints.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
|
cc: @abrarsheikh @eicherseiji @lorriexingfang This is a follow-up to vllm-project/tpu-inference#1461 which was merged to support Ray with vllm-tpu. Currently, if users run their Ray deployment in a multi-slice environment (i.e. they might have multiple multi-host slices in their Ray cluster), the placement group that is created for a vLLM deployment is only constructed using the TPU resource request, which can be spread across multiple slices. This will result in the workload timing out since it's required by the framework that all the hosts in the physical slice run the same code. In a single-slice environment the status-quo works, which we verified when testing the linked PR. This PR solves the multi-slice issue by utilizing the |
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
|
Previously to get it to work with a single-slice, multi-host TPU slice we'd have to do something like: The above only supports single-slice, but with this PR we support multi-slice. The new API will look like: Typically we request 1 TPU chip per bundle, to specify consuming the whole host per bundle we'd add: The number of bundles is multiplied to fill the whole topology automatically. TPU bundles have to request the same # of resources per bundle. |
topology is set for multi-host groups|
cc: @andrewsykim |
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
kouroshHakha
left a comment
There was a problem hiding this comment.
Hey @ryanaoleary — thanks for the PR, the feature itself makes sense and the SlicePlacementGroup approach is the right one. Before we get into the detailed code feedback (posted as inline comments below), I have two architectural questions I'd like to resolve first:
1. Hardware abstraction pattern
This PR adds TPU-specific branches into 6+ locations across vllm_models.py (use_tpu, placement_bundles, get_initialization_kwargs, get_or_create_pg, _tpu_topology, _create_tpu_placement_group, _tpu_slice_pg_wrapper). The file is already ~20% hardware-specific conditionals for GPU/CPU, and this grows it further. Every new accelerator type would need to touch all the same places.
Before we merge more hardware-specific paths, I'd like to agree on a pattern — something like an AcceleratorBackend strategy where each hardware type owns its own bundle generation, PG creation, and executor backend selection, rather than growing the if/elif chains. This could land as a preceding refactor PR or be incorporated here. What are your thoughts?
2. CI plan for TPU tests
The new tests are tagged "gpu" in BUILD.bazel, but there's no TPU CI step in llm.rayci.yml and no tpu instance type in Buildkite. As-is these will either get picked up by GPU CI and fail (no TPU hardware), or never run.
- Are you running these somewhere in Google-owned CI today?
- Is there a plan to add TPU test infrastructure to Ray's Buildkite?
- How do we ensure we don't regress on TPU support going forward?
At minimum the BUILD tag needs to change from "gpu" to something that won't break existing GPU CI.
Note
This review was co-written with AI assistance (Claude Code).
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
|
@kouroshHakha thanks for all the reviews, I believe all outstanding comments have been resolved at this point |
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
kouroshHakha
left a comment
There was a problem hiding this comment.
LGTM, thanks for your contribution @ryanaoleary
topology field to LLMConfig to support multi-host with TPUsThere was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8b4f017. Configure here.
| ) | ||
|
|
||
| logger.info(f"Using new placement group {pg}. {placement_group_table(pg)}") | ||
|
|
There was a problem hiding this comment.
SlicePlacementGroup is unreachable in Serve deployments
Medium Severity
In Ray Serve deployments, LLMServer.get_deployment_options uses engine_config.placement_bundles to set placement_group_bundles for the deployment. Serve creates a regular placement group from those bundles, and the replica runs inside it. When get_or_create_pg is called from within the replica, get_current_placement_group() finds the Serve-managed PG and returns it immediately — so TPUAccelerator.create_placement_group (which calls slice_placement_group) is never reached. This means the SlicePlacementGroup co-location guarantee, the primary goal of this PR for multi-slice environments, is bypassed in Serve deployments. The placement_bundles property also doesn't account for topology when generating default bundles (it uses num_devices from TP×PP), creating a mismatch with what slice_placement_group would compute from the topology.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8b4f017. Configure here.
There was a problem hiding this comment.
I think this is valid actually and I should address it in a follow-up. This PR enables creating the configs directly with the new API, but did not update the Serve deployment code to utilize the placement group created by the backend. Since the replica creates its own PG before allowing the config to do so, we skip the SlicePlacementGroup code.
We just need to modify LLMServer.get_deployment_options to pop placement_group_bundles and placement_group_strategy when it's a TPU deployment with topology set because the PG lifecycle is managed by the config, not the deployment.
…with TPUs (ray-project#61906) Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
## Description This PR addresses a TODO comment from #61906 to replace `GPUType` across files with `AcceleratorType`. We currently alias the type and made the switch to the latter since multiple accelerator types (GPU, TPU, etc.) are now supported with Ray LLM through extensible `AcceleratorBackend`s. ## Related issues ## Additional information Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…ckend (#62941) ## Description This PR enables the `SlicePlacementGroup` reservation logic for TPU accelerator backends to be called in Ray Serve when creating a Serve deployment. This is a follow-up to #61906 which added the new accelerator config logic and `topology` API to Serve `LLMConfig` and `VLLMEngine`. Specific Changes: - Added `requires_deferred_placement_group` property to AcceleratorBackend interface, which returns True for TPU backend when `topology` is set. This is used to check whether to set `pg_bundles` in `LLMServer` which are used to create a standard Ray PG. This results in the `LLMConfig` calling our backend-specific reservation logic when the Serve deployment starts. - Added unit tests and end-to-end mocked deployment tests to verify the two-step PG reservation works properly for Serve deployments ## Related issues #57137 ## Additional information This comment describes the rationale for this PR in more detail: #61906 (comment) --------- Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…with TPUs (ray-project#61906) Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description This PR addresses a TODO comment from ray-project#61906 to replace `GPUType` across files with `AcceleratorType`. We currently alias the type and made the switch to the latter since multiple accelerator types (GPU, TPU, etc.) are now supported with Ray LLM through extensible `AcceleratorBackend`s. ## Related issues ## Additional information Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…ckend (ray-project#62941) ## Description This PR enables the `SlicePlacementGroup` reservation logic for TPU accelerator backends to be called in Ray Serve when creating a Serve deployment. This is a follow-up to ray-project#61906 which added the new accelerator config logic and `topology` API to Serve `LLMConfig` and `VLLMEngine`. Specific Changes: - Added `requires_deferred_placement_group` property to AcceleratorBackend interface, which returns True for TPU backend when `topology` is set. This is used to check whether to set `pg_bundles` in `LLMServer` which are used to create a standard Ray PG. This results in the `LLMConfig` calling our backend-specific reservation logic when the Serve deployment starts. - Added unit tests and end-to-end mocked deployment tests to verify the two-step PG reservation works properly for Serve deployments ## Related issues ray-project#57137 ## Additional information This comment describes the rationale for this PR in more detail: ray-project#61906 (comment) --------- Signed-off-by: ryanaoleary <ryanaoleary@google.com>


Description
This PR enables robust multi-host TPU deployments for vLLM on Ray Serve. Previously, deploying on TPU slices required users to manually calculate host counts and replicate bundle dictionaries - with no guarantee that in multi-slice environments their deployment would run atomically on one, co-located slice..
We now utilize Ray Core's native
SlicePlacementGrouputility to ensure the created PG is atomically pinned to a co-located TPU slice, mirroring the behavior in Ray Train.Related issues
#57137
Additional information
Related PR: vllm-project/tpu-inference#1461