Stop setting CUDA_VISIBLE_DEVICES internally in vLLM, add device_ids arg #45026
Stop hiding GPUs from vLLM worker processes via CUDA_VISIBLE_DEVICES.
Instead, workers address GPUs by their real physical device IDs
(e.g., torch.device("cuda:2") directly). This enables correct
GPU-to-NIC mapping for RDMA/NIXL transfers and fixes DeepGEMM
MegaMoE initialization in multi-GPU scenarios.
Core mechanism: executors populate assigned_gpu_ids on ParallelConfig
with physical GPU IDs. Workers index into this list with local_rank
to resolve their device. A process-level mapping in the platform
interface ensures device_id_to_physical_device_id() works without CVD.
Also fixes a correctness bug in the MLA CUTLASS kernel that hardcoded
device 0 for SM count queries regardless of the worker's actual device.
Co-authored-by: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Pop it in worker_base.init_worker (same pattern as shared_worker_lock) so the shared config stays immutable across workers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Allows specifying physical GPU IDs without CUDA_VISIBLE_DEVICES, e.g. `vllm serve model --device-ids 2,3,5,7`. This preserves full GPU topology visibility needed for GPU-NIC affinity (RDMA/NIXL) and DeepGEMM MegaMoE initialization.

Composes with CUDA_VISIBLE_DEVICES: when both are set, --device-ids values are interpreted as indices into the CVD-visible set. E.g. CVD=4,5,6,7 --device-ids 0,1 uses physical GPUs 4 and 5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
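The composition rule described in this commit can be sketched as follows. This is a minimal illustration, not the PR's implementation: `compose_device_ids` is a hypothetical helper name, and it assumes CUDA_VISIBLE_DEVICES holds a non-empty comma-separated list of integer ordinals (UUID-style entries are not handled).

```python
from typing import Optional


def compose_device_ids(
    device_ids: list[int],
    cuda_visible_devices: Optional[str] = None,
) -> list[int]:
    """Map --device-ids values to physical GPU IDs (hypothetical helper).

    Assumption: when set, cuda_visible_devices is a comma-separated list
    of integer ordinals such as "4,5,6,7".
    """
    if cuda_visible_devices is None:
        # CVD unset: the values are already physical GPU IDs.
        return list(device_ids)

    visible = [int(tok) for tok in cuda_visible_devices.split(",")]
    resolved = []
    for idx in device_ids:
        # Bounds check: each value must index into the CVD-visible set.
        if not 0 <= idx < len(visible):
            raise ValueError(
                f"--device-ids value {idx} is out of range for "
                f"{len(visible)} visible device(s)"
            )
        resolved.append(visible[idx])
    return resolved


# The example from the commit message: CVD=4,5,6,7 --device-ids 0,1
with_cvd = compose_device_ids([0, 1], "4,5,6,7")
# Without CVD, the IDs pass through unchanged.
without_cvd = compose_device_ids([2, 3, 5, 7])
```

Here `with_cvd` resolves to physical GPUs 4 and 5, matching the example above, while `without_cvd` keeps the IDs as given.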
Workers use logical CUDA indices for torch.device (CVD stays set for process isolation in multi-instance deployments). assigned_gpu_ids is purely a topology lookup table for NIC affinity and P2P checks.

- Stop popping CUDA_VISIBLE_DEVICES in multiproc_executor
- Workers always use local_rank for torch.device, not physical IDs
- Replace inline physical-device-ID logic in custom_all_reduce and quick_all_reduce with device_id_to_physical_device_id()
- Remove device_index param threading through parallel_state
- Convert set_device_control_env_var from context manager to plain fn
- Add bounds check for --device-ids vs CVD composition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
local_rank is the logical rank within the node (for IPC, shared memory, port offsets). device_index is the physical CUDA device ordinal, computed from assigned_gpu_ids[local_rank] when --device-ids is used. Without this separation, all processes on a node use cuda:0 when CVD is not set.

- Add device_index param to GroupCoordinator, StatelessGroupCoordinator, init_world_group, init_model_parallel_group, init_distributed_environment, _init_process_group_for_split_group, _init_elastic_ep_world
- Store device_index on world group; propagate to all parallel groups
- KV connectors use device_index for device selection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
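A rough sketch of the separation this commit message describes, under stated assumptions: `WorkerPlacement`, `place_workers`, and `BASE_PORT` are hypothetical names used only for illustration, and the port arithmetic stands in for whatever per-rank offsets the real code uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPlacement:
    """Hypothetical record keeping the two concepts apart."""

    local_rank: int    # logical rank within the node (IPC, ports)
    device_index: int  # physical CUDA device ordinal


def place_workers(assigned_gpu_ids: list[int]) -> list[WorkerPlacement]:
    # device_index is looked up as assigned_gpu_ids[local_rank].
    return [
        WorkerPlacement(local_rank=rank, device_index=gpu_id)
        for rank, gpu_id in enumerate(assigned_gpu_ids)
    ]


BASE_PORT = 29500  # hypothetical base port, for illustration only

placements = place_workers([2, 3, 5, 7])
# Node-local resources are keyed by local_rank...
ports = [BASE_PORT + p.local_rank for p in placements]
# ...while the device string is built from device_index.
devices = [f"cuda:{p.device_index}" for p in placements]
```

Keeping both fields on one record makes it harder to pass a logical rank where a physical ordinal is expected.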
Instead of threading device_index through every init_model_parallel_group and init_stateless_group call, child GroupCoordinators now read it directly from the world group. Only init_world_group needs the param. Assert that the world group exists when creating child groups rather than silently falling back to local_rank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
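The pattern in this commit, where only the world group receives device_index and child groups inherit it, might look roughly like the sketch below. All names ending in `_sketch` or `Sketch` are hypothetical stand-ins and do not reflect the real GroupCoordinator signatures.

```python
from typing import Optional


class GroupSketch:
    """Hypothetical stand-in for a group coordinator."""

    def __init__(self, name: str, device_index: int) -> None:
        self.name = name
        self.device_index = device_index


_world: Optional[GroupSketch] = None


def init_world_group_sketch(device_index: int) -> GroupSketch:
    # Only the world group takes device_index as a parameter.
    global _world
    _world = GroupSketch("world", device_index)
    return _world


def init_child_group_sketch(name: str) -> GroupSketch:
    # Fail loudly instead of silently falling back to local_rank.
    assert _world is not None, "world group must be initialized first"
    return GroupSketch(name, _world.device_index)


# Creating a child group before the world group trips the assertion.
try:
    init_child_group_sketch("tp")
    failed_early = False
except AssertionError:
    failed_early = True

init_world_group_sketch(device_index=5)
tp_group = init_child_group_sketch("tp")
```

The assertion trades a silent fallback for an early, explicit failure, which is the behavior the commit message asks for.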
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

# Conflicts:
# vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py
This pull request has merge conflicts that must be resolved before it can be

@kouroshHakha sure, holding off for now

in particular, this test failed with the following error:

ok so we replaced Plausibly it was passing before because:
I think the question is why is

@tlrmchlsmth I think we should just remove the assertion. I applied this patch and reran the failed tests locally, and they passed.

So nnodes is, I think, purely an mp concept? In the Ray world the scheduling should be handled by Ray's scheduler, and theoretically we can have worlds where nodes have different GPU counts (4, 8, etc.). This test fails in particular because nnodes defaults to 1, and then local_count becomes 8//1=8.

Maybe let's apply the patch and I will rerun the full Ray test suite again? (I don't think I can push to your PR.)

I'm OK applying the patch but I feel that
This should be allowed for the MP backend as well. Let's get it landed and then revisit this in a follow-up.
The local_world_size assertion fires incorrectly on multi-node Ray setups because nnodes is only set for the "mp" backend. With Ray, nnodes stays 1 so local_world_size equals the full world_size (e.g. 8), while assigned_physical_gpu_ids correctly contains only the per-node GPUs (e.g. 4). Guard the assertion to skip Ray/external_launcher since the local_rank bound check already validates the mapping.

Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
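The guard described in this commit can be illustrated with a small sketch. It assumes the original assertion compared local_world_size against the length of assigned_physical_gpu_ids, which the commit message implies but does not state; `check_gpu_assignment` is a hypothetical function name.

```python
def check_gpu_assignment(
    world_size: int,
    nnodes: int,
    backend: str,
    local_rank: int,
    assigned_physical_gpu_ids: list[int],
) -> None:
    """Hypothetical validation mirroring the guard in the commit message."""
    # The local_rank bound check applies to every backend.
    if not 0 <= local_rank < len(assigned_physical_gpu_ids):
        raise ValueError(f"local_rank {local_rank} has no assigned GPU")

    # nnodes is only meaningful for "mp"; with Ray it stays at 1, so the
    # size comparison below would be wrong on multi-node setups.
    if backend in ("ray", "external_launcher"):
        return

    local_world_size = world_size // nnodes
    # Assumed form of the assertion: one assigned GPU per local worker.
    if local_world_size != len(assigned_physical_gpu_ids):
        raise ValueError(
            f"expected {local_world_size} assigned GPUs, "
            f"got {len(assigned_physical_gpu_ids)}"
        )


# Two Ray nodes with 4 GPUs each: world_size is 8 but nnodes stays 1.
check_gpu_assignment(8, 1, "ray", local_rank=3,
                     assigned_physical_gpu_ids=[0, 1, 2, 3])
```

With the "mp" backend the same arguments would raise, since 8 // 1 = 8 does not match the 4 assigned GPUs.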

new release build here: https://buildkite.com/ray-project/release/builds/97596

Our team is checking this cause it might be a new upstream regression: But this looks like it might have to do with this PR:

…ally, add --device-ids

Re-applies the latest net diff of vllm-project#45026 ("Stop setting CUDA_VISIBLE_DEVICES internally in vLLM, add device_ids arg") on top of releases/v0.23.0, for validation via Ray LLM release tests. This replaces the previous backport, which was cut from an earlier revision of the PR.

The net PR diff (merge-base(main, head)..head; head efdcc25, base 35e4dd4; 24 files) was applied with a 3-way merge. Only vllm/engine/arg_utils.py conflicted: the PR hunk carries the create_diffusion_config() method as context, which does not exist in 0.23.0. Resolved by keeping only the added _resolve_device_ids() method and dropping the create_diffusion_config context (DiffusionConfig is not present in 0.23.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
kouroshHakha left a comment
ok all ray tests fully passed: ray-project/ray#64189

@AndreasKaratzas I don't see how any of the test failures could be related to this PR. They may be popping up here just because I enabled
…arg (vllm-project#45026) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Codex <codex@openai.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
…arg (vllm-project#45026) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Codex <codex@openai.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
This PR changes the way vLLM interacts with the CUDA_VISIBLE_DEVICES (CVD) environment variable:
--device-ids argument so that users don't need to set
This has a few benefits:
CUDA_VISIBLE_DEVICES #32569)
One thing this implies is that we need to decouple the concept of "local rank" from "device id" inside of vLLM, so there are a lot of changes caused by that.