Stop setting CUDA_VISIBLE_DEVICES internally in vLLM, add device_ids arg#45026

Merged
vllm-bot merged 42 commits into main from no_set_cvd
Jun 20, 2026

@tlrmchlsmth tlrmchlsmth commented Jun 9, 2026

Member

This PR changes the way vLLM interacts with the CUDA_VISIBLE_DEVICES (CVD) environment variable:

  • With this PR, vLLM no longer sets CVD to control which GPU a worker uses.
  • This PR adds a --device-ids argument so that users don't need to set CVD.

This has a few benefits:

One implication is that the concept of "local rank" must be decoupled from "device id" inside vLLM, which accounts for many of the changes.

tlrmchlsmth and others added 11 commits June 5, 2026 10:53
Stop hiding GPUs from vLLM worker processes via CUDA_VISIBLE_DEVICES.
Instead, workers address GPUs by their real physical device IDs
(e.g., torch.device("cuda:2") directly). This enables correct
GPU-to-NIC mapping for RDMA/NIXL transfers and fixes DeepGEMM
MegaMoE initialization in multi-GPU scenarios.

Core mechanism: executors populate assigned_gpu_ids on ParallelConfig
with physical GPU IDs. Workers index into this list with local_rank
to resolve their device. A process-level mapping in the platform
interface ensures device_id_to_physical_device_id() works without CVD.
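
A minimal sketch of the lookup this commit message describes, assuming `assigned_gpu_ids` is a plain list of physical GPU ordinals. The helper name `resolve_device_index` is hypothetical and is not taken from the PR.

```python
from typing import List


def resolve_device_index(assigned_gpu_ids: List[int], local_rank: int) -> int:
    """Map a worker's logical local_rank to a physical GPU ordinal.

    Hypothetical helper that only mirrors the commit description.
    """
    if not 0 <= local_rank < len(assigned_gpu_ids):
        raise ValueError(
            f"local_rank {local_rank} is out of bounds for "
            f"assigned_gpu_ids {assigned_gpu_ids}"
        )
    return assigned_gpu_ids[local_rank]


# Illustrative values: the executor assigned physical GPUs 2, 3, 5 and 7
# to four local workers (local_rank 0..3).
assigned = [2, 3, 5, 7]
device_strings = [f"cuda:{resolve_device_index(assigned, r)}" for r in range(4)]
print(device_strings)  # ['cuda:2', 'cuda:3', 'cuda:5', 'cuda:7']
```

The bounds check is there so that a rank without an assigned GPU fails loudly instead of silently landing on the wrong device.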

Also fixes a correctness bug in the MLA CUTLASS kernel that hardcoded
device 0 for SM count queries regardless of the worker's actual device.

Co-authored-by: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Pop it in worker_base.init_worker (same pattern as shared_worker_lock)
so the shared config stays immutable across workers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Allows specifying physical GPU IDs without CUDA_VISIBLE_DEVICES,
e.g. `vllm serve model --device-ids 2,3,5,7`. This preserves full
GPU topology visibility needed for GPU-NIC affinity (RDMA/NIXL) and
DeepGEMM MegaMoE initialization.

Composes with CUDA_VISIBLE_DEVICES: when both are set, --device-ids
values are interpreted as indices into the CVD-visible set. E.g.
CVD=4,5,6,7 --device-ids 0,1 uses physical GPUs 4 and 5.
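
The composition rule above can be sketched as follows. This is an illustration only: `resolve_device_ids` here is a hypothetical standalone function, it takes the CVD string as a parameter instead of reading the environment, and it assumes CVD holds comma-separated integer ordinals (GPU UUIDs are not handled).

```python
from typing import List, Optional


def resolve_device_ids(device_ids: List[int], cvd: Optional[str]) -> List[int]:
    """Translate --device-ids values into physical GPU ordinals.

    Hypothetical helper. When CVD is unset, the values are already
    physical ordinals; when CVD is set, each value is an index into
    the CVD-visible list.
    """
    if not cvd:
        return list(device_ids)
    visible = [int(tok) for tok in cvd.split(",") if tok.strip()]
    resolved = []
    for idx in device_ids:
        # Bounds check: an index must point inside the CVD-visible set.
        if not 0 <= idx < len(visible):
            raise ValueError(
                f"--device-ids entry {idx} is out of range for "
                f"CUDA_VISIBLE_DEVICES={cvd!r} ({len(visible)} visible devices)"
            )
        resolved.append(visible[idx])
    return resolved


print(resolve_device_ids([0, 1], "4,5,6,7"))   # [4, 5]
print(resolve_device_ids([2, 3, 5, 7], None))  # [2, 3, 5, 7]
```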

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Workers use logical CUDA indices for torch.device (CVD stays set
for process isolation in multi-instance deployments). assigned_gpu_ids
is purely a topology lookup table for NIC affinity and P2P checks.

- Stop popping CUDA_VISIBLE_DEVICES in multiproc_executor
- Workers always use local_rank for torch.device, not physical IDs
- Replace inline physical-device-ID logic in custom_all_reduce and
  quick_all_reduce with device_id_to_physical_device_id()
- Remove device_index param threading through parallel_state
- Convert set_device_control_env_var from context manager to plain fn
- Add bounds check for --device-ids vs CVD composition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
local_rank is the logical rank within the node (for IPC, shared memory,
port offsets). device_index is the physical CUDA device ordinal, computed
from assigned_gpu_ids[local_rank] when --device-ids is used. Without this
separation, all processes on a node use cuda:0 when CVD is not set.

- Add device_index param to GroupCoordinator, StatelessGroupCoordinator,
  init_world_group, init_model_parallel_group, init_distributed_environment,
  _init_process_group_for_split_group, _init_elastic_ep_world
- Store device_index on world group; propagate to all parallel groups
- KV connectors use device_index for device selection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Instead of threading device_index through every init_model_parallel_group
and init_stateless_group call, child GroupCoordinators now read it
directly from the world group. Only init_world_group needs the param.

Assert that the world group exists when creating child groups rather
than silently falling back to local_rank.
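
A toy sketch of the pattern this commit message describes: child groups read `device_index` from the world group and assert that it exists. The `Group` class and the `make_world_group` / `make_child_group` functions are hypothetical stand-ins, not vLLM's actual `GroupCoordinator` API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    name: str
    local_rank: int     # logical rank within the node
    device_index: int   # physical CUDA device ordinal


_WORLD: Optional[Group] = None


def make_world_group(local_rank: int, device_index: int) -> Group:
    """Only the world group is told the device index explicitly."""
    global _WORLD
    _WORLD = Group("world", local_rank, device_index)
    return _WORLD


def make_child_group(name: str, local_rank: int) -> Group:
    """Child groups inherit device_index from the world group."""
    # Fail loudly rather than silently falling back to local_rank.
    assert _WORLD is not None, "world group must be initialized first"
    return Group(name, local_rank, _WORLD.device_index)


world = make_world_group(local_rank=1, device_index=5)
tp = make_child_group("tp", local_rank=1)
print(tp.device_index)  # 5
```

Keeping the value in one place means only the world-group constructor needs the extra parameter, which is the simplification the commit message describes.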

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

# Conflicts:
#	vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py
@tlrmchlsmth tlrmchlsmth changed the title Stop setting CUDA_VISIBLE_DEVICES internally in vLLM Jun 9, 2026

mergify Bot commented Jun 9, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@tlrmchlsmth

Member Author

@kouroshHakha sure, holding off for now

@kouroshHakha

Collaborator

in particular, this test failed with the following error:

 File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/ray_executor_v2.py", line 163, in initialize_worker
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     super().__init__(
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 626, in __init__
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     self.worker.init_device()
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 331, in init_device
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     self.worker.init_device()  # type: ignore
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     ^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 285, in init_device
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     assert self.parallel_config.local_world_size <= len(
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947) AssertionError: local_world_size (8) exceeds assigned_physical_gpu_ids count (4)
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947) (RayWorkerProc pid=11700, ip=10.0.242.125) WARNING 06-18 12:41:51 [worker_base.py:306] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.

tlrmchlsmth commented Jun 18, 2026

Member Author

ok so we replaced
assert self.parallel_config.local_world_size <= visible_device_count
with:
assert self.parallel_config.local_world_size <= len(assigned_physical_gpu_ids)

Plausibly it was passing before because:

  • The test runs on a machine with 8 GPUs
  • RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 is set so vLLM sees all 8

I think the question is why self.parallel_config.local_world_size is equal to 8. It should be 4 in this test. It looks like self.vllm_config.parallel_config.nnodes is not set when using Ray, but should it be?

@kouroshHakha

Collaborator

@tlrmchlsmth I think we should just remove the assertion

I applied this patch and reran the failed tests locally and it passed.

--- gpu_worker.orig.py	2026-06-18 14:05:57.743622184 -0700
+++ gpu_worker.new.py	2026-06-18 14:06:17.070738084 -0700
@@ -282,13 +282,25 @@
                     f"local_rank {self.local_rank} is out of bounds for "
                     f"assigned_physical_gpu_ids {assigned_physical_gpu_ids}"
                 )
-                assert self.parallel_config.local_world_size <= len(
-                    assigned_physical_gpu_ids
-                ), (
-                    f"local_world_size ({self.parallel_config.local_world_size})"
-                    " exceeds assigned_physical_gpu_ids count "
-                    f"({len(assigned_physical_gpu_ids)})"
-                )
+                # NOTE(patch pr45026): local_world_size is derived from
+                # parallel_config.nnodes, which is only set for the "mp"
+                # multi-node backend. With the "ray"/"external_launcher"
+                # backends nnodes stays 1, so local_world_size collapses to
+                # the full world_size and this check wrongly fires on
+                # cross-node deployments. assigned_physical_gpu_ids is already
+                # per-node and the local_rank bound above fully validates the
+                # mapping for these backends, so skip the check for them.
+                if parallel_config.distributed_executor_backend not in (
+                    "ray",
+                    "external_launcher",
+                ):
+                    assert self.parallel_config.local_world_size <= len(
+                        assigned_physical_gpu_ids
+                    ), (
+                        f"local_world_size ({self.parallel_config.local_world_size})"
+                        " exceeds assigned_physical_gpu_ids count "
+                        f"({len(assigned_physical_gpu_ids)})"
+                    )
             else:
                 assert self.local_rank < torch.accelerator.device_count(), (
                     f"DP adjusted local rank {self.local_rank} is out of "
@kouroshHakha

Collaborator

I think nnodes is purely an mp concept. In the Ray world the scheduling should be handled by Ray's scheduler, and theoretically we can have worlds where nodes have different gpu counts (4, 8, etc). This test fails in particular because nnodes defaults to 1, so local_world_size becomes 8 // 1 = 8.
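
The arithmetic behind this failure can be sketched as follows, assuming the local world size is computed as `world_size // nnodes` as described in this comment. The function below is a hypothetical stand-in, and the GPU IDs are illustrative.

```python
def local_world_size(world_size: int, nnodes: int = 1) -> int:
    """Hypothetical stand-in: workers per node, assuming equal-sized nodes."""
    return world_size // nnodes


# Two Ray nodes with 4 GPUs each, so world_size is 8, but each node
# is assigned only 4 physical GPUs.
assigned_physical_gpu_ids = [0, 1, 2, 3]
world_size = 8

with_default = local_world_size(world_size)        # nnodes left at 1
with_real_nnodes = local_world_size(world_size, 2)

print(with_default, with_default <= len(assigned_physical_gpu_ids))
# 8 False  -> the assertion fires
print(with_real_nnodes, with_real_nnodes <= len(assigned_physical_gpu_ids))
# 4 True   -> the assertion would hold
```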

@kouroshHakha

Collaborator

Maybe let's apply the patch, and I will rerun the full Ray test suite? (I don't think I can push to your PR.)

@tlrmchlsmth

tlrmchlsmth commented Jun 18, 2026

Member Author

I'm OK applying the patch but I feel that parallel_config.nnodes should return the number of nodes in all cases

> theoretically we can have worlds where nodes have different gpu counts (4, 8, etc)

This should be allowed for the MP backend as well

Let's get it landed and then revisit this in follow up

tlrmchlsmth and others added 2 commits June 18, 2026 18:06
The local_world_size assertion fires incorrectly on multi-node Ray
setups because nnodes is only set for the "mp" backend. With Ray,
nnodes stays 1 so local_world_size equals the full world_size (e.g. 8),
while assigned_physical_gpu_ids correctly contains only the per-node
GPUs (e.g. 4). Guard the assertion to skip Ray/external_launcher since
the local_rank bound check already validates the mapping.

Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@tlrmchlsmth tlrmchlsmth removed the ready-run-all-tests label Jun 18, 2026
@kouroshHakha

Collaborator
@AndreasKaratzas

Member
jeffreywang88 added a commit to jeffreywang88/vllm that referenced this pull request Jun 19, 2026
…ally, add --device-ids

Re-applies the latest net diff of vllm-project#45026 ("Stop setting
CUDA_VISIBLE_DEVICES internally in vLLM, add device_ids arg") on top of
releases/v0.23.0, for validation via Ray LLM release tests. This replaces
the previous backport, which was cut from an earlier revision of the PR.

The net PR diff (merge-base(main, head)..head; head efdcc25, base
35e4dd4; 24 files) was applied with a 3-way merge. Only
vllm/engine/arg_utils.py conflicted: the PR hunk carries the
create_diffusion_config() method as context, which does not exist in
0.23.0. Resolved by keeping only the added _resolve_device_ids() method
and dropping the create_diffusion_config context (DiffusionConfig is not
present in 0.23.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

@kouroshHakha kouroshHakha left a comment

Collaborator

ok all ray tests fully passed: ray-project/ray#64189

@tlrmchlsmth

Member Author

@AndreasKaratzas I don't see how any of the test failures could be related to this PR. They may be popping up here just because I enabled ready-run-all-tests

@vllm-bot vllm-bot merged commit ebfbcfe into main Jun 20, 2026
197 of 204 checks passed
@vllm-bot vllm-bot deleted the no_set_cvd branch June 20, 2026 20:38
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 20, 2026
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…arg (vllm-project#45026)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…arg (vllm-project#45026)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…arg (vllm-project#45026)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…arg (vllm-project#45026)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>

Labels

frontend, kv-connector, nvidia, ready, v1

8 participants