Stop setting CUDA_VISIBLE_DEVICES internally in vLLM, add device_ids arg by tlrmchlsmth · Pull Request #45026 · vllm-project/vllm

tlrmchlsmth · 2026-06-09T14:35:39Z

This PR changes the way vLLM interacts with the CUDA_VISIBLE_DEVICES (CVD) environment variable:

- With this PR, vLLM no longer sets CVD to control which GPU a worker uses.
- This PR adds a --device-ids argument so that users don't need to set CVD.

This has a few benefits:

- vLLM currently does not support the UUID format for device IDs, which is problematic when using it with MIG (should fix "[Feature]: Support GPU UUID in CUDA_VISIBLE_DEVICES", #32569).
- There is currently no way to use external DP load balancing with DeepGEMM MegaMoE, because setting CVD for each individual rank isolates GPUs and prevents NCCL from initializing (fixes "[Bug]: External DP Load Balancing plus DeepGEMM MegaMoE", #44556).
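To illustrate the UUID point: CUDA_VISIBLE_DEVICES accepts either integer ordinals or UUID strings (for example with a `GPU-` or `MIG-` prefix), so any code that parses every entry with `int()` breaks on MIG setups. The sketch below is not vLLM code; `parse_visible_devices` is a hypothetical helper that only shows one way to tolerate both forms.

```python
from __future__ import annotations


def parse_visible_devices(cvd: str) -> list[int | str]:
    """Split a CUDA_VISIBLE_DEVICES value into ordinals and UUID strings.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    entries: list[int | str] = []
    for token in cvd.split(","):
        token = token.strip()
        if not token:
            continue
        if token.isdigit():
            # Plain integer ordinal such as "0" or "3".
            entries.append(int(token))
        else:
            # Anything else is kept verbatim, e.g. a "GPU-..." or "MIG-..." UUID.
            entries.append(token)
    return entries


if __name__ == "__main__":
    print(parse_visible_devices("0,2"))
    print(parse_visible_devices("MIG-abc, 1"))
```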

One thing this implies is that we need to decouple the concept of "local rank" from "device id" inside of vLLM, so there are a lot of changes caused by that.

Stop hiding GPUs from vLLM worker processes via CUDA_VISIBLE_DEVICES. Instead, workers address GPUs by their real physical device IDs (e.g., torch.device("cuda:2") directly). This enables correct GPU-to-NIC mapping for RDMA/NIXL transfers and fixes DeepGEMM MegaMoE initialization in multi-GPU scenarios.

Core mechanism: executors populate assigned_gpu_ids on ParallelConfig with physical GPU IDs. Workers index into this list with local_rank to resolve their device. A process-level mapping in the platform interface ensures device_id_to_physical_device_id() works without CVD.

Also fixes a correctness bug in the MLA CUTLASS kernel that hardcoded device 0 for SM count queries regardless of the worker's actual device.

Co-authored-by: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Pop it in worker_base.init_worker (same pattern as shared_worker_lock) so the shared config stays immutable across workers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Allows specifying physical GPU IDs without CUDA_VISIBLE_DEVICES, e.g. `vllm serve model --device-ids 2,3,5,7`. This preserves full GPU topology visibility needed for GPU-NIC affinity (RDMA/NIXL) and DeepGEMM MegaMoE initialization.

Composes with CUDA_VISIBLE_DEVICES: when both are set, --device-ids values are interpreted as indices into the CVD-visible set. E.g. CVD=4,5,6,7 --device-ids 0,1 uses physical GPUs 4 and 5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
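The composition rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `resolve_device_ids` is a hypothetical name, and the sketch assumes CVD contains only integer ordinals (no UUIDs). It also includes a bounds check in the spirit of the "bounds check for --device-ids vs CVD composition" mentioned in a later commit.

```python
from __future__ import annotations


def resolve_device_ids(device_ids: list[int], cvd: str | None) -> list[int]:
    """Map --device-ids values to physical GPU IDs.

    Hypothetical helper for illustration. Assumes integer-only CVD entries.
    """
    if cvd is None:
        # Without CVD, the values are taken as physical GPU IDs directly.
        return list(device_ids)

    visible = [int(tok) for tok in cvd.split(",") if tok.strip()]
    resolved = []
    for idx in device_ids:
        # With CVD set, each value is an index into the CVD-visible set.
        if not 0 <= idx < len(visible):
            raise ValueError(
                f"--device-ids entry {idx} is out of range for "
                f"CUDA_VISIBLE_DEVICES={cvd!r} ({len(visible)} visible devices)"
            )
        resolved.append(visible[idx])
    return resolved


if __name__ == "__main__":
    print(resolve_device_ids([0, 1], "4,5,6,7"))
    print(resolve_device_ids([2, 3, 5, 7], None))
```

With CVD=4,5,6,7 and --device-ids 0,1 the sketch yields physical GPUs 4 and 5, matching the example in the commit message; an index such as 4 would be rejected because only four devices are visible.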

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Workers use logical CUDA indices for torch.device (CVD stays set for process isolation in multi-instance deployments). assigned_gpu_ids is purely a topology lookup table for NIC affinity and P2P checks.

- Stop popping CUDA_VISIBLE_DEVICES in multiproc_executor
- Workers always use local_rank for torch.device, not physical IDs
- Replace inline physical-device-ID logic in custom_all_reduce and quick_all_reduce with device_id_to_physical_device_id()
- Remove device_index param threading through parallel_state
- Convert set_device_control_env_var from context manager to plain fn
- Add bounds check for --device-ids vs CVD composition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

local_rank is the logical rank within the node (for IPC, shared memory, port offsets). device_index is the physical CUDA device ordinal, computed from assigned_gpu_ids[local_rank] when --device-ids is used. Without this separation, all processes on a node use cuda:0 when CVD is not set.

- Add device_index param to GroupCoordinator, StatelessGroupCoordinator, init_world_group, init_model_parallel_group, init_distributed_environment, _init_process_group_for_split_group, _init_elastic_ep_world
- Store device_index on world group; propagate to all parallel groups
- KV connectors use device_index for device selection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
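The local_rank / device_index split described in this commit message can be sketched as below. `device_index_for` is a hypothetical helper, and the fallback to local_rank when no assigned_gpu_ids list is given is an assumption made for this illustration rather than something the commit message states.

```python
from __future__ import annotations


def device_index_for(local_rank: int, assigned_gpu_ids: list[int] | None) -> int:
    """Resolve the CUDA device ordinal for a worker.

    Hypothetical helper for illustration. local_rank stays the logical
    per-node rank; the return value is the device ordinal to use.
    """
    if assigned_gpu_ids is None:
        # Assumption: without --device-ids, the ordinal equals local_rank.
        return local_rank
    if not 0 <= local_rank < len(assigned_gpu_ids):
        raise IndexError(
            f"local_rank {local_rank} is out of bounds for "
            f"assigned_gpu_ids {assigned_gpu_ids}"
        )
    return assigned_gpu_ids[local_rank]


if __name__ == "__main__":
    # Four workers on one node, launched with --device-ids 2,3,5,7.
    for rank in range(4):
        print(rank, "->", f"cuda:{device_index_for(rank, [2, 3, 5, 7])}")
```

Here local rank 2 maps to cuda:5, while values such as IPC names or port offsets would still be derived from the logical rank 2.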

Instead of threading device_index through every init_model_parallel_group and init_stateless_group call, child GroupCoordinators now read it directly from the world group. Only init_world_group needs the param. Assert that the world group exists when creating child groups rather than silently falling back to local_rank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

# Conflicts:
#	vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py

mergify · 2026-06-09T14:36:19Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tlrmchlsmth · 2026-06-18T19:25:49Z

@kouroshHakha sure, holding off for now

kouroshHakha · 2026-06-18T19:44:19Z

in particular, this test failed with the following error:

 File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/ray_executor_v2.py", line 163, in initialize_worker
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     super().__init__(
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 626, in __init__
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     self.worker.init_device()
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 331, in init_device
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     self.worker.init_device()  # type: ignore
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     ^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 285, in init_device
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)     assert self.parallel_config.local_world_size <= len(
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947) AssertionError: local_world_size (8) exceeds assigned_physical_gpu_ids count (4)
(ServeReplica:default:LLMServer:opt-1_3b pid=11691, ip=10.0.197.164) (EngineCore pid=11947) (RayWorkerProc pid=11700, ip=10.0.242.125) WARNING 06-18 12:41:51 [worker_base.py:306] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.

tlrmchlsmth · 2026-06-18T21:43:01Z

ok so we replaced
assert self.parallel_config.local_world_size <= visible_device_count
with:
assert self.parallel_config.local_world_size <= len(assigned_physical_gpu_ids)

Plausibly it was passing before because:

- The test runs on a machine with 8 GPUs
- RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 is set, so vLLM sees all 8

I think the question is why is self.parallel_config.local_world_size equal to 8? It should be four in this test. Looks like self.vllm_config.parallel_config.nnodes is not set when using ray, but should it be?

kouroshHakha · 2026-06-18T21:51:41Z

@tlrmchlsmth I think we should just remove the assertion

I applied this patch and reran the failed tests locally and it passed.

--- gpu_worker.orig.py	2026-06-18 14:05:57.743622184 -0700
+++ gpu_worker.new.py	2026-06-18 14:06:17.070738084 -0700
@@ -282,13 +282,25 @@
                     f"local_rank {self.local_rank} is out of bounds for "
                     f"assigned_physical_gpu_ids {assigned_physical_gpu_ids}"
                 )
-                assert self.parallel_config.local_world_size <= len(
-                    assigned_physical_gpu_ids
-                ), (
-                    f"local_world_size ({self.parallel_config.local_world_size})"
-                    " exceeds assigned_physical_gpu_ids count "
-                    f"({len(assigned_physical_gpu_ids)})"
-                )
+                # NOTE(patch pr45026): local_world_size is derived from
+                # parallel_config.nnodes, which is only set for the "mp"
+                # multi-node backend. With the "ray"/"external_launcher"
+                # backends nnodes stays 1, so local_world_size collapses to
+                # the full world_size and this check wrongly fires on
+                # cross-node deployments. assigned_physical_gpu_ids is already
+                # per-node and the local_rank bound above fully validates the
+                # mapping for these backends, so skip the check for them.
+                if parallel_config.distributed_executor_backend not in (
+                    "ray",
+                    "external_launcher",
+                ):
+                    assert self.parallel_config.local_world_size <= len(
+                        assigned_physical_gpu_ids
+                    ), (
+                        f"local_world_size ({self.parallel_config.local_world_size})"
+                        " exceeds assigned_physical_gpu_ids count "
+                        f"({len(assigned_physical_gpu_ids)})"
+                    )
             else:
                 assert self.local_rank < torch.accelerator.device_count(), (
                     f"DP adjusted local rank {self.local_rank} is out of "

kouroshHakha · 2026-06-18T21:54:20Z

So nnodes is, I think, purely an mp concept? In the Ray world the scheduling should be handled by Ray's scheduler, and theoretically we can have worlds where nodes have different gpu counts (4, 8, etc). This test fails in particular because nnodes defaults to 1, so local_world_size becomes 8 // 1 = 8.
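The arithmetic behind the failing assertion can be reproduced in a few lines. This is a simplified model based only on the numbers quoted in this thread (world size 8, 4 GPUs assigned on the node); `local_world_size` here is a stand-in function, not the actual ParallelConfig property.

```python
def local_world_size(world_size: int, nnodes: int = 1) -> int:
    """Stand-in for the derived value discussed above: world_size // nnodes."""
    return world_size // nnodes


world_size = 8
assigned_physical_gpu_ids = [0, 1, 2, 3]  # per-node list: 4 GPUs on this node

# With Ray, nnodes is left at its default of 1, so the value collapses to 8.
ray_view = local_world_size(world_size)
# If nnodes reflected the real two-node layout, the value would be 4.
two_node_view = local_world_size(world_size, nnodes=2)

print(ray_view <= len(assigned_physical_gpu_ids))       # the check that fired
print(two_node_view <= len(assigned_physical_gpu_ids))  # would have passed
```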

kouroshHakha · 2026-06-18T21:54:55Z

Maybe let's apply the patch and I will rerun the full ray test suite again? (I don't think I can push over your pr)

tlrmchlsmth · 2026-06-18T22:05:02Z

I'm OK applying the patch but I feel that parallel_config.nnodes should return the number of nodes in all cases

> theoretically we can have worlds where nodes have different gpu counts (4, 8, etc)

This should be allowed for the MP backend as well

Let's get it landed and then revisit this in follow up

The local_world_size assertion fires incorrectly on multi-node Ray setups because nnodes is only set for the "mp" backend. With Ray, nnodes stays 1 so local_world_size equals the full world_size (e.g. 8), while assigned_physical_gpu_ids correctly contains only the per-node GPUs (e.g. 4). Guard the assertion to skip Ray/external_launcher since the local_rank bound check already validates the mapping.

Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

kouroshHakha · 2026-06-18T23:01:41Z

new release build here: https://buildkite.com/ray-project/release/builds/97596

AndreasKaratzas · 2026-06-18T23:43:24Z

Our team is checking this because it might be a new upstream regression:
https://buildkite.com/vllm/ci/builds/73005/canvas?jid=019edcca-7d95-4ac2-b8a1-6d59c889a5b1&tab=output#L7129

But this looks like it might have to do with this PR:
https://buildkite.com/vllm/ci/builds/73005/canvas?jid=019edd19-cb50-4b8b-abb9-f343794717ba&tab=output

…ally, add --device-ids

Re-applies the latest net diff of vllm-project#45026 ("Stop setting CUDA_VISIBLE_DEVICES internally in vLLM, add device_ids arg") on top of releases/v0.23.0, for validation via Ray LLM release tests. This replaces the previous backport, which was cut from an earlier revision of the PR.

The net PR diff (merge-base(main, head)..head; head efdcc25, base 35e4dd4; 24 files) was applied with a 3-way merge. Only vllm/engine/arg_utils.py conflicted: the PR hunk carries the create_diffusion_config() method as context, which does not exist in 0.23.0. Resolved by keeping only the added _resolve_device_ids() method and dropping the create_diffusion_config context (DiffusionConfig is not present in 0.23.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

kouroshHakha

ok all ray tests fully passed: ray-project/ray#64189

tlrmchlsmth · 2026-06-19T17:05:32Z

@AndreasKaratzas I don't see how any of the test failures could be related to this PR. They may be popping up here just because I enabled ready-run-all-tests

…arg (vllm-project#45026)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>

…arg (vllm-project#45026)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <kouroshHakha@users.noreply.github.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>

tlrmchlsmth and others added 11 commits June 5, 2026 10:53

Clean up ray_executor assigned_gpu_ids: move import to top level

8ee5608

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Add unit tests for --device-ids composability with CVD

ef3c7f4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

dont hash device id

7161a16

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

fixup

1f23251

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Merge remote-tracking branch 'origin/main' into no_set_cvd

17c9a35

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

# Conflicts:
#	vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py

tlrmchlsmth requested review from ApostaC, NickLucche, ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, njhill, orozery, robertgshaw2-redhat, xuechendi, yewentao256 and youkaichao as code owners June 9, 2026 14:35

tlrmchlsmth changed the title ~~Stop setting CUDA_VISIBLE_DEVICES internally in vLLM~~ Jun 9, 2026

mergify Bot added nvidia v1 labels Jun 9, 2026

github-project-automation Bot added this to NVIDIA Jun 9, 2026

mergify Bot added the needs-rebase label Jun 9, 2026

tlrmchlsmth and others added 2 commits June 18, 2026 18:06

Merge branch 'main' into no_set_cvd

efdcc25

tlrmchlsmth removed the ready-run-all-tests Trigger CI with all tests for wide-ranging PRs label Jun 18, 2026

jeffreywang88 mentioned this pull request Jun 19, 2026

[DO NOT MERGE][llm][ci] Test vllm's CUDA_VISIBLE_DEVICES fix ray-project/ray#64189

Closed

kouroshHakha approved these changes Jun 19, 2026

tlrmchlsmth and others added 2 commits June 19, 2026 10:58

Merge branch 'main' into no_set_cvd

e3a87b9

Merge branch 'main' into no_set_cvd

75e2f9d

Merge branch 'main' into no_set_cvd

c500bfa

mgoin approved these changes Jun 20, 2026

vllm-bot merged commit ebfbcfe into main Jun 20, 2026
197 of 204 checks passed

vllm-bot deleted the no_set_cvd branch June 20, 2026 20:38

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 20, 2026

tlrmchlsmth mentioned this pull request Jun 20, 2026

[MegaMoE] Support external DP load balancing with DeepGEMM MegaMoE #44555

Closed

mganczarenko mentioned this pull request Jul 1, 2026

[XPU][Bugfix] Fix GroupCoordinator device_index #47295

Open

4 tasks

priyamkarn mentioned this pull request Jul 1, 2026

[Bugfix][CPU] Fix data-parallel EngineCore processes all binding to the same NUMA node #47336

Open

vllm-bot merged 42 commits into main from no_set_cvd

8 participants
