[core] Fix accelerator detection on NVIDIA Blackwell consumer GPUs #63322
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f2c1572cc
Code Review
This pull request improves NVIDIA GPU name parsing to support consumer cards like the RTX 5090 and prevents Blackwell-class GPUs from being misidentified as TPUs by verifying the presence of a TPU environment variable before trusting /dev/accel* device files. Feedback suggests refining the TPU detection logic to allow falling through to alternative detection methods (e.g., /dev/vfio) if the environment check fails, rather than returning zero immediately.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8f2c1572ccd4b95e938ce5a53eea01f2dd4f91c5.
8f2c157 to 704624c
Thanks @micah-yong-ai
Signed-off-by: Micah <micah@applied.co>
Head branch was pushed to by a user without write access
704624c to 3b8db4b
…ay-project#63322)

## Description

Closes ray-project#45302.

Ray 2.50.x fails to register any GPU resources on hosts with NVIDIA Blackwell-class consumer GPUs (e.g. RTX 5090, driver 570.x):

```
TimeoutError: Placement group creation timed out. Make sure your cluster has enough resources.
Error: No available node types can fulfill resource request {'GPU': 1.0}
```

Two independent bugs interact:

**1. TPU false positive on `/dev/accel*`.** NVIDIA driver 570.x (Blackwell) creates `/dev/accel/accel0` on the host. `TPUAcceleratorManager.get_current_node_num_accelerators` uses `glob.glob("/dev/accel*")` to detect TPU chips and reports `TPU == 1`, which then steals the resource slot from the NVIDIA detector and `GPU` is never registered.

Evidence chain on an RTX 5090 host (driver 570.211.01):

| Layer | Sees GPU? | Detail |
|-------|-----------|--------|
| `nvidia-smi` | ✅ | `NVIDIA GeForce RTX 5090, 32607 MiB, 570.211.01` |
| `torch.cuda` | ✅ | `device_count()=1, is_available()=True` |
| Ray `NvidiaGPUAcceleratorManager` (pynvml) | ✅ | `get_current_node_num_accelerators()=1` |
| Ray `TPUAcceleratorManager` | **false-positive 1** | `glob("/dev/accel*")` matches NVIDIA device file |
| `ray.cluster_resources()` | **no GPU** | `{'TPU': 1.0, 'CPU': 24.0, ...}` |

Fix: only count `/dev/accel*` as TPU chips when `TPU_ACCELERATOR_TYPE` is set in the environment. Real TPU VMs (GCE / GKE) always set this env var (the constant is already defined as `GKE_TPU_ACCELERATOR_TYPE_ENV_VAR` at the top of `tpu.py`). The `/dev/vfio/*` fallback for non-GKE TPU hosts is preserved.

**2. NVIDIA GPU name regex captures only `"G"` on consumer cards.** `NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")` was designed for datacenter cards (`"Tesla V100-SXM2-16GB"` → `"V100"`, `"NVIDIA A100-SXM4-40GB"` → `"A100"`). On a consumer card name like `"NVIDIA GeForce RTX 5090"` the regex stops at the lowercase `e` in `GeForce` and captures just `"G"`, producing a useless `accelerator_type:G` label.
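
The capture behaviour described in bug 2 can be reproduced with the pattern quoted above. The snippet below is a minimal sketch, not Ray's code: `extract_gpu_type` and `fallback_gpu_type` are hypothetical helper names, and the use of `re.match`, the dropping of a leading `NVIDIA` token, and the `None` return for empty names are assumptions chosen to agree with the examples in this description. The second helper illustrates one possible shape of the hyphen-joined fallback that the fix described next introduces.

```python
import re
from typing import Optional

# Pattern quoted in the PR description; it was designed for datacenter card names.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def extract_gpu_type(name: Optional[str]) -> Optional[str]:
    """Hypothetical helper: apply the existing pattern from the start of the name."""
    if not name:
        return None
    match = NVIDIA_GPU_NAME_PATTERN.match(name)
    return match.group(1) if match else None


def fallback_gpu_type(name: Optional[str]) -> Optional[str]:
    """Hypothetical helper: use a hyphen-joined name when the capture is too short."""
    captured = extract_gpu_type(name)
    if captured is not None and len(captured) > 1:
        return captured  # datacenter names keep their existing label
    if not name or not name.strip():
        return None  # assumption: empty or missing names yield no label
    tokens = name.split()
    # Assumption: a leading vendor token is dropped so the label starts at the product line.
    if len(tokens) > 1 and tokens[0].upper() == "NVIDIA":
        tokens = tokens[1:]
    return "-".join(tokens)


for gpu_name in (
    "Tesla V100-SXM2-16GB",
    "NVIDIA A100-SXM4-40GB",
    "NVIDIA GeForce RTX 5090",
):
    print(f"{gpu_name!r}: {extract_gpu_type(gpu_name)!r} -> {fallback_gpu_type(gpu_name)!r}")
```

Keeping the original pattern as the first step means names that already produce a usable label are unaffected; only captures of length one or less reach the fallback path.
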
Fix: when the existing regex returns a result of length ≤1, fall back to a hyphen-joined product name. `"NVIDIA GeForce RTX 5090"` → `"GeForce-RTX-5090"`. The original `TODO(Alex)` comment noted this exact concern; this PR addresses it without regressing the Tesla/datacenter behavior.

### After the fix

```python
>>> ray.cluster_resources()
{'GPU': 1.0, 'accelerator_type:GeForce-RTX-5090': 1.0, 'CPU': 24.0, ...}
```

## Test plan

```
pytest python/ray/tests/accelerators/test_tpu.py \
       python/ray/tests/accelerators/test_nvidia_gpu.py
# 71 passed locally
```

New/updated cases:
- `test_autodetect_num_tpus_accel_ignored_without_tpu_env`: exercises the NVIDIA-Blackwell false-positive scenario.
- `test_set_tpu_visible_ids_and_bounds` now sets `TPU_ACCELERATOR_TYPE` inside its cleared env block (matches real TPU VMs).
- `test_gpu_name_to_accelerator_type` parametrized over `Tesla V100-SXM2-16GB`, `Tesla K80`, `NVIDIA A100-SXM4-40GB`, `NVIDIA H100 80GB HBM3`, `NVIDIA GeForce RTX 5090`, `NVIDIA GeForce RTX 4090`, `None`, and `""`.

## Additional information

This problem will get more common as Blackwell-class hardware (consumer RTX 5xxx, plus B200 datacenter cards which also ship with driver 570.x) reaches more users. The same patch has already been validated end-to-end in an Applied Intuition fork running ray==2.50.1; opening here so the fix benefits everyone.

Signed-off-by: Micah <micah@applied.co>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
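
The TPU gating rule from bug 1 in the description above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `tpu.py`: `count_tpu_chips` and its three optional parameters are hypothetical and exist only so the logic can be exercised without real device files, counting numeric entries under `/dev/vfio` is an assumption about how the fallback works, and the `v4-8` value is an illustrative accelerator type.

```python
import glob
import os
from typing import Mapping, Optional, Sequence

# Environment variable named in the PR description.
TPU_ACCELERATOR_TYPE_ENV_VAR = "TPU_ACCELERATOR_TYPE"


def count_tpu_chips(
    environ: Optional[Mapping[str, str]] = None,
    accel_paths: Optional[Sequence[str]] = None,
    vfio_entries: Optional[Sequence[str]] = None,
) -> int:
    """Hypothetical helper: the optional arguments only make the sketch testable."""
    env = os.environ if environ is None else environ
    if accel_paths is None:
        accel_paths = glob.glob("/dev/accel*")

    # Trust /dev/accel* only when the TPU env var is present and non-empty.
    if accel_paths and env.get(TPU_ACCELERATOR_TYPE_ENV_VAR):
        return len(accel_paths)

    # Otherwise fall through to the /dev/vfio fallback instead of returning 0 early.
    if vfio_entries is None:
        try:
            vfio_entries = os.listdir("/dev/vfio")
        except OSError:
            vfio_entries = []
    # Assumption: numeric entries under /dev/vfio correspond to TPU chips.
    return sum(1 for entry in vfio_entries if entry.isdigit())


# Blackwell host: /dev/accel matches the glob, but no TPU env var is set.
print(count_tpu_chips(environ={}, accel_paths=["/dev/accel"], vfio_entries=[]))

# TPU VM: the env var is set, so the /dev/accel* entries are counted.
print(
    count_tpu_chips(
        environ={TPU_ACCELERATOR_TYPE_ENV_VAR: "v4-8"},
        accel_paths=["/dev/accel0", "/dev/accel1", "/dev/accel2", "/dev/accel3"],
        vfio_entries=[],
    )
)
```

Falling through to the `/dev/vfio` check when the environment test fails, rather than returning zero at that point, follows the suggestion in the review summary earlier in this conversation and keeps the fallback for non-GKE TPU hosts reachable.
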
