[core] Fix accelerator detection on NVIDIA Blackwell consumer GPUs by micah-yong-ai · Pull Request #63322 · ray-project/ray

micah-yong-ai · 2026-05-13T17:35:01Z

Description

Ray 2.50.x fails to register any GPU resources on hosts with NVIDIA Blackwell-class consumer GPUs (e.g. RTX 5090, driver 570.x):

TimeoutError: Placement group creation timed out. Make sure your cluster has enough resources.
Error: No available node types can fulfill resource request {'GPU': 1.0}

Two independent bugs interact:

1. TPU false positive on /dev/accel*. NVIDIA driver 570.x (Blackwell) creates /dev/accel/accel0 on the host. TPUAcceleratorManager.get_current_node_num_accelerators uses glob.glob("/dev/accel*") to detect TPU chips and reports TPU == 1, which then steals the resource slot from the NVIDIA detector and GPU is never registered.

Evidence chain on an RTX 5090 host (driver 570.211.01):

| Layer | Sees GPU? | Detail |
|-------|-----------|--------|
| `nvidia-smi` | ✅ | `NVIDIA GeForce RTX 5090, 32607 MiB, 570.211.01` |
| `torch.cuda` | ✅ | `device_count()=1, is_available()=True` |
| Ray `NvidiaGPUAcceleratorManager` (pynvml) | ✅ | `get_current_node_num_accelerators()=1` |
| Ray `TPUAcceleratorManager` | **false-positive 1** | `glob("/dev/accel*")` matches NVIDIA device file |
| `ray.cluster_resources()` | **no GPU** | `{'TPU': 1.0, 'CPU': 24.0, ...}` |
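
The glob match can be reproduced without a GPU. The sketch below builds a throwaway directory that mimics the layout described above (an `accel` directory containing `accel0`) and applies the same `accel*` pattern. The helper names `count_accel_matches` and `demo_false_positive` and the temporary directory are illustrative assumptions, not Ray code.

```python
import glob
import os
import tempfile


def count_accel_matches(dev_root: str) -> int:
    """Count entries directly under dev_root whose name matches 'accel*'."""
    return len(glob.glob(os.path.join(dev_root, "accel*")))


def demo_false_positive() -> int:
    """Mimic a /dev/accel/accel0 layout in a temp dir and count the matches."""
    with tempfile.TemporaryDirectory() as fake_dev:
        accel_dir = os.path.join(fake_dev, "accel")
        os.makedirs(accel_dir)
        open(os.path.join(accel_dir, "accel0"), "w").close()
        # 'accel*' matches the 'accel' directory entry itself, so the count
        # is non-zero even though no TPU chip is present.
        return count_accel_matches(fake_dev)
```

Because the pattern matches on the entry name alone, a name-only check cannot tell a TPU device node apart from an unrelated one.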

Fix: only count /dev/accel* as TPU chips when TPU_ACCELERATOR_TYPE is set in the environment. Real TPU VMs (GCE / GKE) always set this env var (the constant is already defined as GKE_TPU_ACCELERATOR_TYPE_ENV_VAR at the top of tpu.py). The /dev/vfio/* fallback for non-GKE TPU hosts is preserved.
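
A minimal sketch of the gating described above, not the code merged in this PR. The function name `count_tpu_chips` and its `env`, `accel_glob`, and `vfio_dir` parameters are hypothetical, added so the logic can be exercised against a fake device tree. Counting numeric entries under `/dev/vfio` is an assumption about how the fallback works.

```python
import glob
import os
from typing import Mapping, Optional

TPU_ACCELERATOR_TYPE_ENV_VAR = "TPU_ACCELERATOR_TYPE"


def count_tpu_chips(
    env: Optional[Mapping[str, str]] = None,
    accel_glob: str = "/dev/accel*",
    vfio_dir: str = "/dev/vfio",
) -> int:
    """Count TPU chips, trusting accel* matches only on a declared TPU host."""
    env = os.environ if env is None else env

    # Only treat accel* entries as TPU chips when the TPU env var is set.
    if TPU_ACCELERATOR_TYPE_ENV_VAR in env:
        accel_entries = glob.glob(accel_glob)
        if accel_entries:
            return len(accel_entries)

    # Fallback (assumed behavior): count numeric entries under the vfio dir.
    try:
        vfio_entries = os.listdir(vfio_dir)
    except OSError:
        return 0
    return sum(1 for entry in vfio_entries if entry.isdigit())
```

With this shape, a host that has `accel*` entries but no `TPU_ACCELERATOR_TYPE` reports zero TPU chips unless the vfio directory says otherwise.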

2. NVIDIA GPU name regex captures only "G" on consumer cards. NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)") was designed for datacenter cards ("Tesla V100-SXM2-16GB" → "V100", "NVIDIA A100-SXM4-40GB" → "A100"). On a consumer card name like "NVIDIA GeForce RTX 5090" the regex stops at the lowercase e in GeForce and captures just "G", producing a useless accelerator_type:G label.

Fix: when the existing regex returns a result of length ≤1, fall back to a hyphen-joined product name. "NVIDIA GeForce RTX 5090" → "GeForce-RTX-5090". The original TODO(Alex) comment noted this exact concern — this PR addresses it without regressing the Tesla/datacenter behavior.
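
The fallback can be sketched as follows. The regex is the one quoted above; the function name `gpu_name_to_accelerator_type` and the rule of dropping a leading `NVIDIA` token before joining are assumptions inferred from the single example given ("NVIDIA GeForce RTX 5090" → "GeForce-RTX-5090"), so the merged code may differ.

```python
import re
from typing import Optional

NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def gpu_name_to_accelerator_type(name: Optional[str]) -> Optional[str]:
    """Map an NVML product name to a short accelerator type label."""
    if not name:
        return None

    match = NVIDIA_GPU_NAME_PATTERN.match(name)
    short_name = match.group(1) if match else None
    # Datacenter names ("Tesla V100-SXM2-16GB") yield a usable token here.
    if short_name and len(short_name) > 1:
        return short_name

    # Fallback for consumer names: drop the vendor token (assumption) and
    # join the remaining words with hyphens.
    tokens = name.split()
    if tokens and tokens[0].upper() == "NVIDIA":
        tokens = tokens[1:]
    return "-".join(tokens) or None
```

Keeping the original regex as the first step means existing `accelerator_type` labels for datacenter cards are unchanged; only names that previously collapsed to a single character take the new path.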

After the fix

>>> ray.cluster_resources()
{'GPU': 1.0,
 'accelerator_type:GeForce-RTX-5090': 1.0,
 'CPU': 24.0,
 ...}

Test plan

pytest python/ray/tests/accelerators/test_tpu.py \
       python/ray/tests/accelerators/test_nvidia_gpu.py
# 71 passed locally

New/updated cases:

- `test_autodetect_num_tpus_accel_ignored_without_tpu_env`: exercises the NVIDIA-Blackwell false-positive scenario.
- `test_set_tpu_visible_ids_and_bounds` now sets `TPU_ACCELERATOR_TYPE` inside its cleared env block (matches real TPU VMs).
- `test_gpu_name_to_accelerator_type` parametrized over `Tesla V100-SXM2-16GB`, `Tesla K80`, `NVIDIA A100-SXM4-40GB`, `NVIDIA H100 80GB HBM3`, `NVIDIA GeForce RTX 5090`, `NVIDIA GeForce RTX 4090`, `None`, and `""`.

Additional information

This problem will get more common as Blackwell-class hardware (consumer RTX 5xxx, plus B200 datacenter cards which also ship with driver 570.x) reaches more users. The same patch has already been validated end-to-end in an Applied Intuition fork running ray==2.50.1; opening here so the fix benefits everyone.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f2c1572cc

gemini-code-assist

Code Review

This pull request improves NVIDIA GPU name parsing to support consumer cards like the RTX 5090 and prevents Blackwell-class GPUs from being misidentified as TPUs by verifying the presence of a TPU environment variable before trusting /dev/accel* device files. Feedback suggests refining the TPU detection logic to allow falling through to alternative detection methods (e.g., /dev/vfio) if the environment check fails, rather than returning zero immediately.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8f2c1572ccd4b95e938ce5a53eea01f2dd4f91c5.

edoakes · 2026-05-13T21:00:18Z

Thanks @micah-yong-ai

Signed-off-by: Micah <micah@applied.co>

…ay-project#63322)

Closes ray-project#45302.

Signed-off-by: Micah <micah@applied.co>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

micah-yong-ai requested a review from a team as a code owner May 13, 2026 17:35

chatgpt-codex-connector Bot reviewed May 13, 2026

Comment thread python/ray/_private/accelerators/tpu.py Outdated

gemini-code-assist Bot reviewed May 13, 2026

Comment thread python/ray/_private/accelerators/tpu.py Outdated

cursor Bot reviewed May 13, 2026

Comment thread python/ray/_private/accelerators/tpu.py Outdated

micah-yong-ai force-pushed the micah/fix-blackwell-gpu-detection-upstream branch from 8f2c157 to 704624c Compare May 13, 2026 17:48

micah-yong-ai mentioned this pull request May 13, 2026

[Core] Incorrectly detected TPU on a HPU-only node. #45302

Closed

ray-gardener Bot added the `core` (Issues that should be addressed in Ray Core) and `community-contribution` (Contributed by the community) labels May 13, 2026

edoakes approved these changes May 13, 2026

edoakes added the `go` (add ONLY when ready to merge, run all tests) label May 13, 2026

edoakes enabled auto-merge (squash) May 13, 2026 21:00

Fix accelerator detection on NVIDIA Blackwell consumer GPUs

3b8db4b

Signed-off-by: Micah <micah@applied.co>

auto-merge was automatically disabled May 13, 2026 21:19
Head branch was pushed to by a user without write access

micah-yong-ai force-pushed the micah/fix-blackwell-gpu-detection-upstream branch from 704624c to 3b8db4b Compare May 13, 2026 21:19

edoakes merged commit 5d7bc7a into ray-project:master May 14, 2026
6 checks passed

andrewsykim added the `backport-candidate` (Label to identify PRs that should be considered for backport to older versions.) label May 29, 2026
