[core] Fix accelerator detection on NVIDIA Blackwell consumer GPUs #63322

Merged
edoakes merged 1 commit into ray-project:master from micah-yong-ai:micah/fix-blackwell-gpu-detection-upstream on May 14, 2026

Conversation

@micah-yong-ai micah-yong-ai commented May 13, 2026

Contributor

Description

Closes #45302.

Ray 2.50.x fails to register any GPU resources on hosts with NVIDIA Blackwell-class consumer GPUs (e.g. RTX 5090, driver 570.x):

TimeoutError: Placement group creation timed out. Make sure your cluster has enough resources.
Error: No available node types can fulfill resource request {'GPU': 1.0}

Two independent bugs interact:

1. TPU false positive on /dev/accel*. NVIDIA driver 570.x (Blackwell) creates /dev/accel/accel0 on the host. TPUAcceleratorManager.get_current_node_num_accelerators uses glob.glob("/dev/accel*") to detect TPU chips and reports TPU == 1, which then steals the resource slot from the NVIDIA detector and GPU is never registered.

Evidence chain on an RTX 5090 host (driver 570.211.01):

| Layer | Sees GPU? | Detail |
|-------|-----------|--------|
| nvidia-smi | ✅ | NVIDIA GeForce RTX 5090, 32607 MiB, 570.211.01 |
| torch.cuda | ✅ | device_count()=1, is_available()=True |
| Ray NvidiaGPUAcceleratorManager (pynvml) | ✅ | get_current_node_num_accelerators()=1 |
| Ray TPUAcceleratorManager | false-positive 1 | glob("/dev/accel*") matches NVIDIA device file |
| ray.cluster_resources() | no GPU | {'TPU': 1.0, 'CPU': 24.0, ...} |
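
The TPU row can be reproduced without any hardware. The sketch below assumes the layout described above (a directory named accel that contains accel0) and uses a temporary directory as a stand-in for /dev. `count_accel_matches` is a hypothetical helper written for this illustration, not Ray's code.

```python
import glob
import os
import tempfile


def count_accel_matches(dev_root: str) -> int:
    """Count entries under dev_root matched by the 'accel*' pattern."""
    return len(glob.glob(os.path.join(dev_root, "accel*")))


with tempfile.TemporaryDirectory() as dev_root:
    # Mimic the layout from the description: <dev>/accel/accel0.
    os.makedirs(os.path.join(dev_root, "accel"))
    open(os.path.join(dev_root, "accel", "accel0"), "w").close()

    # '*' does not cross path separators, so the pattern matches the
    # 'accel' directory itself: one match, regardless of what is inside.
    matches = count_accel_matches(dev_root)
    print(matches)
```

Under that assumption, a glob-based chip count cannot tell a TPU device node from an unrelated entry that happens to share the accel prefix.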

Fix: only count /dev/accel* as TPU chips when TPU_ACCELERATOR_TYPE is set in the environment. Real TPU VMs (GCE / GKE) always set this env var (the constant is already defined as GKE_TPU_ACCELERATOR_TYPE_ENV_VAR at the top of tpu.py). The /dev/vfio/* fallback for non-GKE TPU hosts is preserved.
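
A minimal sketch of the gating logic described above, not the code merged in this PR. The function name `count_tpu_chips`, the injectable `env` and `glob_fn` parameters, and the choice to count only numeric /dev/vfio entries are assumptions made so the example can run without TPU hardware.

```python
import glob
import os
from typing import Callable, List, Mapping, Optional

# Env var name taken from the description; everything else is hypothetical.
TPU_ACCELERATOR_TYPE_ENV_VAR = "TPU_ACCELERATOR_TYPE"


def count_tpu_chips(
    env: Optional[Mapping[str, str]] = None,
    glob_fn: Callable[[str], List[str]] = glob.glob,
) -> int:
    """Count TPU chips, trusting /dev/accel* only on hosts that look like TPU VMs."""
    env = os.environ if env is None else env

    # Only treat /dev/accel* as TPU chips when the TPU env var is set.
    if env.get(TPU_ACCELERATOR_TYPE_ENV_VAR):
        accel_paths = glob_fn("/dev/accel*")
        if accel_paths:
            return len(accel_paths)

    # Fall through to /dev/vfio/* rather than returning 0 immediately.
    # Assumption: chips appear as numeric entries; other entries are skipped.
    vfio_paths = glob_fn("/dev/vfio/*")
    return len([p for p in vfio_paths if os.path.basename(p).isdigit()])
```

With this shape, a host that has /dev/accel entries but no `TPU_ACCELERATOR_TYPE` reports zero TPU chips unless /dev/vfio says otherwise, which is the Blackwell scenario from the evidence chain.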

2. NVIDIA GPU name regex captures only "G" on consumer cards. NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)") was designed for datacenter cards ("Tesla V100-SXM2-16GB" → "V100", "NVIDIA A100-SXM4-40GB" → "A100"). On a consumer card name like "NVIDIA GeForce RTX 5090" the regex stops at the lowercase e in GeForce and captures just "G", producing a useless accelerator_type:G label.
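
The capture behavior can be checked directly with the pattern quoted above. This sketch assumes the pattern is applied with `match` (anchored at the start of the name); `captured_group` is a hypothetical helper for illustration only.

```python
import re

# Pattern copied from the description.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def captured_group(name: str) -> str:
    """Return the first capture group, or an empty string if nothing matches."""
    match = NVIDIA_GPU_NAME_PATTERN.match(name)
    return match.group(1) if match else ""


for name in (
    "Tesla V100-SXM2-16GB",
    "NVIDIA A100-SXM4-40GB",
    "NVIDIA GeForce RTX 5090",
):
    # The capture stops at the first character outside [A-Z0-9]:
    # '-' for the datacenter names, lowercase 'e' for GeForce.
    print(f"{name!r} -> {captured_group(name)!r}")
```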

Fix: when the existing regex returns a result of length ≤ 1, fall back to a hyphen-joined product name: "NVIDIA GeForce RTX 5090" → "GeForce-RTX-5090". The original TODO(Alex) comment noted this exact concern; this PR addresses it without regressing the Tesla/datacenter behavior.
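
One way the fallback could look, as a sketch under stated assumptions rather than the merged implementation. The function name, stripping a leading "NVIDIA" vendor token before joining, and returning `None` for `None` or empty names are all assumptions; the description only specifies the length ≤ 1 trigger and the "GeForce-RTX-5090" result.

```python
import re
from typing import Optional

# Pattern copied from the description.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def gpu_name_to_accelerator_type(name: Optional[str]) -> Optional[str]:
    """Map a device name to an accelerator type label (illustrative sketch)."""
    if not name:
        # Assumption: missing or empty names yield no label.
        return None

    match = NVIDIA_GPU_NAME_PATTERN.match(name)
    result = match.group(1) if match else None
    if result and len(result) > 1:
        # Datacenter-style names keep the existing behavior.
        return result

    # Fallback for results of length <= 1: hyphen-join the product name.
    # Assumption: a leading "NVIDIA" vendor token is dropped first.
    tokens = name.split()
    if tokens and tokens[0].upper() == "NVIDIA":
        tokens = tokens[1:]
    return "-".join(tokens) if tokens else None
```

Because the fallback only triggers on captures of length ≤ 1, names such as "Tesla K80" or "NVIDIA H100 80GB HBM3" still resolve through the original regex.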

After the fix

>>> ray.cluster_resources()
{'GPU': 1.0,
 'accelerator_type:GeForce-RTX-5090': 1.0,
 'CPU': 24.0,
 ...}

Test plan

pytest python/ray/tests/accelerators/test_tpu.py \
       python/ray/tests/accelerators/test_nvidia_gpu.py
# 71 passed locally

New/updated cases:

  • test_autodetect_num_tpus_accel_ignored_without_tpu_env — exercises the NVIDIA-Blackwell false-positive scenario.
  • test_set_tpu_visible_ids_and_bounds now sets TPU_ACCELERATOR_TYPE inside its cleared env block (matches real TPU VMs).
  • test_gpu_name_to_accelerator_type parametrized over Tesla V100-SXM2-16GB, Tesla K80, NVIDIA A100-SXM4-40GB, NVIDIA H100 80GB HBM3, NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 4090, None, and "".

Additional information

This problem will get more common as Blackwell-class hardware (consumer RTX 5xxx, plus B200 datacenter cards which also ship with driver 570.x) reaches more users. The same patch has already been validated end-to-end in an Applied Intuition fork running ray==2.50.1; opening here so the fix benefits everyone.

@micah-yong-ai micah-yong-ai requested a review from a team as a code owner May 13, 2026 17:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f2c1572cc

Comment thread python/ray/_private/accelerators/tpu.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves NVIDIA GPU name parsing to support consumer cards like the RTX 5090 and prevents Blackwell-class GPUs from being misidentified as TPUs by verifying the presence of a TPU environment variable before trusting /dev/accel* device files. Feedback suggests refining the TPU detection logic to allow falling through to alternative detection methods (e.g., /dev/vfio) if the environment check fails, rather than returning zero immediately.

Comment thread python/ray/_private/accelerators/tpu.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8f2c1572ccd4b95e938ce5a53eea01f2dd4f91c5.

Comment thread python/ray/_private/accelerators/tpu.py Outdated
@micah-yong-ai micah-yong-ai force-pushed the micah/fix-blackwell-gpu-detection-upstream branch from 8f2c157 to 704624c Compare May 13, 2026 17:48
@ray-gardener ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels May 13, 2026
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label May 13, 2026
@edoakes edoakes enabled auto-merge (squash) May 13, 2026 21:00
@edoakes

edoakes commented May 13, 2026

Collaborator
Signed-off-by: Micah <micah@applied.co>
auto-merge was automatically disabled May 13, 2026 21:19

Head branch was pushed to by a user without write access

@micah-yong-ai micah-yong-ai force-pushed the micah/fix-blackwell-gpu-detection-upstream branch from 704624c to 3b8db4b Compare May 13, 2026 21:19
@edoakes edoakes merged commit 5d7bc7a into ray-project:master May 14, 2026
6 checks passed
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…ay-project#63322)

Signed-off-by: Micah <micah@applied.co>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
@andrewsykim andrewsykim added the backport-candidate (Label to identify PRs that should be considered for backport to older versions.) label May 29, 2026

Labels

  • backport-candidate: Label to identify PRs that should be considered for backport to older versions.
  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests

3 participants