[core] Avoid FabricManager stall on NVLink systems in GpuProfilingManager#63312

Merged
edoakes merged 4 commits into ray-project:master from aschuh-hf:copilot/fix-gpu-profilling-manager on May 28, 2026

@aschuh-hf

Contributor

Fixes #63243

On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare nvidia-smi triggers a blocking RPC to the FabricManager daemon. If another process (e.g. dynolog) holds the NVML lock, this stalls 15–20 s—long enough to exceed the raylet's dashboard agent startup timeout and prevent ray.init() from succeeding.

Changes

  • node_has_gpus(): replace bare nvidia-smi with nvidia-smi --query-gpu=name --format=csv,noheader, which queries NVML directly (~0.1 s) without touching the FabricManager.
# Before
subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)

# After
subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    stderr=subprocess.DEVNULL,
)
  • enabled property: move binary-presence checks before node_has_gpus() so the GPU detection call is skipped entirely on nodes without dynolog installed.
# Before: node_has_gpus() always called first
return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None

# After: short-circuits before the GPU check when dynolog is absent
return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus()
  • Tests: added test_node_has_gpus_uses_query_gpu_flag (asserts exact nvidia-smi flags) and test_enabled_does_not_call_node_has_gpus_when_dynolog_missing (asserts short-circuit behavior).
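
The short-circuit behavior that the second test asserts can be illustrated in isolation. The following is a minimal sketch, not the PR's actual test: `ProfilingManagerSketch` is a hypothetical stand-in that copies only the `enabled` expression and the two attribute names from the description above, and `count_probe_calls` is an illustrative helper.

```python
from unittest import mock


class ProfilingManagerSketch:
    """Hypothetical stand-in for GpuProfilingManager (assumption, not Ray code)."""

    def __init__(self, dynolog_bin, dyno_bin):
        self._dynolog_bin = dynolog_bin
        self._dyno_bin = dyno_bin

    def node_has_gpus(self):
        # Placeholder for the nvidia-smi probe; always mocked in this sketch.
        raise RuntimeError("the real probe is not part of this sketch")

    @property
    def enabled(self):
        # Cheap attribute checks come first; `and` stops at the first
        # falsy operand, so the probe is skipped when a binary is missing.
        return (
            self._dynolog_bin is not None
            and self._dyno_bin is not None
            and self.node_has_gpus()
        )


def count_probe_calls(dynolog_bin, dyno_bin):
    """Evaluate `enabled` once and report (result, number of probe calls)."""
    manager = ProfilingManagerSketch(dynolog_bin, dyno_bin)
    with mock.patch.object(
        ProfilingManagerSketch, "node_has_gpus", return_value=True
    ) as probe:
        result = manager.enabled
    return result, probe.call_count
```

With both binaries missing, `count_probe_calls(None, None)` reports zero probe calls; with both paths set, the mocked probe is called exactly once.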
…led check

Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/3c45c1be-69c9-47ad-b79f-de26fcf1debc

Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
@aschuh-hf aschuh-hf requested a review from a team as a code owner May 12, 2026 22:41

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes GPU detection in the GpuProfilingManager by reordering the enabled property to short-circuit the GPU check when required binaries are missing and updating the nvidia-smi command with specific query flags to avoid stalls. Review feedback suggests adding a timeout to the subprocess.check_output call for better robustness against system hangs. Additionally, it was noted that the class constructor might still trigger the GPU check unconditionally, which could interfere with the intended optimization and cause test failures.

Comment thread python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Outdated
Comment thread python/ray/dashboard/modules/reporter/gpu_profile_manager.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9df1a13.

Comment thread python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Outdated
Copilot AI and others added 2 commits May 12, 2026 22:49
…node_has_gpus

Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/caeb8627-92ac-439e-a391-1e8432ce3f73

Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/13d3ab99-fcc1-4fd5-96de-e8766b8873c4

Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
@ray-gardener ray-gardener Bot added the core, observability, and community-contribution labels May 13, 2026
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale label May 27, 2026
@aschuh-hf

Contributor Author

I believe this PR is ready and can be merged.

@edoakes edoakes added the go label May 27, 2026
Comment thread python/ray/dashboard/modules/reporter/gpu_profile_manager.py Outdated
  try:
-     subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
+     subprocess.check_output(
+         ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],

Collaborator

why is it that these parameters skip the fabricmanager?

make sure to leave a comment here indicating why the "magic parameters" were chosen

@aschuh-hf aschuh-hf May 28, 2026

Contributor Author

Claude Code debugged this issue on our on-premise GPU server. It measured runtimes and inspected various system logs (including strace output) to determine that these parameters bypass the blocking communication with the FabricManager. I also ran "time nvidia-smi" myself with and without these arguments: the time difference was massive, and I saw no downsides.
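
A comparison like the one described above could be scripted roughly as follows. This is only a sketch: `time_command` is a hypothetical helper, the 30 s cap is an arbitrary assumption, and the two command lines are the ones quoted in this PR. No timings are claimed here; results depend on the machine.

```python
import subprocess
import time


def time_command(argv, timeout_s=30.0):
    """Return wall-clock seconds taken by argv, or None if it fails or times out."""
    start = time.perf_counter()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout.
        return None
    return time.perf_counter() - start


if __name__ == "__main__":
    for argv in (
        ["nvidia-smi"],
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    ):
        elapsed = time_command(argv)
        outcome = "failed" if elapsed is None else f"{elapsed:.2f} s"
        print(" ".join(argv) + ": " + outcome)
```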

@aschuh-hf aschuh-hf May 28, 2026

Contributor Author

The root cause analysis is documented in the linked issue #63243.

See also: https://share.google/aimode/cS0kazViPLaAK3fH9

Comment on lines +81 to +84
elif not self.node_has_gpus():
    logger.warning(
        "[GpuProfilingManager] No GPUs found on this node, GPU profiling will not be setup."
    )

Collaborator

we probably still want this log line, or it might even make sense to reverse these (if there are no GPUs, then why even check for dynolog?)

is there a dependency between the dynolog_bin and node_has_gpus? or are you just trying to skip the nvidia-smi call?

@aschuh-hf aschuh-hf May 28, 2026

Contributor Author

The idea is that when dynolog is not installed, the GpuProfilingManager is disabled anyway, regardless of whether a GPU is available. The GPU check is what may cause problems; the dynolog check does not. Hence the switched order.

When a user does not need the GpuProfilingManager, they can simply use a container image without this dependency installed and thus bypass the init overhead of the GpuProfilingManager completely. Only when they have installed and set up dynolog can we assume that they care about using the GpuProfilingManager.

Would you prefer a separate warning that alerts the user that dynolog_bin was not found, and that this is the primary reason why GPU profiling is disabled?

Contributor Author

Actually, the warning that dynolog is not available is already shown. So the user will always get a reason why GPU profiling is disabled:

  1. dynolog must be installed and found
  2. A CUDA device must be available
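
The two conditions above, checked in the order this PR settles on, can be sketched as a single function that returns the reason profiling is disabled. `gpu_profiling_disabled_reason` and its message strings are hypothetical and not taken from Ray's code; the GPU probe is passed as a callable so that it runs only when both binaries were found.

```python
def gpu_profiling_disabled_reason(dynolog_bin, dyno_bin, has_gpus):
    """Return a reason string, or None when GPU profiling can be enabled.

    dynolog_bin / dyno_bin: resolved binary paths, or None if not found.
    has_gpus: zero-argument callable wrapping the (possibly slow) GPU probe.
    """
    # Condition 1: cheap check, no subprocess involved.
    if dynolog_bin is None or dyno_bin is None:
        return "dynolog is not installed or was not found"
    # Condition 2: the probe is only reached when dynolog is present.
    if not has_gpus():
        return "no GPUs found on this node"
    return None
```

A caller might resolve the paths with `shutil.which("dynolog")` and `shutil.which("dyno")`; those binary names are an assumption here, inferred from the `_dynolog_bin` and `_dyno_bin` attribute names.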
@github-actions github-actions Bot added the unstale label and removed the stale label May 28, 2026
…econds

Increases the timeout for the nvidia-smi call in the node_has_gpus method
from 2 to 10 seconds to accommodate slower GPU detection on systems with
NVLink/NVSwitch configurations where the FabricManager RPC may take longer.
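
A bounded probe of this kind might look like the sketch below. It is not the merged implementation: `probe_succeeds` is a hypothetical name, and only the command line and the 10-second value come from this PR. `subprocess.check_output` raises `TimeoutExpired` once the limit passes, and the sketch treats that the same as a missing binary or a non-zero exit.

```python
import subprocess

# Command line quoted in this PR; kept as a tuple so the default is immutable.
NVIDIA_SMI_QUERY = ("nvidia-smi", "--query-gpu=name", "--format=csv,noheader")


def probe_succeeds(argv=NVIDIA_SMI_QUERY, timeout_s=10.0):
    """Return True only if argv exits with status 0 within timeout_s seconds."""
    try:
        subprocess.check_output(
            list(argv), stderr=subprocess.DEVNULL, timeout=timeout_s
        )
    except (OSError, subprocess.SubprocessError):
        # OSError: binary missing or not executable.
        # SubprocessError: covers CalledProcessError and TimeoutExpired.
        return False
    return True
```

The trade-off in picking the limit is that a short timeout can misreport a slow node as having no GPUs, while a long one delays startup on a node where the probe hangs.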
@edoakes edoakes merged commit 21cd062 into ray-project:master May 28, 2026
4 of 5 checks passed
@edoakes

edoakes commented May 28, 2026

Collaborator

Thanks for the contribution @aschuh-hf

edoakes added a commit that referenced this pull request May 29, 2026
Minor followup to: #63312

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…ager (ray-project#63312)

Fixes ray-project#63243

On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare `nvidia-smi` triggers a
blocking RPC to the FabricManager daemon. If another process (e.g.
`dynolog`) holds the NVML lock, this stalls 15–20 s—long enough to
exceed the raylet's dashboard agent startup timeout and prevent
`ray.init()` from succeeding.

## Changes

- **`node_has_gpus()`**: replace bare `nvidia-smi` with `nvidia-smi
--query-gpu=name --format=csv,noheader`, which queries NVML directly
(~0.1 s) without touching the FabricManager.

```python
# Before
subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)

# After
subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    stderr=subprocess.DEVNULL,
)
```

- **`enabled` property**: move binary-presence checks before
`node_has_gpus()` so the GPU detection call is skipped entirely on nodes
without `dynolog` installed.

```python
# Before: node_has_gpus() always called first
return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None

# After: short-circuits before the GPU check when dynolog is absent
return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus()
```

- **Tests**: added `test_node_has_gpus_uses_query_gpu_flag` (asserts
exact nvidia-smi flags) and
`test_enabled_does_not_call_node_has_gpus_when_dynolog_missing` (asserts
short-circuit behavior).

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
Minor followup to: ray-project#63312

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…ager (ray-project#63312)
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
Minor followup to: ray-project#63312
Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

3 participants