[core] Avoid FabricManager stall on NVLink systems in GpuProfilingManager by aschuh-hf · Pull Request #63312 · ray-project/ray

aschuh-hf · 2026-05-12T22:41:03Z

On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare nvidia-smi triggers a blocking RPC to the FabricManager daemon. If another process (e.g. dynolog) holds the NVML lock, this stalls 15–20 s—long enough to exceed the raylet's dashboard agent startup timeout and prevent ray.init() from succeeding.

Changes

node_has_gpus(): replace bare nvidia-smi with nvidia-smi --query-gpu=name --format=csv,noheader, which queries NVML directly (~0.1 s) without touching the FabricManager.

# Before
subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)

# After
subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    stderr=subprocess.DEVNULL,
)

enabled property: move binary-presence checks before node_has_gpus() so the GPU detection call is skipped entirely on nodes without dynolog installed.

# Before: node_has_gpus() always called first
return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None

# After: short-circuits before the GPU check when dynolog is absent
return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus()

Tests: added test_node_has_gpus_uses_query_gpu_flag (asserts exact nvidia-smi flags) and test_enabled_does_not_call_node_has_gpus_when_dynolog_missing (asserts short-circuit behavior).
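The two tests named above are not shown on this page. As a rough sketch of what such checks could look like, the block below uses `unittest.mock` against a hypothetical stand-in class, `FakeGpuProfilingManager`. This is not the real Ray class; the helper names, constructor arguments, and binary paths are assumptions made for illustration only.

```python
import subprocess
from unittest import mock


class FakeGpuProfilingManager:
    """Hypothetical stand-in that mirrors only the logic quoted in this PR."""

    def __init__(self, dynolog_bin, dyno_bin):
        self._dynolog_bin = dynolog_bin
        self._dyno_bin = dyno_bin

    def node_has_gpus(self):
        try:
            subprocess.check_output(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                stderr=subprocess.DEVNULL,
            )
            return True
        except (OSError, subprocess.SubprocessError):
            # nvidia-smi is missing or exited with an error.
            return False

    @property
    def enabled(self):
        # Cheap binary-presence checks first, GPU probe last.
        return (
            self._dynolog_bin is not None
            and self._dyno_bin is not None
            and self.node_has_gpus()
        )


def check_query_gpu_flags():
    # Placeholder paths; nothing is executed because check_output is patched.
    manager = FakeGpuProfilingManager("/usr/bin/dynolog", "/usr/bin/dyno")
    with mock.patch("subprocess.check_output", return_value=b"GPU\n") as fake:
        assert manager.node_has_gpus() is True
    fake.assert_called_once_with(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stderr=subprocess.DEVNULL,
    )


def check_short_circuit():
    manager = FakeGpuProfilingManager(None, None)
    with mock.patch.object(FakeGpuProfilingManager, "node_has_gpus") as fake:
        assert manager.enabled is False
    # The GPU probe must never run when dynolog is absent.
    fake.assert_not_called()
```

Patching `subprocess.check_output` keeps the check independent of whether `nvidia-smi` exists on the test machine.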


gemini-code-assist

Code Review

This pull request optimizes GPU detection in the GpuProfilingManager by reordering the enabled property to short-circuit the GPU check when required binaries are missing and updating the nvidia-smi command with specific query flags to avoid stalls. Review feedback suggests adding a timeout to the subprocess.check_output call for better robustness against system hangs. Additionally, it was noted that the class constructor might still trigger the GPU check unconditionally, which could interfere with the intended optimization and cause test failures.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9df1a13.


github-actions · 2026-05-27T13:34:17Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

aschuh-hf · 2026-05-27T14:27:25Z

I believe this PR is ready and can be merged.

edoakes · 2026-05-27T19:01:46Z

        try:
-            subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
+            subprocess.check_output(
+                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],


why is it that these parameters skip the fabricmanager?

make sure to leave a comment here indicating why the "magic parameters" were chosen

Claude Code debugged this issue on our on-premise GPU server. It measured runtimes and inspected various system logs (including strace output) to determine that these parameters bypass the blocking communication with the FabricManager. I also ran "time nvidia-smi" with and without these arguments myself: the call was far slower without them, and I saw no downsides to adding them.

The root cause analysis is documented in the linked issue #63243.

See also: https://share.google/aimode/cS0kazViPLaAK3fH9
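The manual "time nvidia-smi" comparison described in this reply could be scripted roughly as follows. This is only a sketch: `time_command` and `compare_nvidia_smi` are hypothetical helper names, the 30-second cap is an arbitrary assumption, and the measured numbers depend entirely on the machine.

```python
import subprocess
import time
from typing import Dict, List, Optional


def time_command(cmd: List[str], timeout_s: float = 30.0) -> Optional[float]:
    """Return the wall-clock seconds cmd took, or None if it could not run or timed out."""
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,  # a non-zero exit code still yields a timing
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return time.perf_counter() - start


def compare_nvidia_smi() -> Dict[str, Optional[float]]:
    """Time the bare call against the query form used in this PR."""
    return {
        "bare": time_command(["nvidia-smi"]),
        "query": time_command(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
        ),
    }
```

Calling `compare_nvidia_smi()` on a GPU node returns both timings. On a machine without `nvidia-smi`, both values are `None`.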

edoakes · 2026-05-27T19:02:36Z

+        elif not self.node_has_gpus():
+            logger.warning(
+                "[GpuProfilingManager] No GPUs found on this node, GPU profiling will not be setup."
+            )


we probably still want this log line, or it might even make sense to reverse these (if there are no GPUs, then why even check for dynolog?)

is there a dependency between the dynolog_bin and node_has_gpus? or are you just trying to skip the nvidia-smi call?

The idea is that when the user does not have dynolog installed, the GpuProfilingManager is disabled anyway, regardless of whether a GPU is available. The check for whether the node has a GPU is what may cause problems. The check for whether dynolog is installed does not. Hence the switched order.

When a user does not need the GpuProfilingManager, they can simply use a container image without this dependency installed and thus completely bypass the init overhead of the GpuProfilingManager. Only when they have installed and set up dynolog may we assume they care about using the GpuProfilingManager.

Would you prefer that a separate warning alerts the user that dynolog_bin was not found, and that this is the primary reason why the GPU profiling is disabled?

Actually, the warning that dynolog is not available is already shown. So the user will always get a reason why GPU profiling is disabled:

- dynolog must be installed and found
- A CUDA device must be available
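To make the ordering argument concrete, here is a minimal sketch of how the two conditions could map to a single reported reason. `disabled_reason` is a hypothetical helper, not a function in Ray, and the message strings simply reuse the wording of the two conditions above rather than the actual log text.

```python
from typing import Callable, Optional


def disabled_reason(
    dynolog_bin: Optional[str],
    dyno_bin: Optional[str],
    node_has_gpus: Callable[[], bool],
) -> Optional[str]:
    """Return why GPU profiling is disabled, or None if it can be enabled."""
    # Cheap check first: no subprocess is spawned when dynolog is missing.
    if dynolog_bin is None or dyno_bin is None:
        return "dynolog must be installed and found"
    # Potentially slow check last: this is the call that may run nvidia-smi.
    if not node_has_gpus():
        return "A CUDA device must be available"
    return None
```

With this order, a node without dynolog gets its warning without the GPU probe ever running, which is the behavior argued for in the reply above.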

…econds Increases the timeout for the nvidia-smi call in the node_has_gpus method from 2 to 10 seconds to accommodate slower GPU detection on systems with NVLink/NVSwitch configurations where the FabricManager RPC may take longer.
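A bounded version of the GPU probe might look like the sketch below. It assumes the 10-second value from the commit message above. The `run` parameter and the warning text are illustrative additions so the function can be exercised without a GPU; the actual Ray implementation may differ.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def node_has_gpus(timeout_s: float = 10.0, run=subprocess.check_output) -> bool:
    """Return True if the nvidia-smi query succeeds within timeout_s seconds."""
    try:
        run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        # The probe hung; report it instead of blocking startup indefinitely.
        logger.warning("nvidia-smi did not respond within %.1f s", timeout_s)
        return False
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is not installed, or it exited with a non-zero status.
        return False
```

`subprocess.check_output` accepts `timeout` and raises `subprocess.TimeoutExpired` after killing the child process, so a stalled probe costs at most roughly `timeout_s` seconds.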

edoakes · 2026-05-28T22:50:43Z

Thanks for the contribution @aschuh-hf

Minor followup to: #63312 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…ager (ray-project#63312)

Fixes ray-project#63243

On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare `nvidia-smi` triggers a blocking RPC to the FabricManager daemon. If another process (e.g. `dynolog`) holds the NVML lock, this stalls 15–20 s, long enough to exceed the raylet's dashboard agent startup timeout and prevent `ray.init()` from succeeding.

## Changes

- **`node_has_gpus()`**: replace bare `nvidia-smi` with `nvidia-smi --query-gpu=name --format=csv,noheader`, which queries NVML directly (~0.1 s) without touching the FabricManager.

```python
# Before
subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)

# After
subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    stderr=subprocess.DEVNULL,
)
```

- **`enabled` property**: move binary-presence checks before `node_has_gpus()` so the GPU detection call is skipped entirely on nodes without `dynolog` installed.

```python
# Before: node_has_gpus() always called first
return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None

# After: short-circuits before the GPU check when dynolog is absent
return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus()
```

- **Tests**: added `test_node_has_gpus_uses_query_gpu_flag` (asserts exact nvidia-smi flags) and `test_enabled_does_not_call_node_has_gpus_when_dynolog_missing` (asserts short-circuit behavior).

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>


fix(GpuProfilingManager): use --query-gpu flag and short-circuit enab…

9df1a13

…led check Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/3c45c1be-69c9-47ad-b79f-de26fcf1debc Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>

aschuh-hf requested a review from a team as a code owner May 12, 2026 22:41

gemini-code-assist Bot reviewed May 12, 2026


Comment thread python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Outdated

Comment thread python/ray/dashboard/modules/reporter/gpu_profile_manager.py

cursor Bot reviewed May 12, 2026


Comment thread python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Outdated

Copilot AI and others added 2 commits May 12, 2026 22:49

fix(GpuProfilingManager): reorder __init__ checks and add timeout to …

bb6e4eb

…node_has_gpus Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/caeb8627-92ac-439e-a391-1e8432ce3f73 Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>

style: format test file with black

3662aa4

Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/13d3ab99-fcc1-4fd5-96de-e8766b8873c4 Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>

ray-gardener Bot added the core, observability, and community-contribution labels May 13, 2026

github-actions Bot added the stale label May 27, 2026

edoakes added the go label May 27, 2026

edoakes reviewed May 27, 2026


github-actions Bot added the unstale label and removed the stale label May 28, 2026

edoakes merged commit 21cd062 into ray-project:master May 28, 2026
4 of 5 checks passed

edoakes mentioned this pull request May 28, 2026

Add warning log when GPU profiling command times out #63706

Merged

edoakes merged 4 commits into ray-project:master from aschuh-hf:copilot/fix-gpu-profilling-manager