[core] Avoid FabricManager stall on NVLink systems in GpuProfilingManager#63312
Conversation
…led check Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/3c45c1be-69c9-47ad-b79f-de26fcf1debc Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request optimizes GPU detection in the GpuProfilingManager by reordering the enabled property to short-circuit the GPU check when required binaries are missing and updating the nvidia-smi command with specific query flags to avoid stalls. Review feedback suggests adding a timeout to the subprocess.check_output call for better robustness against system hangs. Additionally, it was noted that the class constructor might still trigger the GPU check unconditionally, which could interfere with the intended optimization and cause test failures.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9df1a13. Configure here.
…node_has_gpus Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/caeb8627-92ac-439e-a391-1e8432ce3f73 Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
Agent-Logs-Url: https://github.com/aschuh-hf/ray/sessions/13d3ab99-fcc1-4fd5-96de-e8766b8873c4 Co-authored-by: aschuh-hf <77496589+aschuh-hf@users.noreply.github.com>
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
|
I believe this PR is ready and can be merged. |
| try: | ||
| subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL) | ||
| subprocess.check_output( | ||
| ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"], |
There was a problem hiding this comment.
why is it that these parameters skip the fabricmanager?
make sure to leave a comment here indicating why the "magic parameters" were chosen
There was a problem hiding this comment.
Claude Code worked on debugging this issue on our on-premise GPU server. It measured runtime and looked at various system logs [including strace] to determine that these parameters by-pass the blocking communication with the fabric manager. I myself ran "time nvidia-smi" with and without these arguments, and the time difference was massive without any downsides.
There was a problem hiding this comment.
The root cause analysis is documented in the linked issue #63243.
| elif not self.node_has_gpus(): | ||
| logger.warning( | ||
| "[GpuProfilingManager] No GPUs found on this node, GPU profiling will not be setup." | ||
| ) |
There was a problem hiding this comment.
we probably still want this log line, or it might even make sense to reverse these (if there are no GPUs, then why even check for dynolog?)
is there a dependency between the dynolog_bin and node_has_gpus? or are you just trying to skip the nvidia-smi call?
There was a problem hiding this comment.
The idea is that when the user has dynolog not installed, the GpuProfilingManager is anyway disabled, regardless if there is a GPU available. The check if the node has a GPU is what may cause problems. The check if dynolog is installed does not. Hence the switched order.
When a user does not need the GpuProfilingManager, they can simply use a container image without this dependency installed and thus completely bypass the init overhead of the GpuProfilingManager. When they did install and set up dynolog, only then we may assume they care about using the GpuProfilingManager.
Would you prefer that a separate warning alerts the user that dynolog_bin was not found, and that this is the primary reason why the GPU profiling is disabled?
There was a problem hiding this comment.
Actually, the warning that dynolog is not available is already shown. So the user will always get a reason why GPU profiling is disabled:
- dynolog must be installed and found
- A CUDA device must be available
…econds Increases the timeout for the nvidia-smi call in the node_has_gpus method from 2 to 10 seconds to accommodate slower GPU detection on systems with NVLink/NVSwitch configurations where the FabricManager RPC may take longer.
|
Thanks for the contribution @aschuh-hf |
Minor followup to: #63312 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ager (ray-project#63312) Fixes ray-project#63243 On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare `nvidia-smi` triggers a blocking RPC to the FabricManager daemon. If another process (e.g. `dynolog`) holds the NVML lock, this stalls 15–20 s—long enough to exceed the raylet's dashboard agent startup timeout and prevent `ray.init()` from succeeding. ## Changes - **`node_has_gpus()`**: replace bare `nvidia-smi` with `nvidia-smi --query-gpu=name --format=csv,noheader`, which queries NVML directly (~0.1 s) without touching the FabricManager. ```python # Before subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL) # After subprocess.check_output( ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"], stderr=subprocess.DEVNULL, ) ``` - **`enabled` property**: move binary-presence checks before `node_has_gpus()` so the GPU detection call is skipped entirely on nodes without `dynolog` installed. ```python # Before: node_has_gpus() always called first return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None # After: short-circuits before the GPU check when dynolog is absent return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus() ``` - **Tests**: added `test_node_has_gpus_uses_query_gpu_flag` (asserts exact nvidia-smi flags) and `test_enabled_does_not_call_node_has_gpus_when_dynolog_missing` (asserts short-circuit behavior). --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Minor followup to: ray-project#63312 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ager (ray-project#63312) Fixes ray-project#63243 On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare `nvidia-smi` triggers a blocking RPC to the FabricManager daemon. If another process (e.g. `dynolog`) holds the NVML lock, this stalls 15–20 s—long enough to exceed the raylet's dashboard agent startup timeout and prevent `ray.init()` from succeeding. ## Changes - **`node_has_gpus()`**: replace bare `nvidia-smi` with `nvidia-smi --query-gpu=name --format=csv,noheader`, which queries NVML directly (~0.1 s) without touching the FabricManager. ```python # Before subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL) # After subprocess.check_output( ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"], stderr=subprocess.DEVNULL, ) ``` - **`enabled` property**: move binary-presence checks before `node_has_gpus()` so the GPU detection call is skipped entirely on nodes without `dynolog` installed. ```python # Before: node_has_gpus() always called first return self.node_has_gpus() and self._dynolog_bin is not None and self._dyno_bin is not None # After: short-circuits before the GPU check when dynolog is absent return self._dynolog_bin is not None and self._dyno_bin is not None and self.node_has_gpus() ``` - **Tests**: added `test_node_has_gpus_uses_query_gpu_flag` (asserts exact nvidia-smi flags) and `test_enabled_does_not_call_node_has_gpus_when_dynolog_missing` (asserts short-circuit behavior). --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Minor followup to: ray-project#63312 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Fixes #63243
On NVLink/NVSwitch nodes (e.g. A100 SXM4), bare
nvidia-smitriggers a blocking RPC to the FabricManager daemon. If another process (e.g.dynolog) holds the NVML lock, this stalls 15–20 s—long enough to exceed the raylet's dashboard agent startup timeout and preventray.init()from succeeding.Changes
node_has_gpus(): replace barenvidia-smiwithnvidia-smi --query-gpu=name --format=csv,noheader, which queries NVML directly (~0.1 s) without touching the FabricManager.enabledproperty: move binary-presence checks beforenode_has_gpus()so the GPU detection call is skipped entirely on nodes withoutdynologinstalled.test_node_has_gpus_uses_query_gpu_flag(asserts exact nvidia-smi flags) andtest_enabled_does_not_call_node_has_gpus_when_dynolog_missing(asserts short-circuit behavior).