What happened + What you expected to happen
Description
k8s_utils.cpu_percent() raises ZeroDivisionError on every metrics poll cycle for any Ray node
registered with --num-cpus=0. The exception is caught and the function returns 0.0, so behavior is functionally
correct, but logger.exception() logs a full traceback on every call, spamming pod logs.
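The catch-and-log pattern described above can be sketched in isolation. This is a minimal illustration, not the actual Ray code: `cpu_percent_sketch`, the logger name, and the in-memory log buffer are all hypothetical stand-ins, and the quotient and CPU count are passed as plain arguments.

```python
import io
import logging

# Hypothetical logger that writes to an in-memory buffer so the output can be inspected.
logger = logging.getLogger("cpu_percent_sketch")
log_buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(log_buffer))
logger.propagate = False  # keep the sketch's records out of the root logger


def cpu_percent_sketch(quotient: float, num_cpus: int) -> float:
    """Hypothetical stand-in for the division and error handling in cpu_percent()."""
    try:
        return round(quotient * 100 / num_cpus, 1)
    except Exception:
        # logger.exception() logs at ERROR level and appends the full traceback.
        logger.exception("Error computing CPU usage of Ray Kubernetes pod.")
        return 0.0


# Three simulated poll cycles on a node that reports zero CPUs.
results = [cpu_percent_sketch(0.5, 0) for _ in range(3)]
traceback_count = log_buffer.getvalue().count("ZeroDivisionError")
print(results, traceback_count)
```

Each call returns 0.0 as intended, yet every call also writes one traceback to the log, which is the source of the spam on a periodic poll.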
Reproduction
Deploy a KubeRay cluster with num-cpus: "0" on the head node (the documented best practice):
# raycluster.yaml — from Ray's own docs recommendation
headGroupSpec:
  rayStartParams:
    num-cpus: "0" # Recommended to prevent tasks scheduling on head
Any pod with num-cpus: "0" will produce repeated errors in the dashboard agent logs:
ERROR ray.dashboard.k8s_utils - Error computing CPU usage of Ray Kubernetes pod.
Traceback (most recent call last):
  File ".../ray/dashboard/k8s_utils.py", line 49, in cpu_percent
    cpu_percent = round(quotient * 100 / get_num_cpus(), 1)
ZeroDivisionError: division by zero
Root cause
In python/ray/dashboard/k8s_utils.py:
cpu_percent = round(quotient * 100 / get_num_cpus(), 1) # ← ZeroDivisionError when get_num_cpus() == 0
Interestingly, this exact scenario is already guarded against elsewhere, in
reporter_agent.py (the same file that calls k8s_utils):
# _get_load_avg() in reporter_agent.py — correctly guarded ✅
if self._cpu_counts[0] > 0:
    per_cpu_load = tuple(...)
else:
    per_cpu_load = None
The guard was simply missed in k8s_utils.cpu_percent().
Impact
Affects every KubeRay deployment that follows the documented num-cpus: "0" recommendation
(head nodes, zero-CPU specialized worker groups). The Ray docs explicitly recommend this pattern:
"setting num-cpus:'0' for the Ray head pod will prevent Ray workloads from being scheduled on the head"
Fix
One-line change in python/ray/dashboard/k8s_utils.py:
# Before
cpu_percent = round(quotient * 100 / get_num_cpus(), 1)
# After
cpu_percent = round(quotient * 100 / (get_num_cpus() or 1), 1)
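The effect of the `or 1` fallback can be checked without Ray installed. The following is a minimal sketch under stated assumptions: `compute_cpu_percent` and `compute_cpu_percent_unguarded` are hypothetical helpers that take the quotient and CPU count as plain arguments, rather than the real function in k8s_utils.py.

```python
def compute_cpu_percent_unguarded(quotient: float, num_cpus: int) -> float:
    """Hypothetical stand-in for the current division (no guard)."""
    return round(quotient * 100 / num_cpus, 1)


def compute_cpu_percent(quotient: float, num_cpus: int) -> float:
    """Hypothetical stand-in for the proposed one-line fix."""
    # `num_cpus or 1` evaluates to 1 when num_cpus is 0, so the division
    # can no longer raise ZeroDivisionError.
    return round(quotient * 100 / (num_cpus or 1), 1)


# Non-zero CPU counts give the same result with or without the guard.
same_for_nonzero = compute_cpu_percent(0.5, 4) == compute_cpu_percent_unguarded(0.5, 4)

# With zero CPUs the guarded version returns a value instead of raising.
guarded_zero = compute_cpu_percent(0.5, 0)

try:
    compute_cpu_percent_unguarded(0.5, 0)
    unguarded_raises = False
except ZeroDivisionError:
    unguarded_raises = True

print(same_for_nonzero, guarded_zero, unguarded_raises)
```

One design consequence worth noting: with the fallback, a zero-CPU node reports usage as if it had one CPU. If reporting 0.0 for such nodes is preferred, an explicit `if num_cpus > 0` check (as in the reporter_agent.py guard) would be an alternative.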
Versions / Dependencies
Ray 2.44.0
Reproduction script
No K8s cluster needed — the bug can be triggered locally by simulating the two conditions:
get_num_cpus() returning 0 (as it does on --num-cpus=0 nodes) and cgroup CPU stats being
readable (mocked here).
"""
Minimal reproduction of k8s_utils.cpu_percent() ZeroDivisionError.
Simulates a Ray node started with --num-cpus=0 inside a K8s pod.
Run with: pip install ray && python repro.py
"""
from unittest.mock import patch

import ray.dashboard.k8s_utils as k8s_utils
from ray._private import utils as ray_utils

# Simulate two successive cgroup reads so cpu_delta > 0 (triggers the division)
cpu_reads = iter([1_000_000_000, 2_000_000_000])
sys_reads = iter([10_000_000_000, 20_000_000_000])

with (
    patch.object(k8s_utils, "_cpu_usage", side_effect=cpu_reads),
    patch.object(k8s_utils, "_system_usage", side_effect=sys_reads),
    patch.object(k8s_utils, "_host_num_cpus", return_value=8),
):
    # Prime last_* so the second call computes a real delta
    k8s_utils.cpu_percent()

    # Simulate --num-cpus=0: get_num_cpus() returns 0
    with patch.object(ray_utils, "get_num_cpus", return_value=0):
        result = k8s_utils.cpu_percent()

print(f"Result: {result}")  # returns 0.0 (exception caught)
# Check logs: should show a ZeroDivisionError traceback
Expected output (stderr):
ERROR ray.dashboard.k8s_utils - Error computing CPU usage of Ray Kubernetes pod.
Traceback (most recent call last):
  File ".../k8s_utils.py", line 49, in cpu_percent
    cpu_percent = round(quotient * 100 / get_num_cpus(), 1)
ZeroDivisionError: division by zero
Issue Severity
Medium: It is a significant difficulty but I can work around it.