[dashboard] k8s_utils.cpu_percent() crashes with ZeroDivisionError on nodes registered with num-cpus=0 #63729

@Zion-Webiks

What happened + What you expected to happen

Description

k8s_utils.cpu_percent() raises ZeroDivisionError on every metrics poll cycle for any Ray node
registered with --num-cpus=0. The exception is caught and the function returns 0.0, so the reported
value is acceptable, but logger.exception() logs a full traceback on every call, spamming pod logs.

Reproduction

Deploy a KubeRay cluster with num-cpus: "0" on the head node (the documented best practice):

# raycluster.yaml — from Ray's own docs recommendation
headGroupSpec:
  rayStartParams:
    num-cpus: "0"   # Recommended to prevent tasks scheduling on head

Any pod with num-cpus: "0" will produce repeated errors in the dashboard agent logs:

ERROR ray.dashboard.k8s_utils - Error computing CPU usage of Ray Kubernetes pod.
Traceback (most recent call last):
  File ".../ray/dashboard/k8s_utils.py", line 49, in cpu_percent
    cpu_percent = round(quotient * 100 / get_num_cpus(), 1)
ZeroDivisionError: division by zero

Root cause

In python/ray/dashboard/k8s_utils.py:

cpu_percent = round(quotient * 100 / get_num_cpus(), 1)  # ← ZeroDivisionError when get_num_cpus() == 0

Notably, this exact scenario is already guarded against elsewhere in the dashboard code, in
reporter_agent.py (the module that calls k8s_utils):

# _get_load_avg() in reporter_agent.py — correctly guarded ✅
if self._cpu_counts[0] > 0:
    per_cpu_load = tuple(...)
else:
    per_cpu_load = None

The guard was simply missed in k8s_utils.cpu_percent().
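
To illustrate why this becomes log spam rather than a crash, here is a minimal sketch that runs without Ray. The function `cpu_percent_unguarded`, the logger name, and the `ListHandler` class are hypothetical stand-ins, not code from k8s_utils.py. The sketch only mirrors the pattern described above: a division wrapped in a try/except that calls logger.exception() and returns 0.0.

```python
import logging

logger = logging.getLogger("k8s_utils_sketch")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be counted."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def cpu_percent_unguarded(quotient, num_cpus):
    """Stand-in for the failing expression, wrapped as the issue describes."""
    try:
        return round(quotient * 100 / num_cpus, 1)
    except Exception:
        # logger.exception() logs at ERROR level and attaches the traceback.
        logger.exception("Error computing CPU usage of Ray Kubernetes pod.")
        return 0.0


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo from also printing to stderr

# Three simulated poll cycles on a node registered with zero CPUs.
results = [cpu_percent_unguarded(0.8, 0) for _ in range(3)]

print(results)               # [0.0, 0.0, 0.0]
print(len(handler.records))  # 3
```

Every call returns the fallback value, and every call also emits one ERROR record carrying a ZeroDivisionError traceback, which is what accumulates in the pod logs.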

Impact

Affects every KubeRay deployment that follows the documented num-cpus: "0" recommendation
(head nodes, zero-CPU specialized worker groups). The Ray docs explicitly recommend this pattern:

"setting num-cpus:'0' for the Ray head pod will prevent Ray workloads from being scheduled on the head"

Fix

One-line change in python/ray/dashboard/k8s_utils.py:

# Before
cpu_percent = round(quotient * 100 / get_num_cpus(), 1)

# After
cpu_percent = round(quotient * 100 / (get_num_cpus() or 1), 1)
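
For comparison, the sketch below places the `or 1` fallback next to an explicit guard in the style of the reporter_agent.py snippet quoted earlier. Both functions are hypothetical stand-ins that take `quotient` and `num_cpus` as plain arguments. They are not the actual k8s_utils code, and the sample quotient of 0.8 is an arbitrary illustrative value.

```python
def cpu_percent_with_fallback(quotient, num_cpus):
    """The one-line fix proposed here: treat 0 CPUs as 1 for the division."""
    return round(quotient * 100 / (num_cpus or 1), 1)


def cpu_percent_with_guard(quotient, num_cpus):
    """Alternative: skip the computation when no CPUs are registered."""
    if num_cpus > 0:
        return round(quotient * 100 / num_cpus, 1)
    return 0.0


# Compare both variants for a zero-CPU node and two ordinary nodes.
for n in (0, 1, 4):
    print(n, cpu_percent_with_fallback(0.8, n), cpu_percent_with_guard(0.8, n))
```

Neither variant raises, so neither logs a traceback. They differ only for num_cpus == 0: the fallback reports usage as if the node had one CPU (80.0 here), while the guard reports 0.0, matching what the current code returns after catching the exception. Which value is more useful on the dashboard is a judgment call for the maintainers.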

Versions / Dependencies

Ray 2.44.0

Reproduction script

No K8s cluster needed — the bug can be triggered locally by simulating the two conditions:
get_num_cpus() returning 0 (as it does on --num-cpus=0 nodes) and cgroup CPU stats being
readable (mocked here).

"""
Minimal reproduction of k8s_utils.cpu_percent() ZeroDivisionError.
Simulates a Ray node started with --num-cpus=0 inside a K8s pod.

Run with:  pip install ray && python repro.py
"""
from unittest.mock import patch

import ray.dashboard.k8s_utils as k8s_utils

# Simulate two successive cgroup reads so cpu_delta > 0 (triggers the division)
cpu_reads = iter([1_000_000_000, 2_000_000_000])
sys_reads  = iter([10_000_000_000, 20_000_000_000])

with (
    patch.object(k8s_utils, "_cpu_usage",    side_effect=cpu_reads),
    patch.object(k8s_utils, "_system_usage", side_effect=sys_reads),
    patch.object(k8s_utils, "_host_num_cpus", return_value=8),
):
    # Prime last_* so the second call computes a real delta
    k8s_utils.cpu_percent()

    # Simulate --num-cpus=0: get_num_cpus() returns 0
    from ray._private import utils as ray_utils
    with patch.object(ray_utils, "get_num_cpus", return_value=0):
        result = k8s_utils.cpu_percent()
        print(f"Result: {result}")  # returns 0.0 (exception caught)
        # Check logs — should show ZeroDivisionError traceback

Expected output (stderr):

ERROR ray.dashboard.k8s_utils - Error computing CPU usage of Ray Kubernetes pod.
Traceback (most recent call last):
  File ".../k8s_utils.py", line 49, in cpu_percent
    cpu_percent = round(quotient * 100 / get_num_cpus(), 1)
ZeroDivisionError: division by zero

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Labels: bug, community-backlog, core, dashboard, kubernetes, observability, stability, triage
Status: Done