[Data] Add `ClusterUtil` dataclass for `ResourceUtilizationGauge` by bveeramani · Pull Request #61436 · ray-project/ray

bveeramani · 2026-03-03T02:21:55Z

`ExecutionResources` can't represent object store memory and memory utilization correctly because it rounds values to the nearest integer. So, the utilization for those resources is always either 0 or 1.

To ensure we can express the full [0, 1] range of resource utilization, this PR introduces a new `ClusterUtil` dataclass rather than re-using `ExecutionResources`.
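
To illustrate the motivation, here is a minimal sketch. It uses Python's built-in `round` as a stand-in for the integer rounding described above (the actual rounding inside `ExecutionResources` is not shown in this PR), and `ClusterUtilSketch` is a hypothetical, simplified version of the new dataclass that only assumes the four field names visible in the diff.

```python
from dataclasses import dataclass


def rounded_utilization(used: float, total: float) -> int:
    """Stand-in for a container that rounds ratios to the nearest integer."""
    return round(used / total)


@dataclass
class ClusterUtilSketch:
    """Hypothetical simplified ClusterUtil: plain floats, no rounding."""

    cpu: float
    gpu: float
    memory: float
    object_store_memory: float


# 3 GiB used out of 8 GiB is 37.5% utilization.
used, total = 3 * 1024**3, 8 * 1024**3

# With integer rounding, the ratio collapses to 0 or 1.
coarse = rounded_utilization(used, total)

# With float fields, the fractional value survives.
util = ClusterUtilSketch(
    cpu=0.5, gpu=0.0, memory=used / total, object_store_memory=0.9
)
```

With rounding, a memory utilization of 37.5% is indistinguishable from an idle cluster, which is why a float-valued container is needed.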

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

gemini-code-assist

Code Review

This pull request introduces a new ClusterUtil dataclass to correctly handle floating-point resource utilization values, which is a good improvement. The implementation is sound. I've added a couple of suggestions to refactor the validation logic in the new dataclass to improve its maintainability by reducing code duplication.

gemini-code-assist · 2026-03-03T02:23:01Z

@@ -1,12 +1,26 @@
 import abc
+import math
+from dataclasses import dataclass


To support iterating over dataclass fields for validation in __post_init__, please also import fields here. This will be used in the refactoring of __post_init__.

Suggested change

-from dataclasses import dataclass
+from dataclasses import dataclass, fields

gemini-code-assist · 2026-03-03T02:23:01Z

+    def __post_init__(self):
+        # If we overcommit tasks, the logical utilization can exceed 1.0.
+        assert math.isfinite(self.cpu) and 0 <= self.cpu
+        assert math.isfinite(self.gpu) and 0 <= self.gpu
+        assert math.isfinite(self.memory) and 0 <= self.memory
+        assert math.isfinite(self.object_store_memory) and 0 <= self.object_store_memory


To improve maintainability and reduce code duplication, you can iterate over the dataclass fields to perform the validation. This approach will automatically include any new fields added to the dataclass in the future, making the code more robust.

Suggested change

-    def __post_init__(self):
-        # If we overcommit tasks, the logical utilization can exceed 1.0.
-        assert math.isfinite(self.cpu) and 0 <= self.cpu
-        assert math.isfinite(self.gpu) and 0 <= self.gpu
-        assert math.isfinite(self.memory) and 0 <= self.memory
-        assert math.isfinite(self.object_store_memory) and 0 <= self.object_store_memory
+    def __post_init__(self):
+        # If we overcommit tasks, the logical utilization can exceed 1.0.
+        for f in fields(self):
+            value = getattr(self, f.name)
+            assert math.isfinite(value) and value >= 0, f"Invalid value for {f.name}: {value}"
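
For reference, the suggested loop can be tried in isolation. The sketch below is an assumption-laden stand-in, not the merged code: `ClusterUtilSketch` and the `is_valid` helper are hypothetical names, and only the four field names and the finiteness and non-negativity checks come from the diff above.

```python
import math
from dataclasses import dataclass, fields


@dataclass
class ClusterUtilSketch:
    """Hypothetical stand-in for ClusterUtil with field-driven validation."""

    cpu: float
    gpu: float
    memory: float
    object_store_memory: float

    def __post_init__(self):
        # Logical utilization may exceed 1.0 under overcommit, so only
        # finiteness and non-negativity are checked (no upper bound).
        for f in fields(self):
            value = getattr(self, f.name)
            assert math.isfinite(value) and value >= 0, (
                f"Invalid value for {f.name}: {value}"
            )


def is_valid(**kwargs) -> bool:
    """Return True if the sketch accepts the given field values."""
    try:
        ClusterUtilSketch(**kwargs)
        return True
    except AssertionError:
        return False
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so this style of check is a development-time guard rather than enforced input validation.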

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

## Description

LLM post-merge tests are failing with the following error:

[Screenshot of the error, 2026-03-06 10:14 AM]

https://github.com/ray-project/ray/blob/645d4f17a4c15b4a8875ccb0a64b178a7eceb358/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L129-L137

`get()` reads rolling averages from `TimeWindowAverageCalculator`, which maintains a running `_sum` updated on `report()` and reduced in `_trim()` as values expire. Due to floating-point rounding, `_sum` can drift slightly negative after all values are trimmed, producing a negative average and violating the `ClusterUtil` assertion.

## Approach

Clamp the rolling average to `max(0, ...)` in `RollingLogicalUtilizationGauge.get()`, since utilization ratios are inherently non-negative and small negative values are just floating-point artifacts.

## Test

`python -m pytest python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs` passes after the fix.

## Related issues

- Postmerge failure: https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
- Related PR: #61436

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
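
The clamping approach described above can be sketched as follows. This is a toy model under stated assumptions: the running sum below only mimics the add-on-report, subtract-on-expiry pattern and is not the code of `TimeWindowAverageCalculator`, and `clamp_non_negative` is a hypothetical helper name standing in for the `max(0, ...)` call.

```python
import random


def clamp_non_negative(value: float) -> float:
    """Treat tiny negative values from floating-point drift as zero."""
    return max(0.0, value)


# Toy running sum: add samples as they are reported, then subtract the
# same samples as they expire. This mimics the described pattern only.
rng = random.Random(0)
samples = [rng.random() for _ in range(1000)]

running_sum = 0.0
for s in samples:
    running_sum += s
for s in samples:
    running_sum -= s

# Mathematically the result is zero, but rounding can leave a small
# residue of either sign.
residue = running_sum

# Clamping guarantees the value passed on is never negative.
safe = clamp_non_negative(residue)
```

Clamping at the point of reading keeps the non-negativity assertion in `ClusterUtil` strict while tolerating residues that come purely from floating-point arithmetic.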



Initial commit

b121442

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani requested a review from a team as a code owner March 3, 2026 02:21

bveeramani requested a review from owenowenisme March 3, 2026 02:22

owenowenisme approved these changes Mar 3, 2026


gemini-code-assist Bot reviewed Mar 3, 2026


cursor Bot reviewed Mar 3, 2026


Comment thread python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py

Address review comments

43cbede

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani enabled auto-merge (squash) March 3, 2026 02:29

github-actions Bot added the `go` label (add ONLY when ready to merge, run all tests) Mar 3, 2026

ray-gardener Bot added the `data` label (Ray Data-related issues) Mar 3, 2026

Merge branch 'master' into add-cluster-util

7393170

github-actions Bot disabled auto-merge March 4, 2026 08:18

bveeramani merged commit 941d4e0 into master Mar 4, 2026
6 checks passed

bveeramani deleted the add-cluster-util branch March 4, 2026 23:51

jeffreywang88 mentioned this pull request Mar 6, 2026

[data] Clamp rolling utilization averages to zero #61543

Merged

bveeramani merged 3 commits into master from add-cluster-util
