Skip to content

[Data] Add ClusterUtil dataclass for ResourceUtilizationGauge#61436

Merged
bveeramani merged 3 commits into
masterfrom
add-cluster-util
Mar 4, 2026
Merged

[Data] Add ClusterUtil dataclass for ResourceUtilizationGauge#61436
bveeramani merged 3 commits into
masterfrom
add-cluster-util

Conversation

@bveeramani

Copy link
Copy Markdown
Member

ExecutionResources can't represent object store memory and memory utilization correctly because it rounds values to the nearest integer. So, the utilization for those resources is always either 0 or 1.

To ensure we can express the full [0, 1] range of resource utilization, this PR introduces a new ClusterUtil dataclass rather than re-using ExecutionResources.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner March 3, 2026 02:21
@bveeramani bveeramani requested a review from owenowenisme March 3, 2026 02:22

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new ClusterUtil dataclass to correctly handle floating-point resource utilization values, which is a good improvement. The implementation is sound. I've added a couple of suggestions to refactor the validation logic in the new dataclass to improve its maintainability by reducing code duplication.

@@ -1,12 +1,26 @@
import abc
import math
from dataclasses import dataclass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To support iterating over dataclass fields for validation in __post_init__, please also import fields here. This will be used in the refactoring of __post_init__.

Suggested change
from dataclasses import dataclass
from dataclasses import dataclass, fields
Comment on lines +18 to +23
def __post_init__(self):
# If we overcommit tasks, the logical utilization can exceed 1.0.
assert math.isfinite(self.cpu) and 0 <= self.cpu
assert math.isfinite(self.gpu) and 0 <= self.gpu
assert math.isfinite(self.memory) and 0 <= self.memory
assert math.isfinite(self.object_store_memory) and 0 <= self.object_store_memory

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve maintainability and reduce code duplication, you can iterate over the dataclass fields to perform the validation. This approach will automatically include any new fields added to the dataclass in the future, making the code more robust.

Suggested change
def __post_init__(self):
# If we overcommit tasks, the logical utilization can exceed 1.0.
assert math.isfinite(self.cpu) and 0 <= self.cpu
assert math.isfinite(self.gpu) and 0 <= self.gpu
assert math.isfinite(self.memory) and 0 <= self.memory
assert math.isfinite(self.object_store_memory) and 0 <= self.object_store_memory
def __post_init__(self):
# If we overcommit tasks, the logical utilization can exceed 1.0.
for f in fields(self):
value = getattr(self, f.name)
assert math.isfinite(value) and value >= 0, f"Invalid value for {f.name}: {value}"

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) March 3, 2026 02:29
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Mar 3, 2026
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Mar 3, 2026
@github-actions github-actions Bot disabled auto-merge March 4, 2026 08:18
@bveeramani bveeramani merged commit 941d4e0 into master Mar 4, 2026
6 checks passed
@bveeramani bveeramani deleted the add-cluster-util branch March 4, 2026 23:51
bveeramani pushed a commit that referenced this pull request Mar 10, 2026
## Description
LLM post-merge tests are failing with the following error:
<img width="1098" height="269" alt="Screenshot 2026-03-06 at 10 14
28 AM"
src="https://github.com/user-attachments/assets/cf4a23f9-b862-4b7a-8edc-2713c003dac2"
/>


https://github.com/ray-project/ray/blob/645d4f17a4c15b4a8875ccb0a64b178a7eceb358/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L129-L137
`get()` reads rolling averages from `TimeWindowAverageCalculator`, which
maintains a running _sum updated on `report()` and reduced in `_trim()`
as values expire. Due to floating-point rounding, `_sum` can drift
slightly negative after all values are trimmed, producing a negative
average and violating the `ClusterUtil` assertion.

## Approach
Clamp the rolling average to `max(0, ...)` in
`RollingLogicalUtilizationGauge.get()`, since utilization ratios are
inherently non-negative and small negative values are just
floating-point artifacts.

## Test
`python -m pytest
python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs`
passes after the fix.

## Related issues
- Postmerge failure:
https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
- Related PR: #61436

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
…y-project#61436)

`ExecutionResources` can't represent object store memory and memory
utilization correctly because it rounds values to the nearest integer.
So, the utilization for those resources is always either 0 or 1.

To ensure we can express the full [0, 1] range of resource utilization,
this PR introduces a new `ClusterUtil` dataclass rather than re-using
`ExecutionResources`.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
## Description
LLM post-merge tests are failing with the following error:
<img width="1098" height="269" alt="Screenshot 2026-03-06 at 10 14
28 AM"
src="https://github.com/user-attachments/assets/cf4a23f9-b862-4b7a-8edc-2713c003dac2"
/>

https://github.com/ray-project/ray/blob/645d4f17a4c15b4a8875ccb0a64b178a7eceb358/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L129-L137
`get()` reads rolling averages from `TimeWindowAverageCalculator`, which
maintains a running _sum updated on `report()` and reduced in `_trim()`
as values expire. Due to floating-point rounding, `_sum` can drift
slightly negative after all values are trimmed, producing a negative
average and violating the `ClusterUtil` assertion.

## Approach
Clamp the rolling average to `max(0, ...)` in
`RollingLogicalUtilizationGauge.get()`, since utilization ratios are
inherently non-negative and small negative values are just
floating-point artifacts.

## Test
`python -m pytest
python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs`
passes after the fix.

## Related issues
- Postmerge failure:
https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
- Related PR: ray-project#61436

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
abrarsheikh pushed a commit that referenced this pull request Mar 11, 2026
## Description
LLM post-merge tests are failing with the following error:
<img width="1098" height="269" alt="Screenshot 2026-03-06 at 10 14
28 AM"
src="https://github.com/user-attachments/assets/cf4a23f9-b862-4b7a-8edc-2713c003dac2"
/>


https://github.com/ray-project/ray/blob/645d4f17a4c15b4a8875ccb0a64b178a7eceb358/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L129-L137
`get()` reads rolling averages from `TimeWindowAverageCalculator`, which
maintains a running _sum updated on `report()` and reduced in `_trim()`
as values expire. Due to floating-point rounding, `_sum` can drift
slightly negative after all values are trimmed, producing a negative
average and violating the `ClusterUtil` assertion.

## Approach
Clamp the rolling average to `max(0, ...)` in
`RollingLogicalUtilizationGauge.get()`, since utilization ratios are
inherently non-negative and small negative values are just
floating-point artifacts.

## Test
`python -m pytest
python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs`
passes after the fix.

## Related issues
- Postmerge failure:
https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
- Related PR: #61436

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants