[data] Clamp rolling utilization averages to zero #61543
Merged
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
The pull request correctly addresses a floating-point precision issue by clamping utilization values to be non-negative, which prevents assertion failures. The change is correct. I have provided a suggestion to refactor the implementation slightly to improve readability and maintainability by reducing code duplication.
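The refactoring suggestion itself is not reproduced on this page. As a rough illustration of what clamping with less duplication could look like, here is a minimal sketch. The `Utilization` dataclass, its field names, and the helpers `clamp_non_negative` and `build_utilization` are hypothetical stand-ins invented for this example; they are not the actual `ClusterUtil` definition or Ray code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utilization:
    """Hypothetical stand-in for a utilization record; field names are assumed."""

    cpu: float
    gpu: float
    memory: float

    def __post_init__(self) -> None:
        # Mirrors the kind of non-negativity assertion the PR mentions.
        assert self.cpu >= 0 and self.gpu >= 0 and self.memory >= 0


def clamp_non_negative(value: float) -> float:
    """Treat tiny negative rounding artifacts as zero."""
    return max(0.0, value)


def build_utilization(averages: dict) -> Utilization:
    # One comprehension applies the clamp to every resource,
    # instead of repeating max(0, ...) per field.
    return Utilization(
        **{name: clamp_non_negative(avg) for name, avg in averages.items()}
    )


util = build_utilization({"cpu": -1e-17, "gpu": 0.5, "memory": 0.0})
```

Keeping the clamp in a single helper means the non-negativity rule lives in one place if more resource types are added later.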
Postmerge test failures are fixed by #61580. This PR is no longer needed.
bveeramani approved these changes on Mar 10, 2026
This was referenced Mar 10, 2026
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request on Mar 10, 2026
## Description

LLM post-merge tests are failing with the following error:

<img width="1098" height="269" alt="Screenshot 2026-03-06 at 10 14 28 AM" src="https://github.com/user-attachments/assets/cf4a23f9-b862-4b7a-8edc-2713c003dac2" />

https://github.com/ray-project/ray/blob/645d4f17a4c15b4a8875ccb0a64b178a7eceb358/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L129-L137

`get()` reads rolling averages from `TimeWindowAverageCalculator`, which maintains a running `_sum` updated on `report()` and reduced in `_trim()` as values expire. Due to floating-point rounding, `_sum` can drift slightly negative after all values are trimmed, producing a negative average and violating the `ClusterUtil` assertion.

## Approach

Clamp the rolling average to `max(0, ...)` in `RollingLogicalUtilizationGauge.get()`, since utilization ratios are inherently non-negative and small negative values are just floating-point artifacts.

## Test

`python -m pytest python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs` passes after the fix.

## Related issues

- Postmerge failure: https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
- Related PR: ray-project#61436

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
abrarsheikh pushed a commit that referenced this pull request on Mar 11, 2026
Description
LLM post-merge tests are failing with the following error:

ray/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py
Lines 129 to 137 in 645d4f1

`get()` reads rolling averages from `TimeWindowAverageCalculator`, which maintains a running `_sum` updated on `report()` and reduced in `_trim()` as values expire. Due to floating-point rounding, `_sum` can drift slightly negative after all values are trimmed, producing a negative average and violating the `ClusterUtil` assertion.

Approach
Clamp the rolling average to `max(0, ...)` in `RollingLogicalUtilizationGauge.get()`, since utilization ratios are inherently non-negative and small negative values are just floating-point artifacts.

Test
`python -m pytest python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs` passes after the fix.

Related issues
`ClusterUtil` dataclass for `ResourceUtilizationGauge` #61436

Additional information
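The drift described above can be reproduced with a small sketch. `RunningSumWindow` is a toy class written for this illustration, not Ray's `TimeWindowAverageCalculator`: it trims by count rather than by timestamp, and the reported values are deliberately exaggerated (far outside any real utilization ratio) so that the rounding error is deterministic and easy to see. In practice the error would be on the order of a few ULPs.

```python
from collections import deque


class RunningSumWindow:
    """Toy window average that keeps a running sum (assumed design, not Ray's code)."""

    def __init__(self) -> None:
        self._values = deque()
        self._sum = 0.0

    def report(self, value: float) -> None:
        self._values.append(value)
        self._sum += value

    def trim(self, n: int) -> None:
        # Expire the n oldest values, subtracting each from the running sum.
        for _ in range(n):
            self._sum -= self._values.popleft()

    def average(self) -> float:
        return self._sum / len(self._values) if self._values else 0.0


window = RunningSumWindow()
for value in (1e16, 1.0, 0.0):
    # 1e16 + 1.0 is not representable as a double and rounds back to 1e16,
    # so the 1.0 is silently lost from the running sum.
    window.report(value)

# Subtracting 1e16 and then 1.0 removes more than was actually added.
window.trim(2)

raw = window.average()      # negative, although every reported value was >= 0
clamped = max(0.0, raw)     # the clamp restores the non-negativity invariant
```

Clamping treats the symptom at the point of use. An alternative under the same assumptions would be to recompute the sum from the remaining values after trimming, at the cost of extra work per update.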