
[data] Clamp rolling utilization averages to zero #61543

Merged: bveeramani merged 1 commit into master from fix-llm-post-merge on Mar 10, 2026


@jeffreywang88 jeffreywang88 commented Mar 6, 2026


Description

LLM post-merge tests are failing with the following error:
(Screenshot of the error, captured 2026-03-06 at 10:14:28 AM; image not reproduced here.)

def get(self) -> ClusterUtil:
    """Get the average cluster utilization based on global usage / global limits."""
    return ClusterUtil(
        cpu=self._cluster_cpu_util_calculator.get_average() or 0,
        gpu=self._cluster_gpu_util_calculator.get_average() or 0,
        memory=self._cluster_mem_util_calculator.get_average() or 0,
        object_store_memory=self._cluster_obj_mem_util_calculator.get_average()
        or 0,
    )

get() reads rolling averages from TimeWindowAverageCalculator, which maintains a running _sum updated on report() and reduced in _trim() as values expire. Due to floating-point rounding, _sum can drift slightly negative after all values are trimmed, producing a negative average and violating the ClusterUtil assertion.
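The drift described above can be reproduced with a small stand-alone sketch. `RunningSumWindow` is a hypothetical class written for illustration only; it is not Ray's `TimeWindowAverageCalculator`, and the explicit `now` arguments are an assumption made so the example is deterministic. It only mirrors the pattern of adding to a running sum on report and subtracting expired values on trim.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RunningSumWindow:
    """Hypothetical time-window average that maintains a running sum."""

    def __init__(self, window_s: float) -> None:
        self._window_s = window_s
        self._values: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._sum = 0.0

    def report(self, value: float, now: float) -> None:
        self._values.append((now, value))
        self._sum += value
        self._trim(now)

    def _trim(self, now: float) -> None:
        # Drop expired samples and subtract them from the running sum.
        # The subtraction does not exactly undo the rounding of the additions.
        while self._values and now - self._values[0][0] > self._window_s:
            _, old = self._values.popleft()
            self._sum -= old

    def get_average(self, now: float) -> Optional[float]:
        self._trim(now)
        if not self._values:
            return None
        return self._sum / len(self._values)


calc = RunningSumWindow(window_s=10.0)
calc.report(0.7, now=0.0)
calc.report(0.1, now=1.0)
calc.report(0.0, now=20.0)  # the first two samples have expired by now
avg = calc.get_average(now=20.0)
print(avg)  # a tiny negative number, although every reported value was >= 0
```

Here `0.7 + 0.1` rounds down to `0.7999999999999999`, so subtracting the same two values afterwards leaves a residue just below zero, and the only sample still in the window is `0.0`.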

Approach

Clamp the rolling average to max(0, ...) in RollingLogicalUtilizationGauge.get(), since utilization ratios are inherently non-negative and small negative values are just floating-point artifacts.
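A minimal sketch of the clamping idea, assuming the calculator may return `None` when its window is empty (which the existing `or 0` fallback suggests). The helper name `clamp_non_negative` is hypothetical and is not taken from the PR's diff.

```python
from typing import Optional


def clamp_non_negative(average: Optional[float]) -> float:
    """Treat a missing average as 0 and clamp rounding residue below 0 up to 0."""
    return max(0.0, average or 0.0)


print(clamp_non_negative(None))      # empty window
print(clamp_non_negative(-2.8e-17))  # floating-point residue
print(clamp_non_negative(0.42))      # ordinary utilization passes through
```

Routing each of the four fields through one helper like this would also avoid repeating the `max(0, ... or 0)` expression in `get()`.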

Test

python -m pytest python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs passes after the fix.

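Related issues

Postmerge failure: https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
Related PR: #61436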


Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 requested a review from a team as a code owner March 6, 2026 18:20
@jeffreywang88 jeffreywang88 requested a review from bveeramani March 6, 2026 18:20

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request correctly addresses a floating-point precision issue by clamping utilization values to be non-negative, which prevents assertion failures. The change is correct. I have provided a suggestion to refactor the implementation slightly to improve readability and maintainability by reducing code duplication.

@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Mar 6, 2026
@jeffreywang88 jeffreywang88 added the go add ONLY when ready to merge, run all tests label Mar 6, 2026
@jeffreywang88


Postmerge test failures are fixed by #61580. This PR is no longer needed.

@bveeramani bveeramani merged commit b69f225 into master Mar 10, 2026
6 checks passed
@bveeramani bveeramani deleted the fix-llm-post-merge branch March 10, 2026 17:14
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
## Description
LLM post-merge tests are failing with the following error:
<img width="1098" height="269" alt="Screenshot 2026-03-06 at 10 14
28 AM"
src="https://github.com/user-attachments/assets/cf4a23f9-b862-4b7a-8edc-2713c003dac2"
/>

https://github.com/ray-project/ray/blob/645d4f17a4c15b4a8875ccb0a64b178a7eceb358/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L129-L137
`get()` reads rolling averages from `TimeWindowAverageCalculator`, which
maintains a running _sum updated on `report()` and reduced in `_trim()`
as values expire. Due to floating-point rounding, `_sum` can drift
slightly negative after all values are trimmed, producing a negative
average and violating the `ClusterUtil` assertion.

## Approach
Clamp the rolling average to `max(0, ...)` in
`RollingLogicalUtilizationGauge.get()`, since utilization ratios are
inherently non-negative and small negative values are just
floating-point artifacts.

## Test
`python -m pytest
python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs`
passes after the fix.

## Related issues
- Postmerge failure:
https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output
- Related PR: ray-project#61436

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
abrarsheikh pushed a commit that referenced this pull request Mar 11, 2026

Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

2 participants