[data] Clamp rolling utilization averages to zero by jeffreywang88 · Pull Request #61543 · ray-project/ray

jeffreywang88 · 2026-03-06T18:20:10Z

Description

LLM post-merge tests are failing with the following error:

ray/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py

Lines 129 to 137 in 645d4f1

    
    def get(self) -> ClusterUtil:
        """Get the average cluster utilization based on global usage / global limits."""
        return ClusterUtil(
            cpu=self._cluster_cpu_util_calculator.get_average() or 0,
            gpu=self._cluster_gpu_util_calculator.get_average() or 0,
            memory=self._cluster_mem_util_calculator.get_average() or 0,
            object_store_memory=self._cluster_obj_mem_util_calculator.get_average()
            or 0,
        )

get() reads rolling averages from TimeWindowAverageCalculator, which maintains a running _sum updated on report() and reduced in _trim() as values expire. Due to floating-point rounding, _sum can drift slightly negative after all values are trimmed, producing a negative average and violating the ClusterUtil assertion.
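The drift described above can be reproduced with a minimal sketch. `RunningSumWindowAverage` is a hypothetical stand-in written for illustration, not Ray's `TimeWindowAverageCalculator`; its explicit `now` arguments and window length are assumptions made so the example is deterministic. It only mirrors the described pattern: a running `_sum` that grows in `report()` and shrinks in `_trim()`.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RunningSumWindowAverage:
    """Hypothetical time-window average backed by a running sum (not Ray's class)."""

    def __init__(self, window_s: float) -> None:
        self._window_s = window_s
        self._values: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._sum = 0.0

    def report(self, value: float, now: float) -> None:
        # Add the new sample to the running sum, then drop expired samples.
        self._values.append((now, value))
        self._sum += value
        self._trim(now)

    def get_average(self, now: float) -> Optional[float]:
        self._trim(now)
        if not self._values:
            return None
        return self._sum / len(self._values)

    def _trim(self, now: float) -> None:
        # Subtract each expired sample; rounding error from earlier additions
        # is not undone by these subtractions, so residue can remain.
        cutoff = now - self._window_s
        while self._values and self._values[0][0] < cutoff:
            _, expired = self._values.popleft()
            self._sum -= expired


calc = RunningSumWindowAverage(window_s=10.0)
for t, v in [(0.0, 0.3), (1.0, 0.2), (2.0, 0.1)]:
    calc.report(v, now=t)

# A later zero-utilization sample expires the three earlier samples.
calc.report(0.0, now=20.0)
drifted_average = calc.get_average(now=20.0)
print(drifted_average)  # a tiny negative number instead of 0.0
```

Every reported value here is non-negative, yet the only sample left in the window is `0.0` and the residual sum is slightly below zero, so the reported average is negative.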

Approach

Clamp the rolling average to max(0, ...) in RollingLogicalUtilizationGauge.get(), since utilization ratios are inherently non-negative and small negative values are just floating-point artifacts.
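As a rough sketch of the clamping idea, assuming a `ClusterUtil`-like type that asserts non-negative fields: `ClusterUtilSketch` and `clamp_average` below are hypothetical names used for illustration and are not taken from the actual diff. A single helper also avoids repeating the `max(0, ... or 0)` expression for each resource.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterUtilSketch:
    """Hypothetical stand-in for ClusterUtil with a non-negativity check."""

    cpu: float
    gpu: float
    memory: float
    object_store_memory: float

    def __post_init__(self) -> None:
        for name in ("cpu", "gpu", "memory", "object_store_memory"):
            assert getattr(self, name) >= 0, f"{name} must be non-negative"


def clamp_average(average: Optional[float]) -> float:
    """Map a missing average to 0 and clamp negative rounding residue to 0."""
    return max(0.0, average or 0.0)


util = ClusterUtilSketch(
    cpu=clamp_average(-2.7755575615628914e-17),  # floating-point residue
    gpu=clamp_average(None),  # no samples in the window
    memory=clamp_average(0.42),  # ordinary value passes through unchanged
    object_store_memory=clamp_average(0.0),
)
print(util)
```

Clamping at the read site hides the residue without changing how the running sum is maintained, so the underlying drift in the calculator is still present.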

Test

python -m pytest python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py -vs passes after the fix.

Related issues

Postmerge failure: https://buildkite.com/ray-project/postmerge/builds/16343/steps/canvas?sid=019cc1c1-e7c2-4a1c-9a86-a44211b14e66&tab=output

Related PR: #61436

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

gemini-code-assist

Code Review

The pull request correctly addresses a floating-point precision issue by clamping utilization values to be non-negative, which prevents assertion failures. The change is correct. I have provided a suggestion to refactor the implementation slightly to improve readability and maintainability by reducing code duplication.

jeffreywang88 · 2026-03-09T18:41:04Z

Postmerge test failures are fixed by #61580. This PR is no longer needed.

Clamp rolling utilization averages to zero to fix assertion error

1ec81f1

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 requested a review from a team as a code owner March 6, 2026 18:20

jeffreywang88 requested a review from bveeramani March 6, 2026 18:20

gemini-code-assist Bot reviewed Mar 6, 2026

Comment thread python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py

ray-gardener Bot added the data label (Ray Data-related issues) Mar 6, 2026

jeffreywang88 added the go label (add ONLY when ready to merge, run all tests) Mar 6, 2026

jeffreywang88 closed this Mar 9, 2026

jeffreywang88 reopened this Mar 10, 2026

bveeramani approved these changes Mar 10, 2026

bveeramani merged commit b69f225 into master Mar 10, 2026
6 checks passed

bveeramani deleted the fix-llm-post-merge branch March 10, 2026 17:14

This was referenced Mar 10, 2026

Release test text_embeddings_benchmark_autoscaling_preemptible failed anyscale/ray#921

Closed

Release test text_embeddings_benchmark_fixed_size_preemptible failed anyscale/ray#924

Closed

owenowenisme mentioned this pull request Mar 12, 2026

[Data] Fix floating point error from average calculator #61685

Closed
