Skip to content

[Data] Fixed scheduling metrics computation#62031

Merged
alexeykudinkin merged 3 commits into
masterfrom
ak/op-mcs-fix
Mar 25, 2026
Merged

[Data] Fixed scheduling metrics computation#62031
alexeykudinkin merged 3 commits into
masterfrom
ak/op-mcs-fix

Conversation

@alexeykudinkin

Copy link
Copy Markdown
Contributor

Description

Previously average_task_scheduling_time_s was computed incorrectly:

  • Cumulative scheduling duration was incremented whenever task would get first output
  • But denominator would only increase only when the task would finish

Meaning that for the time while the task is running after first output is produced, but before it finishes the metric will be inflated, incorrectly.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner March 24, 2026 20:32
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) March 24, 2026 20:35
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Mar 24, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the op_runtime_metrics.py file to improve the accuracy of task scheduling and output backpressure metrics. Key changes include correcting the calculation of average_task_scheduling_time_s to use num_tasks_have_outputs instead of num_tasks_finished, and refactoring the on_task_output_generated method to correctly track the first output generation and associated timing. Minor updates include a clarification in a metric description and a formatting adjustment. There is no feedback to provide.

return self.task_scheduling_time_s / self.num_tasks_finished
# NOTE: For correct calculation, we must use `num_tasks_have_outputs`, since
# scheduling time is incremented upon receiving of the first output
return self.task_scheduling_time_s / self.num_tasks_have_outputs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

Just want to ensure we're measuring the above.

  1. Should we not divide by num_tasks_submitted instead of num_tasks_have_outputs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check how task_scheduling_time_s is incremented -- we increment it with the first output, hence denominator

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@github-actions github-actions Bot disabled auto-merge March 24, 2026 23:05
@ray-gardener ray-gardener Bot added data Ray Data-related issues observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Mar 25, 2026
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) March 25, 2026 02:30
@alexeykudinkin alexeykudinkin merged commit 9858202 into master Mar 25, 2026
9 checks passed
@alexeykudinkin alexeykudinkin deleted the ak/op-mcs-fix branch March 25, 2026 18:18
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
## Description

Previously `average_task_scheduling_time_s` was computed incorrectly: 

- Cumulative scheduling duration was incremented whenever task would get
first output
  - But denominator would only increase only when the task would finish

Meaning that for the time while the task is running after first output
is produced, but before it finishes the metric will be inflated,
incorrectly.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

3 participants