[data][metrics] Add metric for task block locality#62249
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces new metrics to track and differentiate task performance based on cache hits and misses. It updates RunningTaskInfo to store input node locations and determines cache hit status by checking if the first output block is co-located with any input blocks. New metrics include scheduling time, input bytes, completion time, and task counts, categorized by hit or miss. The review feedback suggests making the newly added average scheduling time properties public rather than internal to ensure consistency with the underlying raw metrics and other existing performance properties. I have no further feedback to provide.
| @metric_property( | ||
| description="Average scheduling time (s) for cache-hit tasks.", | ||
| metrics_group=MetricsGroup.TASKS, | ||
| internal_only=True, | ||
| ) |
There was a problem hiding this comment.
The property average_task_scheduling_time_cache_hit_s is marked as internal_only=True, but the underlying raw metrics (like task_scheduling_time_cache_hit_s and num_tasks_cache_hit) are public. This is inconsistent with other public metrics like average_task_scheduling_time_s. Since these averages are valuable for users to understand scheduling performance and locality impact, they should likely be public.
@metric_property(
description="Average scheduling time (s) for cache-hit tasks.",
metrics_group=MetricsGroup.TASKS,
)| @metric_property( | ||
| description="Average scheduling time (s) for cache-miss tasks.", | ||
| metrics_group=MetricsGroup.TASKS, | ||
| internal_only=True, | ||
| ) |
There was a problem hiding this comment.
| task_id: ray.TaskID | ||
| input_node_ids: Set[str] = field(default_factory=set) | ||
| last_updated: float = field(init=False, default_factory=lambda: time.perf_counter()) | ||
| is_cache_hit: Optional[bool] = field(init=False, default=None) |
There was a problem hiding this comment.
Can you expand on what cache hit refers to?
There was a problem hiding this comment.
Can we call this preserved_locality instead?
| task_scheduling_time_cache_hit_s: float = metric_field( | ||
| default=0, | ||
| description="Cumulative task scheduling time (s) for cache-hit tasks.", | ||
| metrics_group=MetricsGroup.TASKS, | ||
| ) | ||
| task_scheduling_time_cache_miss_s: float = metric_field( | ||
| default=0, | ||
| description="Cumulative task scheduling time (s) for cache-miss tasks.", |
There was a problem hiding this comment.
Can we organize this into CacheHit and CacheMiss metrics?
There was a problem hiding this comment.
i was thinking about that, but to keep it consistent with all metrics, I decided to leave it as is. Later, I'll add groupings so that it's easier to understand
| first_output_node_id is not None | ||
| and first_output_node_id != NODE_UNKNOWN | ||
| and first_output_node_id in task_info.input_node_ids |
There was a problem hiding this comment.
Just to clarify, this can happen only if node is dead or restarting?
There was a problem hiding this comment.
Or is there another reason the output's node id can be unknown?
There was a problem hiding this comment.
I'm not quite sure, maybe in synthetic data? It's more of a defensive guard. Here is where the NODE_UNKNOWNS occur https://github.com/iamjustinhsu/ray/blob/59bbe7e1bb40c8a41042359f12200c47c24de1a4/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L186
There was a problem hiding this comment.
Hmm so if that block was never executed but somehow has metadata, and has no node attached to it...
But since the class is frozen, it should never be edited after creation.
There was a problem hiding this comment.
it shouldn't be, but since the type annotations suggests that it can be None, i would rather be defensive because we launch ray data tasks in many areas. I can follow up and check for areas to see if it can be None
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
## Description As titled, would like to track how default core scheduling is, and how much it impacts performance. Tested that this works with actors and regular tasks In a future PR, would like to add a panel ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

Description
As titled, would like to track how default core scheduling is, and how much it impacts performance. Tested that this works with actors and regular tasks
In a future PR, would like to add a panel
Related issues
Additional information