Skip to content

[data][metrics] Add metric for task block locality#62249

Merged
goutamvenkat-anyscale merged 7 commits into
ray-project:masterfrom
iamjustinhsu:jhsu/add-task-block-locality-metrics
Apr 2, 2026
Merged

[data][metrics] Add metric for task block locality#62249
goutamvenkat-anyscale merged 7 commits into
ray-project:masterfrom
iamjustinhsu:jhsu/add-task-block-locality-metrics

Conversation

@iamjustinhsu

@iamjustinhsu iamjustinhsu commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

Description

As titled, would like to track how default core scheduling is, and how much it impacts performance. Tested that this works with actors and regular tasks

In a future PR, would like to add a panel

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner April 1, 2026 01:57

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new metrics to track and differentiate task performance based on cache hits and misses. It updates RunningTaskInfo to store input node locations and determines cache hit status by checking if the first output block is co-located with any input blocks. New metrics include scheduling time, input bytes, completion time, and task counts, categorized by hit or miss. The review feedback suggests making the newly added average scheduling time properties public rather than internal to ensure consistency with the underlying raw metrics and other existing performance properties. I have no further feedback to provide.

Comment on lines +694 to +698
@metric_property(
description="Average scheduling time (s) for cache-hit tasks.",
metrics_group=MetricsGroup.TASKS,
internal_only=True,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The property average_task_scheduling_time_cache_hit_s is marked as internal_only=True, but the underlying raw metrics (like task_scheduling_time_cache_hit_s and num_tasks_cache_hit) are public. This is inconsistent with other public metrics like average_task_scheduling_time_s. Since these averages are valuable for users to understand scheduling performance and locality impact, they should likely be public.

    @metric_property(
        description="Average scheduling time (s) for cache-hit tasks.",
        metrics_group=MetricsGroup.TASKS,
    )
Comment on lines +704 to +708
@metric_property(
description="Average scheduling time (s) for cache-miss tasks.",
metrics_group=MetricsGroup.TASKS,
internal_only=True,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the cache-hit property, average_task_scheduling_time_cache_miss_s should be public for consistency with the raw metrics and other scheduling averages.

    @metric_property(
        description="Average scheduling time (s) for cache-miss tasks.",
        metrics_group=MetricsGroup.TASKS,
    )
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Apr 1, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Comment thread python/ray/data/tests/test_stats.py Outdated
task_id: ray.TaskID
input_node_ids: Set[str] = field(default_factory=set)
last_updated: float = field(init=False, default_factory=lambda: time.perf_counter())
is_cache_hit: Optional[bool] = field(init=False, default=None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on what cache hit refers to?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this preserved_locality instead?

Comment on lines +387 to +394
task_scheduling_time_cache_hit_s: float = metric_field(
default=0,
description="Cumulative task scheduling time (s) for cache-hit tasks.",
metrics_group=MetricsGroup.TASKS,
)
task_scheduling_time_cache_miss_s: float = metric_field(
default=0,
description="Cumulative task scheduling time (s) for cache-miss tasks.",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we organize this into CacheHit and CacheMiss metrics?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was thinking about that, but to keep it consistent with all metrics, I decided to leave it as is. Later, I'll add groupings so that it's easier to understand

Comment on lines +1088 to +1090
first_output_node_id is not None
and first_output_node_id != NODE_UNKNOWN
and first_output_node_id in task_info.input_node_ids

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, this can happen only if node is dead or restarting?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is there another reason the output's node id can be unknown?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure, maybe in synthetic data? It's more of a defensive guard. Here is where the NODE_UNKNOWNS occur https://github.com/iamjustinhsu/ray/blob/59bbe7e1bb40c8a41042359f12200c47c24de1a4/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L186

@goutamvenkat-anyscale goutamvenkat-anyscale Apr 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm so if that block was never executed but somehow has metadata, and has no node attached to it...

But since the class is frozen, it should never be edited after creation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it shouldn't be, but since the type annotations suggests that it can be None, i would rather be defensive because we launch ray data tasks in many areas. I can follow up and check for areas to see if it can be None

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Apr 2, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 20744ec into ray-project:master Apr 2, 2026
6 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/add-task-block-locality-metrics branch April 2, 2026 21:57
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
## Description
As titled, would like to track how default core scheduling is, and how
much it impacts performance. Tested that this works with actors and
regular tasks

In a future PR, would like to add a panel

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants