[data][metrics] Add metric for task block locality by iamjustinhsu · Pull Request #62249 · ray-project/ray

iamjustinhsu · 2026-04-01T01:57:44Z

Description

As titled, I would like to track how well default core scheduling preserves block locality, and how much that impacts performance. Tested that this works with both actors and regular tasks.

In a future PR, I would like to add a panel for these metrics.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces new metrics to track and differentiate task performance based on cache hits and misses. It updates RunningTaskInfo to store input node locations and determines cache hit status by checking if the first output block is co-located with any input blocks. New metrics include scheduling time, input bytes, completion time, and task counts, categorized by hit or miss. The review feedback suggests making the newly added average scheduling time properties public rather than internal to ensure consistency with the underlying raw metrics and other existing performance properties. I have no further feedback to provide.
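To make the hit/miss split concrete, here is a minimal sketch of how per-category counters could be accumulated. `LocalityMetrics` and `record` are hypothetical names for illustration only, not the Ray Data `OpRuntimeMetrics` API. The field names follow the ones quoted in this review (`num_tasks_cache_miss` is assumed by analogy with `num_tasks_cache_hit`), and the names in the merged code may differ after the later "rename" commit.

```python
from dataclasses import dataclass


@dataclass
class LocalityMetrics:
    """Hypothetical accumulator; not the Ray Data OpRuntimeMetrics class."""

    num_tasks_cache_hit: int = 0
    num_tasks_cache_miss: int = 0
    task_scheduling_time_cache_hit_s: float = 0.0
    task_scheduling_time_cache_miss_s: float = 0.0

    def record(self, is_cache_hit: bool, scheduling_time_s: float) -> None:
        # Route each finished task's sample to the hit or the miss counters.
        if is_cache_hit:
            self.num_tasks_cache_hit += 1
            self.task_scheduling_time_cache_hit_s += scheduling_time_s
        else:
            self.num_tasks_cache_miss += 1
            self.task_scheduling_time_cache_miss_s += scheduling_time_s


metrics = LocalityMetrics()
metrics.record(is_cache_hit=True, scheduling_time_s=0.5)
metrics.record(is_cache_hit=False, scheduling_time_s=2.0)
metrics.record(is_cache_hit=True, scheduling_time_s=1.5)
```

Keeping cumulative sums and counts as separate raw fields means averages can be derived later without storing per-task samples.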

gemini-code-assist · 2026-04-01T01:59:46Z

+    @metric_property(
+        description="Average scheduling time (s) for cache-hit tasks.",
+        metrics_group=MetricsGroup.TASKS,
+        internal_only=True,
+    )


The property average_task_scheduling_time_cache_hit_s is marked as internal_only=True, but the underlying raw metrics (like task_scheduling_time_cache_hit_s and num_tasks_cache_hit) are public. This is inconsistent with other public metrics like average_task_scheduling_time_s. Since these averages are valuable for users to understand scheduling performance and locality impact, they should likely be public.

    @metric_property(
        description="Average scheduling time (s) for cache-hit tasks.",
        metrics_group=MetricsGroup.TASKS,
    )
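For context, an average of this kind is typically derived from a cumulative time and a task count. The helper below is a hypothetical sketch, not code from the PR. In particular, returning `0.0` when no tasks have finished is an assumption, since the thread does not show how the real property handles a zero count.

```python
def average_scheduling_time_s(cumulative_time_s: float, num_tasks: int) -> float:
    """Hypothetical helper: mean scheduling time per task.

    Returns 0.0 when no tasks have been recorded (an assumed convention).
    """
    if num_tasks <= 0:
        return 0.0
    return cumulative_time_s / num_tasks


# Example: 3.0 s of cumulative scheduling time spread over 2 cache-hit tasks.
avg_hit = average_scheduling_time_s(3.0, 2)
avg_empty = average_scheduling_time_s(0.0, 0)
```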

gemini-code-assist · 2026-04-01T01:59:46Z

+    @metric_property(
+        description="Average scheduling time (s) for cache-miss tasks.",
+        metrics_group=MetricsGroup.TASKS,
+        internal_only=True,
+    )


Similar to the cache-hit property, average_task_scheduling_time_cache_miss_s should be public for consistency with the raw metrics and other scheduling averages.

    @metric_property(
        description="Average scheduling time (s) for cache-miss tasks.",
        metrics_group=MetricsGroup.TASKS,
    )

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

goutamvenkat-anyscale · 2026-04-01T20:11:16Z

    task_id: ray.TaskID
+    input_node_ids: Set[str] = field(default_factory=set)
    last_updated: float = field(init=False, default_factory=lambda: time.perf_counter())
+    is_cache_hit: Optional[bool] = field(init=False, default=None)


Can you expand on what cache hit refers to?

Can we call this preserved_locality instead?

goutamvenkat-anyscale · 2026-04-01T20:12:46Z

+    task_scheduling_time_cache_hit_s: float = metric_field(
+        default=0,
+        description="Cumulative task scheduling time (s) for cache-hit tasks.",
+        metrics_group=MetricsGroup.TASKS,
+    )
+    task_scheduling_time_cache_miss_s: float = metric_field(
+        default=0,
+        description="Cumulative task scheduling time (s) for cache-miss tasks.",


Can we organize this into CacheHit and CacheMiss metrics?

I was thinking about that, but to stay consistent with the other metrics, I decided to leave it as is. Later, I'll add groupings so that it's easier to understand.

goutamvenkat-anyscale · 2026-04-01T20:17:17Z

+                first_output_node_id is not None
+                and first_output_node_id != NODE_UNKNOWN
+                and first_output_node_id in task_info.input_node_ids


Just to clarify, this can happen only if node is dead or restarting?

Or is there another reason the output's node id can be unknown?

I'm not quite sure, maybe in synthetic data? It's more of a defensive guard. Here is where NODE_UNKNOWN occurs: https://github.com/iamjustinhsu/ray/blob/59bbe7e1bb40c8a41042359f12200c47c24de1a4/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L186

Hmm so if that block was never executed but somehow has metadata, and has no node attached to it...

But since the class is frozen, it should never be edited after creation.

It shouldn't be, but since the type annotation suggests that it can be None, I would rather be defensive because we launch Ray Data tasks in many areas. I can follow up and check whether there are areas where it can be None.
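The guard discussed in this thread can be sketched as a standalone predicate. The function name `preserved_locality` borrows the reviewer's suggested term and is hypothetical. The `NODE_UNKNOWN` value below is a placeholder, since the real sentinel is defined in `op_runtime_metrics.py` and its value is not shown here.

```python
from typing import Optional, Set

# Placeholder sentinel for illustration; the real NODE_UNKNOWN constant
# is defined in op_runtime_metrics.py and may have a different value.
NODE_UNKNOWN = "unknown"


def preserved_locality(
    first_output_node_id: Optional[str], input_node_ids: Set[str]
) -> bool:
    """Return True only if the first output block landed on an input node."""
    return (
        # The type annotation allows None, so guard against it.
        first_output_node_id is not None
        # An unknown location never counts as a hit, even if an input
        # block's location was also recorded as unknown.
        and first_output_node_id != NODE_UNKNOWN
        and first_output_node_id in input_node_ids
    )


hit = preserved_locality("node-a", {"node-a", "node-b"})
miss = preserved_locality("node-c", {"node-a", "node-b"})
```

The explicit `NODE_UNKNOWN` check only changes the result when the input set itself contains the sentinel. Without it, two unknown locations would be counted as co-located.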

[data][metrics] Add metric for task block locality

cb67abc

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu requested a review from a team as a code owner April 1, 2026 01:57

gemini-code-assist Bot reviewed Apr 1, 2026

ray-gardener Bot added the "data" label (Ray Data-related issues) Apr 1, 2026

fix test

59bbe7e

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor Bot reviewed Apr 1, 2026

Comment thread python/ray/data/tests/test_stats.py Outdated

goutamvenkat-anyscale reviewed Apr 1, 2026

add comment

1c7525a

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

goutamvenkat-anyscale approved these changes Apr 2, 2026

iamjustinhsu added 2 commits April 2, 2026 09:11

rename

b3aaa8d

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

fix test

e976945

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Apr 2, 2026

iamjustinhsu added 2 commits April 2, 2026 09:29

fix test?

e5c9cf3

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

fix test?

47c4be2

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

goutamvenkat-anyscale merged commit 20744ec into ray-project:master Apr 2, 2026
6 checks passed

iamjustinhsu deleted the jhsu/add-task-block-locality-metrics branch April 2, 2026 21:57

xinyuangui2 mentioned this pull request May 20, 2026

[Data] Reducing StreamingExecutor scheduling-loop overhead #63544

Open
