[data] Show external consumer bytes in verbose operator progress log #63728

Merged
justinvyu merged 1 commit into ray-project:master from justinvyu:justinvyu/data-show-external-consumer-bytes
May 29, 2026
@justinvyu

Description

When iterating with ds.iter_batches() or consuming a streaming_split() shard, the consumer's prefetched bytes are charged to the terminal operator's output ("out") memory via set_external_consumer_bytes. Today the progress log shows only the combined number, without distinguishing the operator's own queues from what the downstream iterator is holding, so it is hard to tell how much memory the iterator's prefetch is using.

This adds external_consumer=... to the verbose (in=..., out=...) field on the terminal operator's progress line whenever an external consumer is registered:

Before:

split(4, equal=True): Tasks: 0; Actors: 0; Queued blocks: 2 (256.1MiB); Resources: 0.0 CPU, 17.9GiB object store (in=0.0B,out=17.9GiB);

After:

split(4, equal=True): Tasks: 0; Actors: 0; Queued blocks: 2 (256.1MiB); Resources: 0.0 CPU, 17.9GiB object store (in=0.0B,out=17.9GiB,external_consumer=15.2GiB);

The field only appears on the terminal operator (since external consumers attach there) and only when a consumer is registered, so existing logs for pipelines without external consumers are unchanged.
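To make the conditional formatting concrete, here is a minimal sketch of how such a verbose usage field could be assembled. It is an illustration only, not the code from this PR: `format_bytes` and `verbose_object_store_usage` are hypothetical names, and passing `None` is an assumed stand-in for "no external consumer registered".

```python
from typing import Optional


def format_bytes(num_bytes: float) -> str:
    """Hypothetical helper: render a byte count in binary units, e.g. 17.9GiB."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that keeps the value under 1024,
        # or at the largest unit we know about.
        if abs(num_bytes) < 1024 or unit == "TiB":
            return f"{num_bytes:.1f}{unit}"
        num_bytes /= 1024


def verbose_object_store_usage(
    in_bytes: float,
    out_bytes: float,
    external_consumer_bytes: Optional[float] = None,
) -> str:
    """Hypothetical formatter for the verbose (in=..., out=...) field.

    external_consumer_bytes is None when no external consumer is registered
    (an assumption made for this sketch), in which case the field is omitted
    and the string matches the pre-change format.
    """
    parts = [f"in={format_bytes(in_bytes)}", f"out={format_bytes(out_bytes)}"]
    if external_consumer_bytes is not None:
        parts.append(f"external_consumer={format_bytes(external_consumer_bytes)}")
    return "(" + ",".join(parts) + ")"


GIB = 1024**3
without_consumer = verbose_object_store_usage(0, 17.9 * GIB)
with_consumer = verbose_object_store_usage(0, 17.9 * GIB, 15.2 * GIB)
print(without_consumer)
print(with_consumer)
```

Because the extra field is appended only when a value is supplied, callers that never register a consumer keep producing the original two-field string.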

Additional details

We plan to eventually remove this "external consumer tracking" logic from Ray Data, but for now this log is useful for debugging, at least internally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner May 29, 2026 17:09
@justinvyu justinvyu changed the title [Data] Show external consumer bytes in verbose operator progress log May 29, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the ResourceManager to surface external-consumer bytes in the verbose operator usage string of the terminal output operator, allowing users to see how much of the output memory is held by downstream iterators. A new unit test has been added to verify this behavior. There are no review comments, so I have no feedback to provide.

@justinvyu justinvyu enabled auto-merge (squash) May 29, 2026 17:30
@github-actions github-actions Bot added the "go" label (add ONLY when ready to merge, run all tests) May 29, 2026
@justinvyu justinvyu merged commit 711ca21 into ray-project:master May 29, 2026
6 of 7 checks passed
@justinvyu justinvyu deleted the justinvyu/data-show-external-consumer-bytes branch May 29, 2026 18:30
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…ay-project#63728)

This adds `external_consumer=...` to the verbose `(in=..., out=...)`
field on the terminal operator's progress line whenever an external
consumer is registered:

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026