Skip to content

[Data] Revisiting hashing method of Pyarrow schemas to improve perf#62108

Merged
alexeykudinkin merged 13 commits into
masterfrom
ak/ref-bndl-perf-fix
Mar 27, 2026
Merged

[Data] Revisiting hashing method of Pyarrow schemas to improve perf#62108
alexeykudinkin merged 13 commits into
masterfrom
ak/ref-bndl-perf-fix

Conversation

@alexeykudinkin

@alexeykudinkin alexeykudinkin commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

Description

This change is aiming to address following issues

  • Avoiding invoking equality/hashing on Schemas in RefBundle to reduce impact on large schemas
  • Avoid wiring input_files into DatasetStats to avoid carrying potentially large # of strings of input files (for ex, in cases of image datasets)

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner March 26, 2026 18:40

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates schema comparison and hashing logic to better align with Pyarrow's Schema interface, including adding an equals method to PandasBlockSchema. However, the changes introduce critical bugs: RefBundle.eq incorrectly attempts to compare a schema against a RefBundle instance and lacks proper null checks, while _make_hashable_schema fails for PandasBlockSchema objects which do not implement remove_metadata.

Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py Outdated
Comment thread python/ray/data/block.py Outdated
Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py Outdated
Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py Outdated
Comment thread python/ray/data/block.py Outdated
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Mar 26, 2026
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Comment thread python/ray/data/block.py Outdated
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Mar 26, 2026
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin changed the title [WIP][Data] Revisiting hashing method of Pyarrow schemas to improve perf Mar 26, 2026
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) March 26, 2026 23:03
Comment thread python/ray/data/block.py
Comment on lines 316 to -317

def __hash__(self):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykudinkin do you recall the intent for overriding this in the first place? Will it cause issues if we remove it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been overridden to make BM hashable (we're previosuly hashing both blocks and metadata)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. Are block refs not hashable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObjectRefs are hashable

Comment on lines +367 to +372
# NOTE: We're establishing a requirement of schemas for `RefBundle`
# to be exactly the same object for it to be considered equal.
#
# This is necessary to avoid a full schema equality check that
# is computationally intensive.
and self.schema is other.schema

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we don't compare schemas at all? Like, if a bundle has the same blocks, don't they automatically have the same schema?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still set different schema on the bundle itself, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you can pass whatever to for the schema field, but seems weird that could you have the exact same underlying PyArrow tables but distinct schemas

from ray.data.dataset import _ExecutionCache

self.__dict__.update(state)
self._cache = _ExecutionCache()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this _ExecutionCache being used?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in ExecutionPlan and Dataset. This abstraction was previously the snapshot_bundle and snapshot_metadata attributes. It's a new abstraction that's part of our migration away from the legacy ExecutionPlan abstraction

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's caching execution state, so in quite a few places

# NOTE: We're truncating `input_files` from metadata as it could
# be carrying 1000s of input files (for `ImageDatasource` for ex)
# and isn't useful inside `DatasetStats`
replace(read_task.metadata, input_files=None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, was this an issue before #61059

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, always was

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@github-actions github-actions Bot disabled auto-merge March 26, 2026 23:50

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Comment thread python/ray/data/block.py
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) March 27, 2026 01:06
@alexeykudinkin alexeykudinkin merged commit d032daf into master Mar 27, 2026
7 checks passed
@alexeykudinkin alexeykudinkin deleted the ak/ref-bndl-perf-fix branch March 27, 2026 01:39
mancfactor pushed a commit to mancfactor/ray that referenced this pull request Apr 2, 2026
…ay-project#62108)

## Description

This change is aiming to address following issues

- Avoiding invoking equality/hashing on `Schema`s in `RefBundle` to
reduce impact on large schemas
- Avoid wiring `input_files` into `DatasetStats` to avoid carrying
potentially large # of strings of input files (for ex, in cases of image
datasets)

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ay-project#62108)

## Description

This change is aiming to address following issues

- Avoiding invoking equality/hashing on `Schema`s in `RefBundle` to
reduce impact on large schemas
- Avoid wiring `input_files` into `DatasetStats` to avoid carrying
potentially large # of strings of input files (for ex, in cases of image
datasets)

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants