[Data] Revisiting hashing method of Pyarrow schemas to improve perf by alexeykudinkin · Pull Request #62108 · ray-project/ray

alexeykudinkin · 2026-03-26T18:40:32Z

Description

This change aims to address the following issues:

- Avoid invoking equality/hashing on `Schema`s in `RefBundle` to reduce the impact on large schemas
- Avoid wiring `input_files` into `DatasetStats`, so that it does not carry a potentially large number of input-file strings (for example, with image datasets)

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

gemini-code-assist

Code Review

This pull request updates schema comparison and hashing logic to better align with PyArrow's `Schema` interface, including adding an `equals` method to `PandasBlockSchema`. However, the changes introduce critical bugs: `RefBundle.__eq__` incorrectly attempts to compare a schema against a `RefBundle` instance and lacks proper null checks, while `_make_hashable_schema` fails for `PandasBlockSchema` objects, which do not implement `remove_metadata`.

bveeramani · 2026-03-26T23:06:35Z


-    def __hash__(self):


@alexeykudinkin do you recall the intent for overriding this in the first place? Will it cause issues if we remove it?

It's been overridden to make BM hashable (we were previously hashing both blocks and metadata)

Huh. Are block refs not hashable?

ObjectRefs are hashable

bveeramani · 2026-03-26T23:08:08Z

+            # NOTE: We're establishing a requirement of schemas for `RefBundle`
+            #       to be exactly the same object for it to be considered equal.
+            #
+            #       This is necessary to avoid a full schema equality check that
+            #       is computationally intensive.
+            and self.schema is other.schema


What happens if we don't compare schemas at all? Like, if a bundle has the same blocks, don't they automatically have the same schema?

You can still set different schema on the bundle itself, right?

Yeah, you can pass whatever you want for the schema field, but it seems weird that you could have the exact same underlying PyArrow tables but distinct schemas
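
For readers following this thread, here is a minimal sketch of the trade-off being discussed: equality that compares block refs by value but schemas by identity, with a hash computed from the block refs only. `BundleSketch` and its fields are hypothetical stand-ins, not Ray's actual `RefBundle`, and plain strings stand in for `ObjectRef`s.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(eq=False)
class BundleSketch:
    """Hypothetical stand-in for RefBundle; not Ray's actual class."""

    block_refs: Tuple[str, ...]  # stand-ins for hashable ObjectRefs
    schema: Optional[Any] = None

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, BundleSketch):
            return NotImplemented
        # Identity check on the schema: constant time, however wide the schema is.
        return self.block_refs == other.block_refs and self.schema is other.schema

    def __hash__(self) -> int:
        # Hash only the block refs. Equal bundles share the same refs,
        # so this stays consistent with __eq__.
        return hash(self.block_refs)


shared_schema = {"col_a": "int64", "col_b": "string"}
a = BundleSketch(("ref-1", "ref-2"), shared_schema)
b = BundleSketch(("ref-1", "ref-2"), shared_schema)
# Same contents, but a distinct schema object.
c = BundleSketch(("ref-1", "ref-2"), dict(shared_schema))

print(a == b)  # same schema object
print(a == c)  # equal-by-value schema, different object
```

Under the identity rule, `a` and `c` compare unequal even though their schemas hold the same fields, which is the behavior the `NOTE` in the diff accepts in exchange for skipping a full schema comparison.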

iamjustinhsu · 2026-03-26T23:05:53Z

+        from ray.data.dataset import _ExecutionCache
+
+        self.__dict__.update(state)
+        self._cache = _ExecutionCache()


where is this _ExecutionCache being used?

It's used in ExecutionPlan and Dataset. This abstraction was previously the snapshot_bundle and snapshot_metadata attributes. It's a new abstraction that's part of our migration away from the legacy ExecutionPlan abstraction

It's caching execution state, so in quite a few places
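
A rough sketch of the pattern in the diff above, assuming the goal is to keep cached execution state out of the serialized plan and start from an empty cache after deserialization. `PlanSketch` and `CacheSketch` are hypothetical names, not Ray's `ExecutionPlan` or `_ExecutionCache`; the `__getstate__` half is an assumption, since only `__setstate__` is shown in the diff.

```python
import pickle


class CacheSketch:
    """Hypothetical stand-in for an execution cache."""

    def __init__(self):
        self.entries = {}


class PlanSketch:
    """Hypothetical stand-in for a plan object that owns a cache."""

    def __init__(self, name):
        self.name = name
        self._cache = CacheSketch()

    def __getstate__(self):
        # Assumed counterpart to the diff: drop the cache before serializing.
        state = self.__dict__.copy()
        state.pop("_cache", None)
        return state

    def __setstate__(self, state):
        # Mirrors the diff: restore attributes, then attach a fresh cache.
        self.__dict__.update(state)
        self._cache = CacheSketch()


plan = PlanSketch("read->map")
plan._cache.entries["bundle"] = "large cached result"

# pickle calls these hooks automatically; they are invoked by hand here
# so the serialized state itself can be inspected.
payload = pickle.dumps(plan.__getstate__())
restored = PlanSketch.__new__(PlanSketch)
restored.__setstate__(pickle.loads(payload))

print(restored.name, restored._cache.entries)
```

The original object keeps its cache; only the serialized copy starts empty.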

iamjustinhsu · 2026-03-26T23:11:03Z

+                # NOTE: We're truncating `input_files` from metadata as it could
+                #       be carrying 1000s of input files (for `ImageDatasource` for ex)
+                #       and isn't useful inside `DatasetStats`
+                replace(read_task.metadata, input_files=None)


Wait, was this an issue before #61059?

Yes, always was
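
To illustrate the `replace(...)` call in the diff above, here is a minimal sketch assuming the metadata is a dataclass and `replace` is `dataclasses.replace`. `MetadataSketch` and its fields are hypothetical, not Ray's actual metadata class.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class MetadataSketch:
    """Hypothetical stand-in for a read task's metadata."""

    num_rows: Optional[int]
    size_bytes: Optional[int]
    input_files: Optional[List[str]] = None


full = MetadataSketch(
    num_rows=1000,
    size_bytes=4096,
    input_files=[f"img_{i}.png" for i in range(5000)],
)

# Copy the metadata with `input_files` cleared; other fields carry over
# and the original object is left untouched.
trimmed = replace(full, input_files=None)

print(trimmed)
```

Because `replace` returns a new object, the read task keeps its full metadata while the stats object holds only the trimmed copy.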

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


…ay-project#62108)

## Description

This change is aiming to address following issues

- Avoiding invoking equality/hashing on `Schema`s in `RefBundle` to reduce impact on large schemas
- Avoid wiring `input_files` into `DatasetStats` to avoid carrying potentially large # of strings of input files (for ex, in cases of image datasets)

## Related issues

> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>

alexeykudinkin added 3 commits March 26, 2026 11:38

Revisited hash method of the Pyarrow schemas to just drop the metadata

9061c25

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed equality check to align with hash method

ced403e

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Aligned PandasBlockSchema with Pyarrow's Schema

7a74947

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin requested a review from a team as a code owner March 26, 2026 18:40

gemini-code-assist Bot reviewed Mar 26, 2026

Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py Outdated

Comment thread python/ray/data/block.py Outdated

cursor Bot reviewed Mar 26, 2026

Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py Outdated

Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py Outdated

Comment thread python/ray/data/block.py Outdated

alexeykudinkin added the `go` label (add ONLY when ready to merge, run all tests) Mar 26, 2026

alexeykudinkin added 2 commits March 26, 2026 11:54

Fixed eq check

89f294e

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed hashing util

aa0867b

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin force-pushed the ak/ref-bndl-perf-fix branch from 77f6c17 to aa0867b March 26, 2026 18:54

cursor Bot reviewed Mar 26, 2026

Comment thread python/ray/data/block.py Outdated

ray-gardener Bot added the `data` label (Ray Data-related issues) Mar 26, 2026

alexeykudinkin added 7 commits March 26, 2026 15:21

Avoid carrying input_files inside DatasetStats

d3b3c62

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Avoid serializing _ExecutionCache inside ExecutionPlan

b353deb

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

dd04383

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Avoid schema comparison in RefBundle

51fd409

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Reverting changes

4e38052

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

08c6011

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Deleting

62beaf7

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin changed the title ~~[WIP][Data] Revisiting hashing method of Pyarrow schemas to improve perf~~ [Data] Revisiting hashing method of Pyarrow schemas to improve perf Mar 26, 2026

alexeykudinkin enabled auto-merge (squash) March 26, 2026 23:03

bveeramani reviewed Mar 26, 2026

iamjustinhsu reviewed Mar 26, 2026

Only hash blocks

3c1f4be

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

github-actions Bot disabled auto-merge March 26, 2026 23:50

cursor Bot reviewed Mar 26, 2026

Comment thread python/ray/data/block.py

alexeykudinkin enabled auto-merge (squash) March 27, 2026 01:06

iamjustinhsu approved these changes Mar 27, 2026

alexeykudinkin merged commit d032daf into master Mar 27, 2026
7 checks passed

alexeykudinkin deleted the ak/ref-bndl-perf-fix branch March 27, 2026 01:39

alexeykudinkin merged 13 commits into master from ak/ref-bndl-perf-fix
