[Data] Revisiting hashing method of Pyarrow schemas to improve perf#62108
Conversation
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request updates schema comparison and hashing logic to better align with Pyarrow's Schema interface, including adding an equals method to PandasBlockSchema. However, the changes introduce critical bugs: RefBundle.eq incorrectly attempts to compare a schema against a RefBundle instance and lacks proper null checks, while _make_hashable_schema fails for PandasBlockSchema objects which do not implement remove_metadata.
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
77f6c17 to
aa0867b
Compare
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
|
|
||
| def __hash__(self): |
There was a problem hiding this comment.
@alexeykudinkin do you recall the intent for overriding this in the first place? Will it cause issues if we remove it?
There was a problem hiding this comment.
It's been overridden to make BM hashable (we're previosuly hashing both blocks and metadata)
There was a problem hiding this comment.
Huh. Are block refs not hashable?
There was a problem hiding this comment.
ObjectRefs are hashable
| # NOTE: We're establishing a requirement of schemas for `RefBundle` | ||
| # to be exactly the same object for it to be considered equal. | ||
| # | ||
| # This is necessary to avoid a full schema equality check that | ||
| # is computationally intensive. | ||
| and self.schema is other.schema |
There was a problem hiding this comment.
What happens if we don't compare schemas at all? Like, if a bundle has the same blocks, don't they automatically have the same schema?
There was a problem hiding this comment.
You can still set different schema on the bundle itself, right?
There was a problem hiding this comment.
Yeah, you can pass whatever to for the schema field, but seems weird that could you have the exact same underlying PyArrow tables but distinct schemas
| from ray.data.dataset import _ExecutionCache | ||
|
|
||
| self.__dict__.update(state) | ||
| self._cache = _ExecutionCache() |
There was a problem hiding this comment.
where is this _ExecutionCache being used?
There was a problem hiding this comment.
It's used in ExecutionPlan and Dataset. This abstraction was previously the snapshot_bundle and snapshot_metadata attributes. It's a new abstraction that's part of our migration away from the legacy ExecutionPlan abstraction
There was a problem hiding this comment.
It's caching execution state, so in quite a few places
| # NOTE: We're truncating `input_files` from metadata as it could | ||
| # be carrying 1000s of input files (for `ImageDatasource` for ex) | ||
| # and isn't useful inside `DatasetStats` | ||
| replace(read_task.metadata, input_files=None) |
There was a problem hiding this comment.
Yes, always was
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
…ay-project#62108) ## Description This change is aiming to address following issues - Avoiding invoking equality/hashing on `Schema`s in `RefBundle` to reduce impact on large schemas - Avoid wiring `input_files` into `DatasetStats` to avoid carrying potentially large # of strings of input files (for ex, in cases of image datasets) ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Frank Mancina <fmancina@haproxy.com>
…ay-project#62108) ## Description This change is aiming to address following issues - Avoiding invoking equality/hashing on `Schema`s in `RefBundle` to reduce impact on large schemas - Avoid wiring `input_files` into `DatasetStats` to avoid carrying potentially large # of strings of input files (for ex, in cases of image datasets) ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Description
This change is aiming to address following issues
Schemas inRefBundleto reduce impact on large schemasinput_filesintoDatasetStatsto avoid carrying potentially large # of strings of input files (for ex, in cases of image datasets)Related issues
Additional information