[Data] Remove legacy BlockList class#60575
Conversation
There was a problem hiding this comment.
Code Review
This pull request is a solid refactoring that removes the legacy BlockList class, simplifying the data flow within Ray Data and eliminating unnecessary conversion overhead. The changes are clean, consistent, and well-motivated. I've included one suggestion to further optimize the logic in legacy_compat.py for better performance and memory efficiency. Overall, this is an excellent improvement to the codebase.
| @@ -169,8 +168,8 @@ def _get_initial_stats_from_plan(plan: ExecutionPlan) -> DatasetStats: | |||
| return plan._in_stats | |||
|
|
|||
|
|
|||
| def _bundles_to_block_list(bundles: Iterator[RefBundle]) -> BlockList: | |||
| blocks, metadata = [], [] | |||
| def _bundles_to_ref_bundle(bundles: Iterator[RefBundle]) -> RefBundle: | |||
There was a problem hiding this comment.
I think we can reuse merge_ref_bundles with some changes to it? Something like
def merge_ref_bundles(cls, bundles: Iterable["RefBundle"]) -> "RefBundle":
bundles = list(bundles)
if not bundles:
return cls(blocks=(), owns_blocks=True, schema=None)
merged_blocks = list(itertools.chain.from_iterable(bundle.blocks for bundle in bundles))
merged_slices = list(itertools.chain.from_iterable(bundle.slices for bundle in bundles))
owns_blocks = all(bundle.owns_blocks for bundle in bundles)
schema = _take_first_non_empty_schema(bundle.schema for bundle in bundles)
return cls(
blocks=tuple(merged_blocks),
schema=schema,
owns_blocks=owns_blocks,
slices=merged_slices,
)There was a problem hiding this comment.
Implemented! Using merge_ref_bundles() is cleaner and also fixed a couple of bugs in that method (schema selection and ownership calculation).
BlockList was an intermediate conversion layer between the executor's RefBundle output and the plan's RefBundle consumption. This removes the unnecessary round-trip by having execute_to_ref_bundle() return RefBundle directly. Changes: - Rename execute_to_legacy_block_list to execute_to_ref_bundle - Remove _bundles_to_block_list, add _bundles_to_ref_bundle - Remove unused _DatasetStatsBuilder.build() method - Update test_split.py to use RefBundle - Delete block_list.py Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
BlockList was an intermediate conversion layer between the executor's RefBundle output and the plan's RefBundle consumption. This removes the unnecessary round-trip by returning RefBundle directly. Changes: - Update RefBundle.merge_ref_bundles() to handle empty input, use _take_first_non_empty_schema, and properly compute owns_blocks - Rename execute_to_legacy_block_list to execute_to_ref_bundle - Use RefBundle.merge_ref_bundles() instead of custom helper - Remove unused _DatasetStatsBuilder.build() method - Update test_split.py to use RefBundle - Delete block_list.py Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
c986f54 to
d13bf00
Compare
| assert bundles, "Cannot merge an empty list of RefBundles." | ||
| merged_blocks = list(itertools.chain(*[bundle.blocks for bundle in bundles])) | ||
| merged_slices = list(itertools.chain(*[bundle.slices for bundle in bundles])) | ||
| def merge_ref_bundles(cls, bundles: Iterable["RefBundle"]) -> "RefBundle": |
There was a problem hiding this comment.
Let's add test for this method
There was a problem hiding this comment.
I think this method is already tested here:
ray/python/ray/data/tests/test_ref_bundle.py
Line 379 in 7ecbca7
Is there anything new we need to test for this refactor?
There was a problem hiding this comment.
Yeah, let's test owns_block semantic properly (while we're at it)
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Remove the `BlockList` class from Ray Data, eliminating unnecessary conversion overhead between `RefBundle` representations. **Why** `BlockList` existed as a legacy abstraction from an older execution model. After `LazyBlockList` was removed in ray-project#46054, the remaining `BlockList` only served as an intermediate conversion layer: 1. Executor produces `RefBundle` 2. `legacy_compat.py` converts to `BlockList` 3. `plan.py` converts back to `RefBundle` This round-trip is unnecessary overhead. **Changes** - `legacy_compat.py`: Renamed `execute_to_legacy_block_list()` → `execute_to_ref_bundle()`, returns `RefBundle` directly - `plan.py`: Uses `RefBundle` directly from executor - `stats.py`: Removed unused `_DatasetStatsBuilder.build()` method and `BlockList` import - `test_split.py`: Updated test helper to use `RefBundle` - Deleted `block_list.py` **Testing** All existing tests pass (424 split tests, execution tests, basic dataset operations). Fixes ray-project#60621 --------- Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Adel Nour <ans9868@nyu.edu>
Remove the `BlockList` class from Ray Data, eliminating unnecessary conversion overhead between `RefBundle` representations. **Why** `BlockList` existed as a legacy abstraction from an older execution model. After `LazyBlockList` was removed in ray-project#46054, the remaining `BlockList` only served as an intermediate conversion layer: 1. Executor produces `RefBundle` 2. `legacy_compat.py` converts to `BlockList` 3. `plan.py` converts back to `RefBundle` This round-trip is unnecessary overhead. **Changes** - `legacy_compat.py`: Renamed `execute_to_legacy_block_list()` → `execute_to_ref_bundle()`, returns `RefBundle` directly - `plan.py`: Uses `RefBundle` directly from executor - `stats.py`: Removed unused `_DatasetStatsBuilder.build()` method and `BlockList` import - `test_split.py`: Updated test helper to use `RefBundle` - Deleted `block_list.py` **Testing** All existing tests pass (424 split tests, execution tests, basic dataset operations). Fixes ray-project#60621 --------- Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Remove the
BlockListclass from Ray Data, eliminating unnecessary conversion overhead betweenRefBundlerepresentations.Why
BlockListexisted as a legacy abstraction from an older execution model. AfterLazyBlockListwas removed in #46054, the remainingBlockListonly served as an intermediate conversion layer:RefBundlelegacy_compat.pyconverts toBlockListplan.pyconverts back toRefBundleThis round-trip is unnecessary overhead.
Changes
legacy_compat.py: Renamedexecute_to_legacy_block_list()→execute_to_ref_bundle(), returnsRefBundledirectlyplan.py: UsesRefBundledirectly from executorstats.py: Removed unused_DatasetStatsBuilder.build()method andBlockListimporttest_split.py: Updated test helper to useRefBundleblock_list.pyTesting
All existing tests pass (424 split tests, execution tests, basic dataset operations).
Fixes #60621