[Data] Remove legacy BlockList class #60575

Merged
bveeramani merged 7 commits into ray-project:master from pushpavanthar:deprecate_blocklist on Feb 11, 2026
Conversation

pushpavanthar (Contributor) commented Jan 29, 2026

Remove the BlockList class from Ray Data, eliminating unnecessary conversion overhead between RefBundle representations.

Why
BlockList existed as a legacy abstraction from an older execution model. After LazyBlockList was removed in #46054, the remaining BlockList only served as an intermediate conversion layer:

  1. Executor produces RefBundle
  2. legacy_compat.py converts to BlockList
  3. plan.py converts back to RefBundle

This round-trip is unnecessary overhead.
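
The round-trip described above can be illustrated with a small toy model. This is only a sketch: `ToyRefBundle`, `ToyBlockList`, and the three helper functions are hypothetical stand-ins invented for this example, not Ray's actual classes or functions, and the string refs and dict metadata are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


# Toy stand-ins for illustration only; these are NOT Ray's real classes.
@dataclass(frozen=True)
class ToyRefBundle:
    blocks: Tuple[Tuple[str, dict], ...]  # (block ref, metadata) pairs
    owns_blocks: bool


@dataclass
class ToyBlockList:
    refs: List[str]
    metadata: List[dict]
    owned_by_consumer: bool


def bundles_to_block_list(bundles: Iterable[ToyRefBundle]) -> ToyBlockList:
    """Step 2 of the old flow: split bundles into parallel lists."""
    refs, metadata, owns = [], [], True
    for bundle in bundles:
        owns = owns and bundle.owns_blocks
        for ref, meta in bundle.blocks:
            refs.append(ref)
            metadata.append(meta)
    return ToyBlockList(refs, metadata, owns)


def block_list_to_bundle(block_list: ToyBlockList) -> ToyRefBundle:
    """Step 3 of the old flow: zip the parallel lists back into one bundle."""
    pairs = tuple(zip(block_list.refs, block_list.metadata))
    return ToyRefBundle(pairs, block_list.owned_by_consumer)


def bundles_to_bundle(bundles: Iterable[ToyRefBundle]) -> ToyRefBundle:
    """The direct flow: concatenate bundles without the intermediate type."""
    bundles = list(bundles)
    pairs = tuple(pair for bundle in bundles for pair in bundle.blocks)
    return ToyRefBundle(pairs, all(bundle.owns_blocks for bundle in bundles))


executor_output = [
    ToyRefBundle((("ref-0", {"num_rows": 3}),), owns_blocks=True),
    ToyRefBundle((("ref-1", {"num_rows": 5}),), owns_blocks=True),
]

round_trip = block_list_to_bundle(bundles_to_block_list(executor_output))
direct = bundles_to_bundle(executor_output)
```

In this toy model both paths yield the same bundle, so the intermediate type adds work without changing the result.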

Changes

  • legacy_compat.py: Renamed execute_to_legacy_block_list() → execute_to_ref_bundle(), returns RefBundle directly
  • plan.py: Uses RefBundle directly from executor
  • stats.py: Removed unused _DatasetStatsBuilder.build() method and BlockList import
  • test_split.py: Updated test helper to use RefBundle
  • Deleted block_list.py

Testing
All existing tests pass (424 split tests, execution tests, basic dataset operations).

Fixes #60621

pushpavanthar requested a review from a team as a code owner on January 29, 2026 03:19

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request is a solid refactoring that removes the legacy BlockList class, simplifying the data flow within Ray Data and eliminating unnecessary conversion overhead. The changes are clean, consistent, and well-motivated. I've included one suggestion to further optimize the logic in legacy_compat.py for better performance and memory efficiency. Overall, this is an excellent improvement to the codebase.

ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on Jan 29, 2026
@@ -169,8 +168,8 @@ def _get_initial_stats_from_plan(plan: ExecutionPlan) -> DatasetStats:
     return plan._in_stats


-def _bundles_to_block_list(bundles: Iterator[RefBundle]) -> BlockList:
-    blocks, metadata = [], []
+def _bundles_to_ref_bundle(bundles: Iterator[RefBundle]) -> RefBundle:

Member

I think we can reuse merge_ref_bundles with some changes to it? Something like

    @classmethod
    def merge_ref_bundles(cls, bundles: Iterable["RefBundle"]) -> "RefBundle":
        # Materialize once so the iterable can be traversed several times.
        bundles = list(bundles)
        if not bundles:
            return cls(blocks=(), owns_blocks=True, schema=None)
        merged_blocks = list(itertools.chain.from_iterable(bundle.blocks for bundle in bundles))
        merged_slices = list(itertools.chain.from_iterable(bundle.slices for bundle in bundles))
        # The merged bundle owns its blocks only if every input bundle does.
        owns_blocks = all(bundle.owns_blocks for bundle in bundles)
        schema = _take_first_non_empty_schema(bundle.schema for bundle in bundles)
        return cls(
            blocks=tuple(merged_blocks),
            schema=schema,
            owns_blocks=owns_blocks,
            slices=merged_slices,
        )

Contributor Author

Implemented! Using merge_ref_bundles() is cleaner, and I also fixed a couple of bugs in that method (schema selection and ownership calculation).
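
The merge semantics discussed in this thread (empty input, schema selection, and ownership calculation) can be sketched with a self-contained toy model. Everything here is an assumption made for illustration: `ToyRefBundle`, `first_non_empty_schema`, and `merge_bundles` are hypothetical names, block refs are plain strings, and the schema helper only approximates what `_take_first_non_empty_schema` is presumed to do. It is not the code merged in this PR.

```python
import itertools
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


# Toy stand-in for illustration only; NOT Ray's real RefBundle.
@dataclass(frozen=True)
class ToyRefBundle:
    blocks: Tuple[str, ...]
    owns_blocks: bool
    schema: Optional[Tuple[str, ...]] = None


def first_non_empty_schema(
    schemas: Iterable[Optional[Tuple[str, ...]]]
) -> Optional[Tuple[str, ...]]:
    """Hypothetical helper: return the first schema that is neither None nor empty."""
    for schema in schemas:
        if schema:
            return schema
    return None


def merge_bundles(bundles: Iterable[ToyRefBundle]) -> ToyRefBundle:
    # Materialize once so generators and iterators can be traversed repeatedly.
    bundles = list(bundles)
    if not bundles:
        # Empty input yields an empty bundle instead of failing an assertion.
        return ToyRefBundle(blocks=(), owns_blocks=True, schema=None)
    return ToyRefBundle(
        blocks=tuple(itertools.chain.from_iterable(b.blocks for b in bundles)),
        # The merged bundle owns its blocks only if every input bundle does.
        owns_blocks=all(b.owns_blocks for b in bundles),
        schema=first_non_empty_schema(b.schema for b in bundles),
    )


empty = merge_bundles([])
merged = merge_bundles(
    iter(
        [
            ToyRefBundle(("ref-0",), owns_blocks=True, schema=None),
            ToyRefBundle(("ref-1", "ref-2"), owns_blocks=False, schema=("id", "value")),
        ]
    )
)
```

In this sketch a single non-owning input makes the merged bundle non-owning, and the schema comes from the second bundle because the first one has none.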

BlockList was an intermediate conversion layer between the executor's
RefBundle output and the plan's RefBundle consumption. This removes
the unnecessary round-trip by having execute_to_ref_bundle() return
RefBundle directly.

Changes:
- Rename execute_to_legacy_block_list to execute_to_ref_bundle
- Remove _bundles_to_block_list, add _bundles_to_ref_bundle
- Remove unused _DatasetStatsBuilder.build() method
- Update test_split.py to use RefBundle
- Delete block_list.py

Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
BlockList was an intermediate conversion layer between the executor's
RefBundle output and the plan's RefBundle consumption. This removes
the unnecessary round-trip by returning RefBundle directly.

Changes:
- Update RefBundle.merge_ref_bundles() to handle empty input, use
  _take_first_non_empty_schema, and properly compute owns_blocks
- Rename execute_to_legacy_block_list to execute_to_ref_bundle
- Use RefBundle.merge_ref_bundles() instead of custom helper
- Remove unused _DatasetStatsBuilder.build() method
- Update test_split.py to use RefBundle
- Delete block_list.py

Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
bveeramani enabled auto-merge (squash) on January 30, 2026 21:36
github-actions bot added the go (add ONLY when ready to merge, run all tests) label on Jan 30, 2026
github-actions bot disabled auto-merge on January 31, 2026 03:26
Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py
Comment thread python/ray/data/_internal/execution/interfaces/ref_bundle.py
    assert bundles, "Cannot merge an empty list of RefBundles."
    merged_blocks = list(itertools.chain(*[bundle.blocks for bundle in bundles]))
    merged_slices = list(itertools.chain(*[bundle.slices for bundle in bundles]))
    def merge_ref_bundles(cls, bundles: Iterable["RefBundle"]) -> "RefBundle":

Contributor

Let's add a test for this method.

Member

I think this method is already tested here:

def test_merge_ref_bundles():

Is there anything new we need to test for this refactor?

Contributor

Yeah, let's test the owns_blocks semantics properly (while we're at it).

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
bveeramani enabled auto-merge (squash) on February 11, 2026 01:39
github-actions bot disabled auto-merge on February 11, 2026 01:39
bveeramani merged commit 6f0458b into ray-project:master on Feb 11, 2026
7 checks passed
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
Labels

community-contribution: Contributed by the community
data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

5 participants