[Data] Add file partitioning for DataSourceV2 [3/n] #61997

Merged
goutamvenkat-anyscale merged 4 commits into ray-project:master from goutamvenkat-anyscale:datasource-v2/partitioners on Mar 24, 2026

@goutamvenkat-anyscale

Contributor

Description

Add requisite abstractions for File partitioning, particularly the RoundRobinPartitioner.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale requested a review from a team as a code owner on March 23, 2026 23:00

@gemini-code-assist (bot) left a comment

Contributor

Code Review

The pull request introduces the FilePartitioner abstract base class and its concrete implementation, RoundRobinPartitioner, to handle file partitioning for DataSourceV2. The RoundRobinPartitioner effectively groups files into manifests based on estimated in-memory sizes, ensuring balanced read tasks. The overall design is clear and addresses the stated goal of adding file partitioning abstractions.
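The grouping described in this review can be illustrated with a minimal sketch. It is not the code merged in this PR. The names `FileEntry`, `estimate_in_memory_size`, `round_robin_partition`, and the fixed `expansion_ratio` are all hypothetical, and sorting by estimated size before dealing files out is an assumption about how size estimates might feed into a round-robin assignment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical file record: a path plus its on-disk size in bytes."""

    path: str
    size_bytes: int


def estimate_in_memory_size(entry: FileEntry, expansion_ratio: float = 3.0) -> int:
    # Assumption: in-memory size is the on-disk size times a fixed ratio.
    # A real estimator would likely depend on file format and encoding.
    return int(entry.size_bytes * expansion_ratio)


def round_robin_partition(
    files: List[FileEntry], num_partitions: int
) -> List[List[FileEntry]]:
    """Deal files into `num_partitions` groups in rotation."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Assumption: order by estimated size, largest first, so that the
    # biggest files land in different groups.
    ordered = sorted(files, key=estimate_in_memory_size, reverse=True)
    groups: List[List[FileEntry]] = [[] for _ in range(num_partitions)]
    for index, entry in enumerate(ordered):
        groups[index % num_partitions].append(entry)
    return groups
```

Each returned group stands in for one manifest, that is, the set of files a single read task would handle. Round-robin assignment keeps the file counts per group within one of each other. It only roughly evens out the estimated bytes per group.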

Comment thread python/ray/data/_internal/datasource_v2/partitioners/round_robin_partitioner.py Outdated
@goutamvenkat-anyscale added the "data" (Ray Data-related issues) and "go" (add ONLY when ready to merge, run all tests) labels on Mar 23, 2026
Signed-off-by: Goutam <goutam@anyscale.com>

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

InMemorySizeEstimator,
)

logger = logging.getLogger(__name__)

Unused logger variable defined but never referenced

Low Severity

The logging import and logger variable on line 13 are unused — no logger.debug(...), logger.warning(...), or any other call appears anywhere in this file. Other files in this module (e.g. file_indexer.py) define logger and actually use it, so this looks like copy-paste scaffolding that was never wired up.
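A finding like this one can be checked mechanically. The sketch below is a hypothetical helper, not part of the PR or of Bugbot. It uses the standard-library `ast` module to list names assigned at module level that are never read anywhere in the same source. It ignores names used only from other modules, so treat its output as a hint rather than proof.

```python
import ast
from typing import Set


def unused_module_names(source: str) -> Set[str]:
    """Return module-level assigned names that are never loaded in `source`."""
    tree = ast.parse(source)

    # Collect simple names bound by top-level assignments, e.g. `logger = ...`.
    assigned: Set[str] = set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)

    # Collect every name that is read (Load context) anywhere in the module.
    loaded = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
    return assigned - loaded
```

Running it on a module that defines `logger = logging.getLogger(__name__)` and never calls `logger` reports `logger` as unused.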

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale changed the title [Data] Add file partitioning for DataSourceV2 Mar 24, 2026
@goutamvenkat-anyscale merged commit 569eb4e into ray-project:master on Mar 24, 2026
6 checks passed
@goutamvenkat-anyscale deleted the datasource-v2/partitioners branch on March 24, 2026 20:14
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026

Labels: "data" (Ray Data-related issues), "go" (add ONLY when ready to merge, run all tests)

2 participants