Skip to content

[Data] add pathlib.Path support to read_* functions#61126

Merged
bveeramani merged 3 commits into
ray-project:masterfrom
yuhuan130:fix/pathlib-support-file-datasource
Feb 25, 2026
Merged

[Data] add pathlib.Path support to read_* functions#61126
bveeramani merged 3 commits into
ray-project:masterfrom
yuhuan130:fix/pathlib-support-file-datasource

Conversation

@yuhuan130

Copy link
Copy Markdown
Contributor

Why are these changes needed?

ray.data.read_* functions don't accept pathlib.Path objects, forcing users to manually convert to strings. This PR adds native support for Path objects.

What's included?

Modified:

  • path_util.py - convert Path objects to strings in _resolve_paths_and_filesystem()
  • util.py - handle Path objects in _is_local_scheme()

Added:

  • test_read_text_with_pathlib() - test single Path and list of Paths ✅

Screenshots

Screenshot 2026-02-17 at 17 36 42

Benefits

  • More Pythonic API, works with modern path handling

Note

Closes #61084

@yuhuan130 yuhuan130 requested a review from a team as a code owner February 18, 2026 01:53

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for pathlib.Path objects to ray.data.read_* functions, which is a great usability improvement. The changes in _is_local_scheme and _resolve_paths_and_filesystem correctly handle single Path objects and lists of Path objects. The new test case properly validates this functionality.

I've added one comment to improve consistency between the two modified utility functions. Additionally, I've noticed that the path normalization logic is now duplicated in ray.data._internal.util._is_local_scheme and ray.data.datasource.path_util._resolve_paths_and_filesystem. It would be beneficial to refactor this into a shared helper function in a follow-up PR to enhance maintainability.

Comment thread python/ray/data/_internal/util.py
@yuhuan130 yuhuan130 changed the title data: add pathlib.Path support to read_* functions Feb 18, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/data/_internal/util.py
@ray-gardener ray-gardener Bot added the community-contribution Contributed by the community label Feb 18, 2026
@@ -44,6 +44,25 @@ def test_read_text(ray_start_regular_shared, tmp_path):
assert ds.count() == 4


def test_read_text_with_pathlib(ray_start_regular_shared, tmp_path):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pathlib stuff isn't specific to read_text (or any specific file format). Let's move it to test_file_based_datasource, and rewrite it with execute_read_tasks (so we can verify this behavior without launching Ray tasks?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah~ no problem. I get the difference now!!

Comment on lines +59 to +63
ds = ray.data.read_text([file1, file2])
assert sorted(_to_lines(ds.take())) == ["goodbye", "hello", "world"]

ds = ray.data.read_text(file1)
assert sorted(_to_lines(ds.take())) == ["hello", "world"]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Might be worth explicitly commenting that the intent is to verify both lists of pathlib.Path and a single pathlib.Path. Intent wasn't immediately obvious to me when I skimmed the test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@yuhuan130

Copy link
Copy Markdown
Contributor Author
Screenshot 2026-02-18 at 18 40 21

Updated test result

@yuhuan130

Copy link
Copy Markdown
Contributor Author

@bveeramani PTAL~ thank you.

@bveeramani bveeramani left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@bveeramani bveeramani enabled auto-merge (squash) February 24, 2026 02:23
@bveeramani

Copy link
Copy Markdown
Member

@yuhuan130 I've approved and enable merge. Lmk if it doesn't merge, and I can try again

@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Feb 24, 2026
@slfan1989

Copy link
Copy Markdown
Contributor

@yuhuan130 The failure in Buildkite microcheck (build 38734) is due to a raydepsets lockfile verification mismatch. The failure in Buildkite premerge (build 60767) is still a TIMEOUT in //python/ray/data:test_with_column. I think we can update the branch.

Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
auto-merge was automatically disabled February 25, 2026 01:08

Head branch was pushed to by a user without write access

@yuhuan130 yuhuan130 force-pushed the fix/pathlib-support-file-datasource branch from ad816c6 to b5be98a Compare February 25, 2026 01:08
@bveeramani bveeramani merged commit 817204a into ray-project:master Feb 25, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests

3 participants