[Data] add pathlib.Path support to read_* functions#61126
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds support for pathlib.Path objects to ray.data.read_* functions, which is a great usability improvement. The changes in _is_local_scheme and _resolve_paths_and_filesystem correctly handle single Path objects and lists of Path objects. The new test case properly validates this functionality.
I've added one comment to improve consistency between the two modified utility functions. Additionally, I've noticed that the path normalization logic is now duplicated in ray.data._internal.util._is_local_scheme and ray.data.datasource.path_util._resolve_paths_and_filesystem. It would be beneficial to refactor this into a shared helper function in a follow-up PR to enhance maintainability.
| @@ -44,6 +44,25 @@ def test_read_text(ray_start_regular_shared, tmp_path): | |||
| assert ds.count() == 4 | |||
|
|
|||
|
|
|||
| def test_read_text_with_pathlib(ray_start_regular_shared, tmp_path): | |||
There was a problem hiding this comment.
The pathlib stuff isn't specific to read_text (or any specific file format). Let's move it to test_file_based_datasource, and rewrite it with execute_read_tasks (so we can verify this behavior without launching Ray tasks?)
There was a problem hiding this comment.
Ah~ no problem. I get the difference now!!
| ds = ray.data.read_text([file1, file2]) | ||
| assert sorted(_to_lines(ds.take())) == ["goodbye", "hello", "world"] | ||
|
|
||
| ds = ray.data.read_text(file1) | ||
| assert sorted(_to_lines(ds.take())) == ["hello", "world"] |
There was a problem hiding this comment.
Nit: Might be worth explicitly commenting that the intent is to verify both lists of pathlib.Path and a single pathlib.Path. Intent wasn't immediately obvious to me when I skimmed the test
|
@bveeramani PTAL~ thank you. |
|
@yuhuan130 I've approved and enable merge. Lmk if it doesn't merge, and I can try again |
|
@yuhuan130 The failure in Buildkite microcheck (build 38734) is due to a raydepsets lockfile verification mismatch. The failure in Buildkite premerge (build 60767) is still a TIMEOUT in |
Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
Head branch was pushed to by a user without write access
ad816c6 to
b5be98a
Compare

Why are these changes needed?
ray.data.read_*functions don't acceptpathlib.Pathobjects, forcing users to manually convert to strings. This PR adds native support for Path objects.What's included?
Modified:
path_util.py- convert Path objects to strings in_resolve_paths_and_filesystem()✅util.py- handle Path objects in_is_local_scheme()✅Added:
test_read_text_with_pathlib()- test single Path and list of Paths ✅Screenshots
Benefits
Note
Closes #61084