Skip to content

[Data] Fix filesystem compatibility check for fsspec-wrapped PyFileSystem#61850

Merged
bveeramani merged 8 commits into
ray-project:masterfrom
xyuzh:fix-fsspec-filesystem-compatibility-check
Apr 13, 2026
Merged

[Data] Fix filesystem compatibility check for fsspec-wrapped PyFileSystem#61850
bveeramani merged 8 commits into
ray-project:masterfrom
xyuzh:fix-fsspec-filesystem-compatibility-check

Conversation

@xyuzh

@xyuzh xyuzh commented Mar 18, 2026

Copy link
Copy Markdown
Member

Summary

  • Fix _is_filesystem_compatible_with_scheme to properly inspect the inner fsspec protocol when a filesystem is wrapped in PyFileSystem, preventing incorrect rejection of user-provided fsspec filesystems (e.g., s3fs for s3:// URIs).
  • Upgrade OSError log level from debug to warning in download_bytes_threaded for better observability.
  • Add comprehensive unit tests for _is_filesystem_compatible_with_scheme covering native PyArrow filesystems, fsspec-wrapped filesystems with various protocols, and edge cases.

Test plan

  • Added TestIsFilesystemCompatibleWithScheme test class with tests for native S3/local filesystems, fsspec S3/GCS/HTTP wrappers, tuple protocols, scheme mismatches, bare paths, and unknown schemes.
  • Run existing Ray Data unit tests to verify no regressions.
…stem

The `_is_filesystem_compatible_with_scheme` function was not correctly
identifying fsspec filesystems wrapped in PyFileSystem. This caused
mismatches where a user-provided fsspec S3 filesystem would be rejected
for s3:// URIs. Now the inner fsspec protocol is inspected to determine
compatibility. Also upgrades OSError log level from debug to warning in
download_bytes_threaded.
@xyuzh xyuzh requested a review from a team as a code owner March 18, 2026 22:55
@xyuzh xyuzh requested a review from bveeramani March 18, 2026 23:02
Comment thread python/ray/data/datasource/path_util.py Outdated
The direct `fs_type in expected_types` check matched any PyFileSystem
(type_name "py") against http/https schemes, since the scheme map
includes "py" for those. This short-circuited before the inner fsspec
protocol check could reject non-HTTP wrappers like s3fs. Move the
direct match after the PyFileSystem branch so fsspec wrappers always
go through protocol inspection first.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a filesystem compatibility check for fsspec-wrapped filesystems, improves logging for OSError, and adds comprehensive unit tests. The changes look good, but I've found a potential issue in the new compatibility check logic that doesn't correctly handle RetryingPyFileSystem. I've provided a suggestion to fix this.

Comment thread python/ray/data/datasource/path_util.py
RetryingPyFileSystem has type_name "RetryingPyFileSystem", not "py",
so it was falling through to the direct match and failing. Include it
in the condition so it enters the fsspec protocol inspection branch.
Comment thread python/ray/data/datasource/path_util.py
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Mar 19, 2026
…lity

When RetryingPyFileSystem wraps a native PyArrow filesystem (e.g.,
S3FileSystem), unwrap() returns a non-PyFileSystem instance. The
isinstance(actual_fs, PyFileSystem) check then fails, skipping inner
protocol matching and falling through to return False — never reaching
the native type_name check. After unwrapping, check if the result is
a non-PyFileSystem and fall back to direct type_name matching.
Comment thread python/ray/data/tests/unit/test_path_util.py Outdated
Import from ray.data._internal.util, consistent with the rest of the
codebase.
Comment thread python/ray/data/tests/unit/test_path_util.py Outdated
@xyuzh xyuzh added the go add ONLY when ready to merge, run all tests label Mar 25, 2026
@github-actions

github-actions Bot commented Apr 9, 2026

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 9, 2026
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
assert resolved_paths[0] == path


class TestIsFilesystemCompatibleWithScheme:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, I think we could improve these tests by testing against real implementations like S3FileSystem rather than mocks with fake type_name.

The current set of tests are brittle:

  • They assume _is_filesystem_compatible_with_scheme's implementation uses type_name. If that implementation detail changes, these tests break.
  • They assume the type_name of PyArrow filesystem implementations. If PyArrow changes these, our software will break but it won't be caught by these tests
@bveeramani bveeramani enabled auto-merge (squash) April 13, 2026 03:32
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@github-actions github-actions Bot disabled auto-merge April 13, 2026 07:21

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 0e1c8e8. Configure here.


actual_fs = filesystem
if isinstance(actual_fs, RetryingPyFileSystem):
actual_fs = actual_fs.unwrap()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant unwrapping duplicates earlier computation

Low Severity

The actual_fs computation on lines 206–208 exactly duplicates the unwrapped computation on lines 191–195 — both unwrap RetryingPyFileSystem to get the inner filesystem. After both run, actual_fs always equals unwrapped. Having two separate variables that track the same value introduces unnecessary complexity and a maintenance risk: if one unwrapping path is updated but the other is forgotten, a subtle bug would be introduced. The code could simply reuse unwrapped instead of re-deriving actual_fs from filesystem.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0e1c8e8. Configure here.

@bveeramani bveeramani merged commit bd60896 into ray-project:master Apr 13, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…stem (ray-project#61850)

## Summary
- Fix `_is_filesystem_compatible_with_scheme` to properly inspect the
inner fsspec protocol when a filesystem is wrapped in `PyFileSystem`,
preventing incorrect rejection of user-provided fsspec filesystems
(e.g., s3fs for `s3://` URIs).
- Upgrade OSError log level from `debug` to `warning` in
`download_bytes_threaded` for better observability.
- Add comprehensive unit tests for
`_is_filesystem_compatible_with_scheme` covering native PyArrow
filesystems, fsspec-wrapped filesystems with various protocols, and edge
cases.

## Test plan
- [x] Added `TestIsFilesystemCompatibleWithScheme` test class with tests
for native S3/local filesystems, fsspec S3/GCS/HTTP wrappers, tuple
protocols, scheme mismatches, bare paths, and unknown schemes.
- [ ] Run existing Ray Data unit tests to verify no regressions.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…stem (ray-project#61850)

## Summary
- Fix `_is_filesystem_compatible_with_scheme` to properly inspect the
inner fsspec protocol when a filesystem is wrapped in `PyFileSystem`,
preventing incorrect rejection of user-provided fsspec filesystems
(e.g., s3fs for `s3://` URIs).
- Upgrade OSError log level from `debug` to `warning` in
`download_bytes_threaded` for better observability.
- Add comprehensive unit tests for
`_is_filesystem_compatible_with_scheme` covering native PyArrow
filesystems, fsspec-wrapped filesystems with various protocols, and edge
cases.

## Test plan
- [x] Added `TestIsFilesystemCompatibleWithScheme` test class with tests
for native S3/local filesystems, fsspec S3/GCS/HTTP wrappers, tuple
protocols, scheme mismatches, bare paths, and unknown schemes.
- [ ] Run existing Ray Data unit tests to verify no regressions.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests stale The issue is stale. It will be closed within 7 days unless there are further conversation

2 participants