[Data] Fix filesystem compatibility check for fsspec-wrapped PyFileSystem#61850
Conversation
…stem The `_is_filesystem_compatible_with_scheme` function was not correctly identifying fsspec filesystems wrapped in PyFileSystem. This caused mismatches where a user-provided fsspec S3 filesystem would be rejected for s3:// URIs. Now the inner fsspec protocol is inspected to determine compatibility. Also upgrades OSError log level from debug to warning in download_bytes_threaded.
The direct `fs_type in expected_types` check matched any PyFileSystem (type_name "py") against http/https schemes, since the scheme map includes "py" for those. This short-circuited before the inner fsspec protocol check could reject non-HTTP wrappers like s3fs. Move the direct match after the PyFileSystem branch so fsspec wrappers always go through protocol inspection first.
There was a problem hiding this comment.
Code Review
This pull request fixes a filesystem compatibility check for fsspec-wrapped filesystems, improves logging for OSError, and adds comprehensive unit tests. The changes look good, but I've found a potential issue in the new compatibility check logic that doesn't correctly handle RetryingPyFileSystem. I've provided a suggestion to fix this.
RetryingPyFileSystem has type_name "RetryingPyFileSystem", not "py", so it was falling through to the direct match and failing. Include it in the condition so it enters the fsspec protocol inspection branch.
…lity When RetryingPyFileSystem wraps a native PyArrow filesystem (e.g., S3FileSystem), unwrap() returns a non-PyFileSystem instance. The isinstance(actual_fs, PyFileSystem) check then fails, skipping inner protocol matching and falling through to return False — never reaching the native type_name check. After unwrapping, check if the result is a non-PyFileSystem and fall back to direct type_name matching.
Import from ray.data._internal.util, consistent with the rest of the codebase.
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
| assert resolved_paths[0] == path | ||
|
|
||
|
|
||
| class TestIsFilesystemCompatibleWithScheme: |
There was a problem hiding this comment.
For future reference, I think we could improve these tests by testing against real implementations like S3FileSystem rather than mocks with fake type_name.
The current set of tests are brittle:
- They assume
_is_filesystem_compatible_with_scheme's implementation usestype_name. If that implementation detail changes, these tests break. - They assume the
type_nameof PyArrow filesystem implementations. If PyArrow changes these, our software will break but it won't be caught by these tests
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 0e1c8e8. Configure here.
|
|
||
| actual_fs = filesystem | ||
| if isinstance(actual_fs, RetryingPyFileSystem): | ||
| actual_fs = actual_fs.unwrap() |
There was a problem hiding this comment.
Redundant unwrapping duplicates earlier computation
Low Severity
The actual_fs computation on lines 206–208 exactly duplicates the unwrapped computation on lines 191–195 — both unwrap RetryingPyFileSystem to get the inner filesystem. After both run, actual_fs always equals unwrapped. Having two separate variables that track the same value introduces unnecessary complexity and a maintenance risk: if one unwrapping path is updated but the other is forgotten, a subtle bug would be introduced. The code could simply reuse unwrapped instead of re-deriving actual_fs from filesystem.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 0e1c8e8. Configure here.
…stem (ray-project#61850) ## Summary - Fix `_is_filesystem_compatible_with_scheme` to properly inspect the inner fsspec protocol when a filesystem is wrapped in `PyFileSystem`, preventing incorrect rejection of user-provided fsspec filesystems (e.g., s3fs for `s3://` URIs). - Upgrade OSError log level from `debug` to `warning` in `download_bytes_threaded` for better observability. - Add comprehensive unit tests for `_is_filesystem_compatible_with_scheme` covering native PyArrow filesystems, fsspec-wrapped filesystems with various protocols, and edge cases. ## Test plan - [x] Added `TestIsFilesystemCompatibleWithScheme` test class with tests for native S3/local filesystems, fsspec S3/GCS/HTTP wrappers, tuple protocols, scheme mismatches, bare paths, and unknown schemes. - [ ] Run existing Ray Data unit tests to verify no regressions. --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…stem (ray-project#61850) ## Summary - Fix `_is_filesystem_compatible_with_scheme` to properly inspect the inner fsspec protocol when a filesystem is wrapped in `PyFileSystem`, preventing incorrect rejection of user-provided fsspec filesystems (e.g., s3fs for `s3://` URIs). - Upgrade OSError log level from `debug` to `warning` in `download_bytes_threaded` for better observability. - Add comprehensive unit tests for `_is_filesystem_compatible_with_scheme` covering native PyArrow filesystems, fsspec-wrapped filesystems with various protocols, and edge cases. ## Test plan - [x] Added `TestIsFilesystemCompatibleWithScheme` test class with tests for native S3/local filesystems, fsspec S3/GCS/HTTP wrappers, tuple protocols, scheme mismatches, bare paths, and unknown schemes. - [ ] Run existing Ray Data unit tests to verify no regressions. --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>


Summary
_is_filesystem_compatible_with_schemeto properly inspect the inner fsspec protocol when a filesystem is wrapped inPyFileSystem, preventing incorrect rejection of user-provided fsspec filesystems (e.g., s3fs fors3://URIs).debugtowarningindownload_bytes_threadedfor better observability._is_filesystem_compatible_with_schemecovering native PyArrow filesystems, fsspec-wrapped filesystems with various protocols, and edge cases.Test plan
TestIsFilesystemCompatibleWithSchemetest class with tests for native S3/local filesystems, fsspec S3/GCS/HTTP wrappers, tuple protocols, scheme mismatches, bare paths, and unknown schemes.