[Data] Remove obsolete PyArrow 9.0 version checks#61483
Conversation
Since Ray's minimum PyArrow version is >= 9.0.0, version checks for object extension type support are no longer needed. Changes: - Remove MIN_PYARROW_VERSION_SCALAR_SUBCLASS and _object_extension_type_allowed() - Simplify object fallback logic to only use DataContext.enable_fallback_to_arrow_object_ext_type - Fix PYARROW_VERSION import in arrow_ops/transform_pyarrow.py by defining it locally - Remove version-based test skips - Improve test accuracy: dicts use Arrow struct type, not pandas fallback This cleanup removes ~80 lines of dead code and improves maintainability. Signed-off-by: slfan1989 <slfan1989@apache.org>
There was a problem hiding this comment.
Code Review
This pull request effectively removes obsolete PyArrow 9.0 version checks, which simplifies the codebase as the minimum supported version now includes the necessary features. The changes are well-executed, removing redundant logic and constants, and updating tests to reflect the new baseline behavior. I've noticed one minor opportunity for improvement regarding code duplication in the tests, which I've commented on. Overall, this is a great cleanup that improves code clarity and maintainability.
I am having trouble creating individual review comments. Click here to see my feedback.
python/ray/data/tests/test_transform_pyarrow.py (50-51)
This UnsupportedType class is removed from the module scope but then re-defined in both test_map_batches_fallback_to_pandas_on_incompatible_data and test_map_raises_on_incompatible_data. To avoid this code duplication, it would be better to keep the class definition at the module level and reuse it in both tests. This will improve code maintainability.
| # an error during block construction because we cannot cast dicts | ||
| # to a supported arrow type. This test checks that the block | ||
| # construction falls back to pandas and still succeeds. | ||
| def test_dict_and_none_use_arrow_block(ray_start_regular_shared, restore_data_context): |
There was a problem hiding this comment.
I didn't understand this name. I think this test is more like "Test that batches of dicts are stored as a structs in PyArrow tables". What does none allude to?
| # Dicts are represented as Arrow struct types, so the block should remain Arrow | ||
| # even if object-extension fallback is disabled. | ||
| DataContext.get_current().enable_fallback_to_arrow_object_ext_type = False |
There was a problem hiding this comment.
What happens if we remove this? Even if if we use the object extension type, won't the struct assertion fail?
## Description Remove obsolete PyArrow version checks for object extension type support. Since Ray's minimum PyArrow version is now >= 9.0.0 (and object extension type was introduced in PyArrow 9.0.0), the version gating logic is redundant and always evaluates to true. This PR simplifies the codebase by: - Removing `MIN_PYARROW_VERSION_SCALAR_SUBCLASS` constant and `_object_extension_type_allowed()` function - Simplifying object fallback logic to only check `DataContext.enable_fallback_to_arrow_object_ext_type` config - Removing version-based test skips - Fixing `PYARROW_VERSION` import issue in `transform_pyarrow.py` ## Related issues > Link related issues: "Fixes ray-project#61482", "Closes ray-project#61482", or "Related to ray-project#61482". ## Additional information **Key changes:** 1. **Core logic simplification** (`tensor_extensions/arrow.py`): - Removed nested version check conditions - Fallback behavior now controlled solely by DataContext config - Updated log messages to reflect simplified logic 2. **Import fix** (`arrow_ops/transform_pyarrow.py`): - Removed `PYARROW_VERSION` from imports (was causing circular dependency) - Defined locally as `PYARROW_VERSION = get_pyarrow_version()` 3. **Test improvements**: - Removed `@pytest.mark.skipif(_object_extension_type_allowed(), ...)` decorators - Fixed test assumption: `test_dict_and_none_use_arrow_block` now correctly expects Arrow blocks with struct types, not pandas fallback - Split `test_fallback_to_pandas_on_incompatible_data` into separate tests for `map` and `map_batches` to clarify different behaviors Signed-off-by: slfan1989 <slfan1989@apache.org>
## Description Remove obsolete PyArrow version checks for object extension type support. Since Ray's minimum PyArrow version is now >= 9.0.0 (and object extension type was introduced in PyArrow 9.0.0), the version gating logic is redundant and always evaluates to true. This PR simplifies the codebase by: - Removing `MIN_PYARROW_VERSION_SCALAR_SUBCLASS` constant and `_object_extension_type_allowed()` function - Simplifying object fallback logic to only check `DataContext.enable_fallback_to_arrow_object_ext_type` config - Removing version-based test skips - Fixing `PYARROW_VERSION` import issue in `transform_pyarrow.py` ## Related issues > Link related issues: "Fixes ray-project#61482", "Closes ray-project#61482", or "Related to ray-project#61482". ## Additional information **Key changes:** 1. **Core logic simplification** (`tensor_extensions/arrow.py`): - Removed nested version check conditions - Fallback behavior now controlled solely by DataContext config - Updated log messages to reflect simplified logic 2. **Import fix** (`arrow_ops/transform_pyarrow.py`): - Removed `PYARROW_VERSION` from imports (was causing circular dependency) - Defined locally as `PYARROW_VERSION = get_pyarrow_version()` 3. **Test improvements**: - Removed `@pytest.mark.skipif(_object_extension_type_allowed(), ...)` decorators - Fixed test assumption: `test_dict_and_none_use_arrow_block` now correctly expects Arrow blocks with struct types, not pandas fallback - Split `test_fallback_to_pandas_on_incompatible_data` into separate tests for `map` and `map_batches` to clarify different behaviors Signed-off-by: slfan1989 <slfan1989@apache.org>
Description
Remove obsolete PyArrow version checks for object extension type support. Since Ray's minimum PyArrow version is now >= 9.0.0 (and object extension type was introduced in PyArrow 9.0.0), the version gating logic is redundant and always evaluates to true.
This PR simplifies the codebase by:
MIN_PYARROW_VERSION_SCALAR_SUBCLASSconstant and_object_extension_type_allowed()functionDataContext.enable_fallback_to_arrow_object_ext_typeconfigPYARROW_VERSIONimport issue intransform_pyarrow.pyRelated issues
Additional information
Key changes:
Core logic simplification (
tensor_extensions/arrow.py):Import fix (
arrow_ops/transform_pyarrow.py):PYARROW_VERSIONfrom imports (was causing circular dependency)PYARROW_VERSION = get_pyarrow_version()Test improvements:
@pytest.mark.skipif(_object_extension_type_allowed(), ...)decoratorstest_dict_and_none_use_arrow_blocknow correctly expects Arrow blocks with struct types, not pandas fallbacktest_fallback_to_pandas_on_incompatible_datainto separate tests formapandmap_batchesto clarify different behaviors