[data] V1 _split_predicate_by_columns correctness fix#63176
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a "residual_predicate" to the "_SplitPredicateResult" in "parquet_datasource.py" to handle predicates that cannot be safely split between data and partition columns, such as "OR" operations involving both. This change ensures that unsplittable filters are not silently dropped, preventing the over-inclusion of rows during data loading. Corresponding unit tests were updated to validate the handling of these residual predicates. The reviewer suggested using explicit "is not None" checks instead of truthiness checks for "Expr" objects in the "combine_predicates" helper function to improve robustness and adhere to PEP 8 best practices.
| def combine_predicates( | ||
| left: Optional[Expr], right: Optional[Expr] | ||
| ) -> Optional[Expr]: | ||
| if left and right: | ||
| return left & right | ||
| return left or right |
There was a problem hiding this comment.
Using truthiness checks (if left and right and left or right) for Expr objects is less robust than explicit None checks. While Expr objects are currently truthy, it is a best practice per PEP 8 to use is not None when checking for optional values to avoid potential issues if __bool__ is ever implemented for these objects. Additionally, using left or right to pick the non-None value is slightly less clear than an explicit conditional expression.
| def combine_predicates( | |
| left: Optional[Expr], right: Optional[Expr] | |
| ) -> Optional[Expr]: | |
| if left and right: | |
| return left & right | |
| return left or right | |
| def combine_predicates( | |
| left: Optional[Expr], right: Optional[Expr] | |
| ) -> Optional[Expr]: | |
| if left is not None and right is not None: | |
| return left & right | |
| return left if left is not None else right |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 01d80e8. Configure here.
Independent V1 bug fix surfaced during V2 work. ``_split_predicate_by_columns`` previously returned ``(None, None)`` for unsplittable mixed conjuncts inside an enclosing AND chain, which let V1 push the splittable parts and silently over-include rows. Extend ``_SplitPredicateResult`` with ``residual_predicate`` so V1 keeps the surrounding ``Filter`` intact. - ``_SplitPredicateResult`` gains ``residual_predicate``. - ``_split_predicate_by_columns`` reworked to walk the top-level AND chain and classify each conjunct into data / partition / residual buckets. Combining the three with AND reproduces the original predicate exactly. - V1 ``ParquetDatasource.apply_predicate`` returns ``self`` when ``residual_predicate`` is non-None so ``PredicatePushdown`` keeps the ``Filter`` above the read. - New unit-test cases for the unsplittable-mixed-conjunct path. Signed-off-by: Goutam <goutam@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
01d80e8 to
fae8dbe
Compare
| # For OR, NOT, or other operations with mixed columns, | ||
| # we can't safely split - must evaluate the full predicate together | ||
| return _SplitPredicateResult(data_predicate=None, partition_predicate=None) | ||
| # ``OR``/``NOT``/etc. straddling both column kinds — not safely |
There was a problem hiding this comment.
I think it would be good to have a bigger list of what is and what's not splittable, since it's not clear to me why NOT can't be done. Let be A be a data column, B be a partition column
A & B
NOT (A & B)
A | B
NOT (A | B)
NOT A | B
NOT A & B
...) Independent V1 bug fix surfaced during V2 work. ``_split_predicate_by_columns`` previously returned ``(None, None)`` for unsplittable mixed conjuncts inside an enclosing AND chain, which let V1 push the splittable parts and silently over-include rows. Extend ``_SplitPredicateResult`` with ``residual_predicate`` so V1 keeps the surrounding ``Filter`` intact. - ``_SplitPredicateResult`` gains ``residual_predicate``. - ``_split_predicate_by_columns`` reworked to walk the top-level AND chain and classify each conjunct into data / partition / residual buckets. Combining the three with AND reproduces the original predicate exactly. - V1 ``ParquetDatasource.apply_predicate`` returns ``self`` when ``residual_predicate`` is non-None so ``PredicatePushdown`` keeps the ``Filter`` above the read. - New unit-test cases for the unsplittable-mixed-conjunct path. Signed-off-by: Goutam <goutam@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/ray-project/ray/pull/63176). * ray-project#63163 * __->__ ray-project#63176 Signed-off-by: Goutam <goutam@anyscale.com> Co-authored-by: Goutam V. <> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Independent V1 bug fix surfaced during V2 work.
_split_predicate_by_columnspreviously returned
(None, None)for unsplittable mixed conjuncts insidean enclosing AND chain, which let V1 push the splittable parts and silently
over-include rows. Extend
_SplitPredicateResultwithresidual_predicateso V1 keeps the surroundingFilterintact._SplitPredicateResultgainsresidual_predicate._split_predicate_by_columnsreworked to walk the top-level ANDchain and classify each conjunct into data / partition / residual
buckets. Combining the three with AND reproduces the original
predicate exactly.
ParquetDatasource.apply_predicatereturnsselfwhenresidual_predicateis non-None soPredicatePushdownkeepsthe
Filterabove the read.Signed-off-by: Goutam goutam@anyscale.com
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Stack created with Sapling. Best reviewed with ReviewStack.