[data] V1 _split_predicate_by_columns correctness fix by goutamvenkat-anyscale · Pull Request #63176 · ray-project/ray

goutamvenkat-anyscale · 2026-05-07T00:48:10Z

Independent V1 bug fix surfaced during V2 work. _split_predicate_by_columns
previously returned (None, None) for unsplittable mixed conjuncts inside
an enclosing AND chain, which let V1 push the splittable parts and silently
over-include rows. Extend _SplitPredicateResult with
residual_predicate so V1 keeps the surrounding Filter intact.

_SplitPredicateResult gains residual_predicate.
_split_predicate_by_columns reworked to walk the top-level AND
chain and classify each conjunct into data / partition / residual
buckets. Combining the three with AND reproduces the original
predicate exactly.
V1 ParquetDatasource.apply_predicate returns self when
residual_predicate is non-None so PredicatePushdown keeps
the Filter above the read.
New unit-test cases for the unsplittable-mixed-conjunct path.

Signed-off-by: Goutam goutam@anyscale.com
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Stack created with Sapling. Best reviewed with ReviewStack.

gemini-code-assist

Code Review

This pull request introduces a "residual_predicate" to the "_SplitPredicateResult" in "parquet_datasource.py" to handle predicates that cannot be safely split between data and partition columns, such as "OR" operations involving both. This change ensures that unsplittable filters are not silently dropped, preventing the over-inclusion of rows during data loading. Corresponding unit tests were updated to validate the handling of these residual predicates. The reviewer suggested using explicit "is not None" checks instead of truthiness checks for "Expr" objects in the "combine_predicates" helper function to improve robustness and adhere to PEP 8 best practices.

gemini-code-assist · 2026-05-07T00:50:56Z

        def combine_predicates(
            left: Optional[Expr], right: Optional[Expr]
        ) -> Optional[Expr]:
            if left and right:
                return left & right
            return left or right


Using truthiness checks (if left and right and left or right) for Expr objects is less robust than explicit None checks. While Expr objects are currently truthy, it is a best practice per PEP 8 to use is not None when checking for optional values to avoid potential issues if __bool__ is ever implemented for these objects. Additionally, using left or right to pick the non-None value is slightly less clear than an explicit conditional expression.

Suggested change

def combine_predicates(

left: Optional[Expr], right: Optional[Expr]

) -> Optional[Expr]:

if left and right:

return left & right

return left or right

def combine_predicates(

left: Optional[Expr], right: Optional[Expr]

) -> Optional[Expr]:

if left is not None and right is not None:

return left & right

return left if left is not None else right

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 01d80e8. Configure here.}

Independent V1 bug fix surfaced during V2 work. ``_split_predicate_by_columns`` previously returned ``(None, None)`` for unsplittable mixed conjuncts inside an enclosing AND chain, which let V1 push the splittable parts and silently over-include rows. Extend ``_SplitPredicateResult`` with ``residual_predicate`` so V1 keeps the surrounding ``Filter`` intact. - ``_SplitPredicateResult`` gains ``residual_predicate``. - ``_split_predicate_by_columns`` reworked to walk the top-level AND chain and classify each conjunct into data / partition / residual buckets. Combining the three with AND reproduces the original predicate exactly. - V1 ``ParquetDatasource.apply_predicate`` returns ``self`` when ``residual_predicate`` is non-None so ``PredicatePushdown`` keeps the ``Filter`` above the read. - New unit-test cases for the unsplittable-mixed-conjunct path. Signed-off-by: Goutam <goutam@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

iamjustinhsu · 2026-05-08T22:02:40Z

-    # For OR, NOT, or other operations with mixed columns,
-    # we can't safely split - must evaluate the full predicate together
-    return _SplitPredicateResult(data_predicate=None, partition_predicate=None)
+    # ``OR``/``NOT``/etc. straddling both column kinds — not safely


I think it would be good to have a bigger list of what is and what's not splittable, since it's not clear to me why NOT can't be done. Let be A be a data column, B be a partition column

A & B NOT (A & B) A | B NOT (A | B) NOT A | B NOT A & B ...

) Independent V1 bug fix surfaced during V2 work. ``_split_predicate_by_columns`` previously returned ``(None, None)`` for unsplittable mixed conjuncts inside an enclosing AND chain, which let V1 push the splittable parts and silently over-include rows. Extend ``_SplitPredicateResult`` with ``residual_predicate`` so V1 keeps the surrounding ``Filter`` intact. - ``_SplitPredicateResult`` gains ``residual_predicate``. - ``_split_predicate_by_columns`` reworked to walk the top-level AND chain and classify each conjunct into data / partition / residual buckets. Combining the three with AND reproduces the original predicate exactly. - V1 ``ParquetDatasource.apply_predicate`` returns ``self`` when ``residual_predicate`` is non-None so ``PredicatePushdown`` keeps the ``Filter`` above the read. - New unit-test cases for the unsplittable-mixed-conjunct path. Signed-off-by: Goutam <goutam@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/ray-project/ray/pull/63176). * ray-project#63163 * __->__ ray-project#63176 Signed-off-by: Goutam <goutam@anyscale.com> Co-authored-by: Goutam V. <> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

goutamvenkat-anyscale requested a review from a team as a code owner May 7, 2026 00:48

goutamvenkat-anyscale mentioned this pull request May 7, 2026

[data] DataSourceV2: V2 predicate splitting + consistency fixes #63163

Merged

4 tasks

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

cursor Bot reviewed May 7, 2026

View reviewed changes

Comment thread python/ray/data/_internal/datasource/parquet_datasource.py

goutamvenkat-anyscale force-pushed the pr63176 branch from 01d80e8 to fae8dbe Compare May 7, 2026 01:22

ray-gardener Bot added the data Ray Data-related issues label May 7, 2026

goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label May 7, 2026

iamjustinhsu approved these changes May 8, 2026

View reviewed changes

goutamvenkat-anyscale merged commit 189b7a9 into ray-project:master May 8, 2026
10 checks passed

goutamvenkat-anyscale deleted the pr63176 branch May 8, 2026 23:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[data] V1 _split_predicate_by_columns correctness fix#63176

[data] V1 _split_predicate_by_columns correctness fix#63176
goutamvenkat-anyscale merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:pr63176

goutamvenkat-anyscale commented May 7, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

gemini-code-assist Bot May 7, 2026

cursor Bot left a comment

Uh oh!

iamjustinhsu May 8, 2026

Uh oh!

Labels

2 participants

Uh oh!

Conversation

goutamvenkat-anyscale commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist Bot May 7, 2026

Choose a reason for hiding this comment

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

iamjustinhsu May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Labels

2 participants

goutamvenkat-anyscale commented May 7, 2026 •

edited

Loading