Skip to content

[data] V1 _split_predicate_by_columns correctness fix#63176

Merged
goutamvenkat-anyscale merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:pr63176
May 8, 2026
Merged

[data] V1 _split_predicate_by_columns correctness fix#63176
goutamvenkat-anyscale merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:pr63176

Conversation

@goutamvenkat-anyscale

@goutamvenkat-anyscale goutamvenkat-anyscale commented May 7, 2026

Copy link
Copy Markdown
Contributor

Independent V1 bug fix surfaced during V2 work. _split_predicate_by_columns
previously returned (None, None) for unsplittable mixed conjuncts inside
an enclosing AND chain, which let V1 push the splittable parts and silently
over-include rows. Extend _SplitPredicateResult with
residual_predicate so V1 keeps the surrounding Filter intact.

  • _SplitPredicateResult gains residual_predicate.
  • _split_predicate_by_columns reworked to walk the top-level AND
    chain and classify each conjunct into data / partition / residual
    buckets. Combining the three with AND reproduces the original
    predicate exactly.
  • V1 ParquetDatasource.apply_predicate returns self when
    residual_predicate is non-None so PredicatePushdown keeps
    the Filter above the read.
  • New unit-test cases for the unsplittable-mixed-conjunct path.

Signed-off-by: Goutam goutam@anyscale.com
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


Stack created with Sapling. Best reviewed with ReviewStack.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a "residual_predicate" to the "_SplitPredicateResult" in "parquet_datasource.py" to handle predicates that cannot be safely split between data and partition columns, such as "OR" operations involving both. This change ensures that unsplittable filters are not silently dropped, preventing the over-inclusion of rows during data loading. Corresponding unit tests were updated to validate the handling of these residual predicates. The reviewer suggested using explicit "is not None" checks instead of truthiness checks for "Expr" objects in the "combine_predicates" helper function to improve robustness and adhere to PEP 8 best practices.

Comment on lines 303 to 308
def combine_predicates(
left: Optional[Expr], right: Optional[Expr]
) -> Optional[Expr]:
if left and right:
return left & right
return left or right

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using truthiness checks (if left and right and left or right) for Expr objects is less robust than explicit None checks. While Expr objects are currently truthy, it is a best practice per PEP 8 to use is not None when checking for optional values to avoid potential issues if __bool__ is ever implemented for these objects. Additionally, using left or right to pick the non-None value is slightly less clear than an explicit conditional expression.

Suggested change
def combine_predicates(
left: Optional[Expr], right: Optional[Expr]
) -> Optional[Expr]:
if left and right:
return left & right
return left or right
def combine_predicates(
left: Optional[Expr], right: Optional[Expr]
) -> Optional[Expr]:
if left is not None and right is not None:
return left & right
return left if left is not None else right

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 01d80e8. Configure here.

Comment thread python/ray/data/_internal/datasource/parquet_datasource.py
Independent V1 bug fix surfaced during V2 work. ``_split_predicate_by_columns``
previously returned ``(None, None)`` for unsplittable mixed conjuncts inside
an enclosing AND chain, which let V1 push the splittable parts and silently
over-include rows. Extend ``_SplitPredicateResult`` with
``residual_predicate`` so V1 keeps the surrounding ``Filter`` intact.

- ``_SplitPredicateResult`` gains ``residual_predicate``.
- ``_split_predicate_by_columns`` reworked to walk the top-level AND
  chain and classify each conjunct into data / partition / residual
  buckets. Combining the three with AND reproduces the original
  predicate exactly.
- V1 ``ParquetDatasource.apply_predicate`` returns ``self`` when
  ``residual_predicate`` is non-None so ``PredicatePushdown`` keeps
  the ``Filter`` above the read.
- New unit-test cases for the unsplittable-mixed-conjunct path.

Signed-off-by: Goutam <goutam@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label May 7, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label May 7, 2026
# For OR, NOT, or other operations with mixed columns,
# we can't safely split - must evaluate the full predicate together
return _SplitPredicateResult(data_predicate=None, partition_predicate=None)
# ``OR``/``NOT``/etc. straddling both column kinds — not safely

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to have a bigger list of what is and what's not splittable, since it's not clear to me why NOT can't be done. Let be A be a data column, B be a partition column

A & B
NOT (A & B)
A | B
NOT (A | B)
NOT A | B
NOT A & B
...
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 189b7a9 into ray-project:master May 8, 2026
10 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the pr63176 branch May 8, 2026 23:11
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
)

Independent V1 bug fix surfaced during V2 work.
``_split_predicate_by_columns``
previously returned ``(None, None)`` for unsplittable mixed conjuncts
inside
an enclosing AND chain, which let V1 push the splittable parts and
silently
over-include rows. Extend ``_SplitPredicateResult`` with
``residual_predicate`` so V1 keeps the surrounding ``Filter`` intact.

- ``_SplitPredicateResult`` gains ``residual_predicate``.
- ``_split_predicate_by_columns`` reworked to walk the top-level AND
  chain and classify each conjunct into data / partition / residual
  buckets. Combining the three with AND reproduces the original
  predicate exactly.
- V1 ``ParquetDatasource.apply_predicate`` returns ``self`` when
  ``residual_predicate`` is non-None so ``PredicatePushdown`` keeps
  the ``Filter`` above the read.
- New unit-test cases for the unsplittable-mixed-conjunct path.

Signed-off-by: Goutam <goutam@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/ray-project/ray/pull/63176).
* ray-project#63163
* __->__ ray-project#63176

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Goutam V. <>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants