Skip to content

[Data] Honor compute= in filter(expr=...), deprecate concurrency=#63576

Merged
goutamvenkat-anyscale merged 2 commits into
ray-project:masterfrom
goutamvenkat-anyscale:fix-filter-expr-compute-arg
May 22, 2026
Merged

[Data] Honor compute= in filter(expr=...), deprecate concurrency=#63576
goutamvenkat-anyscale merged 2 commits into
ray-project:masterfrom
goutamvenkat-anyscale:fix-filter-expr-compute-arg

Conversation

@goutamvenkat-anyscale

Copy link
Copy Markdown
Contributor

Summary

  • Dataset.filter(expr=...) built its compute strategy as TaskPoolStrategy(size=concurrency), silently ignoring any compute= argument the caller passed. This broke fusion with downstream operators that set their compute strategy explicitly (Task→Task fusion requires matching size per operator_fusion._fuse_compute_strategy).
  • Route the expr branch through get_compute_strategy(fn=None, ...) so it follows the same resolution rules as map, map_batches, flat_map, and the fn branch of filter: compute= takes precedence, concurrency= remains a deprecated fallback that emits the standard warning, and both unset falls back to TaskPoolStrategy(). Predicate expressions also reject ActorPoolStrategy, mirroring the existing guardrail for regular functions (both are stateless).

Repro

ds = ds.filter(expr=col("bytes").is_not_null(), compute=TaskPoolStrategy(size=20))
ds = ds.with_column("norm_image", udf_fn(col("bytes")), compute=TaskPoolStrategy(size=20))

Before: the filter's compute= was discarded → upstream size=None, downstream size=20 → fusion blocked.
After: both ops carry size=20Filter(...)->Project appears as a single fused operator.

Test plan

  • pytest python/ray/data/tests/test_filter.py — 43 passed, including 4 new tests:
    • test_filter_expr_compute_resolution (parametrized): compute= and concurrency= both round-trip into Filter.compute.
    • test_filter_expr_compute_honors_taskpoolstrategy: explicit TaskPoolStrategy(size=20) is preserved.
    • test_filter_expr_concurrency_is_deprecated: legacy concurrency= still works but logs the deprecation warning.
    • test_filter_expr_rejects_actor_pool: passing ActorPoolStrategy raises (same as map(fn=lambda r: r, compute=ActorPoolStrategy(...))).
  • Lint clean (pre-commit run --files ...).

🤖 Generated with Claude Code

The `filter(expr=...)` path built its compute strategy as
`TaskPoolStrategy(size=concurrency)`, silently ignoring any `compute=`
argument. This broke fusion with downstream operators that set their
compute strategy explicitly (Task->Task fusion requires matching sizes
per `operator_fusion._fuse_compute_strategy`).

Route the expr branch through `get_compute_strategy` so it follows the
same resolution rules as `map`, `map_batches`, `flat_map`, and the
`fn` branch of `filter`: `compute=` takes precedence, `concurrency=`
remains as a deprecated fallback that emits the standard warning, and
both unset falls back to `TaskPoolStrategy()`. Stateless predicate
expressions also reject `ActorPoolStrategy`, mirroring the existing
guardrail for regular functions.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner May 21, 2026 16:52
@goutamvenkat-anyscale goutamvenkat-anyscale added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels May 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the filter operation in Ray Data to properly handle compute strategies when predicate expressions are used. It replaces a hardcoded TaskPoolStrategy with get_compute_strategy, allowing the compute parameter to be honored and maintaining concurrency as a deprecated fallback. The PR also includes a suite of tests to ensure correct strategy resolution and proper error handling for incompatible strategies like ActorPoolStrategy. I have no feedback to provide.

@richardliaw richardliaw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we previously just ignored the param? lol

@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 918d003 into ray-project:master May 22, 2026
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the fix-filter-expr-compute-arg branch May 22, 2026 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants