[Data][1/2] Schema inference for non black box UDF logical operators by goutamvenkat-anyscale · Pull Request #63387 · ray-project/ray

goutamvenkat-anyscale · 2026-05-15T23:48:59Z

Description

This PR teaches Ray Data's logical plan to infer output schemas for non-UDF operators before execution.

Previously, a typed pipeline could still make Dataset.schema() fall back to a limit(1) execution whenever an intermediate logical operator could not describe its output schema. With these overrides, ds.schema(fetch_if_missing=False) can resolve through typed pipelines made from projections, filters, shuffles/repartitioning, aggregate/count, union/mix, zip, join, download, write, and streaming split/repartition operators without sampling data.

This does not apply to black-box UDF transforms such as map, map_batches, and similar APIs. Expression UDF schemas still come from their declared return_dtype; this PR does not infer result types from UDF implementation code.

Related issues

N/A

Additional information

The main pieces are:

* Adds LogicalOperator.infer_schema() plus reusable mixins for operators that preserve or unify input schemas.
* Adds expression-level field resolution through Expr.to_field(), get_type(), and nullable(), including projection handling for *, aliases, renames, and upserts.
* Adds aggregate output fields for built-in aggregations such as count, sum, min, max, mean, std, and abs max.
* Reuses runtime table logic for zip and join schema inference on empty Arrow tables so inferred schemas match execution behavior.
* Covers the new behavior with 37 expression/unit cases and 18 integration cases that verify ds.schema(fetch_if_missing=False) resolves without execution.
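
To make the mixin idea concrete, here is a minimal toy sketch. It is not the code from this PR: schemas are plain dicts mapping column names to type strings, and every class name (`PreservesSchema`, `UnifiesInputSchemas`, `Read`, `BlackBoxMap`) is a hypothetical stand-in chosen for illustration. It only shows how an operator could delegate schema inference to its inputs and report "unknown" with `None`.

```python
from typing import Dict, List, Optional

Schema = Dict[str, str]  # toy schema: column name -> type name


class LogicalOp:
    """Toy logical operator (hypothetical, far simpler than Ray Data's)."""

    def __init__(self, *inputs: "LogicalOp") -> None:
        self.inputs: List["LogicalOp"] = list(inputs)

    def infer_schema(self) -> Optional[Schema]:
        # Default: unknown, so a caller would have to fall back to execution.
        return None


class Read(LogicalOp):
    """Source operator that already knows its schema."""

    def __init__(self, schema: Schema) -> None:
        super().__init__()
        self._schema = schema

    def infer_schema(self) -> Optional[Schema]:
        return dict(self._schema)


class PreservesSchema:
    """Mixin: output schema equals the single input's schema."""

    def infer_schema(self) -> Optional[Schema]:
        return self.inputs[0].infer_schema()


class UnifiesInputSchemas:
    """Mixin: output schema merges the columns of all inputs."""

    def infer_schema(self) -> Optional[Schema]:
        unified: Schema = {}
        for op in self.inputs:
            schema = op.infer_schema()
            if schema is None:  # one unknown input makes the result unknown
                return None
            for name, dtype in schema.items():
                # Toy rule (an assumption): refuse conflicting types outright.
                if unified.setdefault(name, dtype) != dtype:
                    return None
        return unified


class Filter(PreservesSchema, LogicalOp):
    pass


class Union(UnifiesInputSchemas, LogicalOp):
    pass


class BlackBoxMap(LogicalOp):
    """Stands in for a UDF transform whose output schema is not inferable."""
```

In this toy model, `Union(Filter(Read({"a": "int64", "b": "double"})), Read({"b": "double", "k": "string"}))` reports columns `a`, `b`, and `k`, while placing a `BlackBoxMap` anywhere in the chain makes the result `None`.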

gemini-code-assist

Code Review

This pull request implements static schema inference for Ray Data logical operators, enabling Dataset.schema() to resolve output schemas without plan execution. It introduces schema-related mixins and implements infer_schema for operators such as Project, Aggregate, Join, and Union, while also enhancing expression and aggregator classes to support type resolution. Feedback indicates that the Aggregate schema inference should be updated to handle multi-key groupings and that PyArrow schema truthiness checks should be replaced with length checks to correctly identify empty schemas.
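
The truthiness feedback can be illustrated without PyArrow. Python treats any object that defines `__len__` as false when its length is zero, so `if not schema` cannot distinguish "no schema" from "a schema with zero columns". The sketch below uses a hypothetical `ToySchema` class, on the assumption (implied by the review) that the real schema object reports its field count through `len()`.

```python
from typing import Optional, Sequence


class ToySchema:
    """Hypothetical stand-in for a schema object that supports len()."""

    def __init__(self, names: Sequence[str]) -> None:
        self.names = list(names)

    def __len__(self) -> int:
        return len(self.names)


def describe_truthy(schema: Optional[ToySchema]) -> str:
    # Pitfall: an empty schema is falsy, so it is reported as unknown.
    if not schema:
        return "unknown"
    return "known"


def describe_explicit(schema: Optional[ToySchema]) -> str:
    # Explicit checks keep "missing" and "empty" apart.
    if schema is None:
        return "unknown"
    if len(schema) == 0:
        return "known-empty"
    return "known"
```

With an empty `ToySchema([])`, `describe_truthy` answers "unknown" while `describe_explicit` answers "known-empty".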

goutamvenkat-anyscale · 2026-05-16T05:08:48Z

/gemini review

gemini-code-assist

Code Review

This pull request implements static schema inference for Ray Data logical operators, allowing Dataset.schema() to resolve output schemas without falling back to plan execution for non-UDF chains. The changes include refactoring join logic into reusable utilities, introducing schema-inference mixins for logical operators, and enhancing expressions and aggregators to provide static type information. A critical issue was identified regarding a missing functools import in map_operator.py, which would cause a NameError when accessing the cached schema property.

goutamvenkat-anyscale · 2026-05-20T08:16:10Z

-            right_keys=right_on,
-            left_suffix=self._left_columns_suffix,
-            right_suffix=self._right_columns_suffix,
+def join_tables(


These changes just pull the instance methods out.

Use https://app.semanticdiff.com/gh/ray-project/ray/pull/63387/changes#python/ray/data/_internal/execution/operators/join.py?ignore_comments=false to review the semantic diff

goutamvenkat-anyscale · 2026-05-20T09:28:52Z

/gemini review

gemini-code-assist

Code Review

This pull request implements static schema inference for Ray Data logical operators, enabling schema resolution without materializing blocks. It introduces mixin classes for schema propagation, adds infer_schema methods to various operators, and implements expression-level type resolution. Comprehensive tests are included to verify the new functionality. The review comment correctly identifies a missing functools import in python/ray/data/_internal/logical/operators/map_operator.py that needs to be addressed.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 7ae7fbfb02dc6343e829e84658c79a3f604fc615.

iamjustinhsu

didn't look through all of the files, but my main concern is this https://github.com/ray-project/ray/pull/63387/changes#r3284806824

iamjustinhsu · 2026-05-21T22:50:37Z

+        # (``PandasBlockSchema`` chains fall back to ``limit(1)`` execution.)
+        if not isinstance(input_schema, pa.Schema):
+            return None
+        fields = exprlist_to_fields(self.exprs, input_schema)


hmm not blocking but i think long-term we want a concept of resolved and unresolved expressions. Right now those expressions can be anything, and we'll need rules to convert those to resolved expressions. So let's say someone does this:

ds = (
    ray.data.read_parquet()  # Infer the schema on creation since we have no DSL/IR distinction
    .streaming_repartition()  # A new logical operator, no expressions
    .select_columns(...)  # Another logical operator, with an unresolved expression
)

Now, we do this:

# Now we should go through these steps:
# 1) This should go through all operators, and resolve their expressions recursively against the schema
#    This can lead to a resolved or unresolved (if child schemas unknown, or if there was an error)
#    The resolution should be rule-based, ie, one for star expressions, one for attributes, etc...
# 2) If step 1 returns an invalid schema or error, then fallback to limit(1)
ds.schema()

Now we do this:

# Since ds.schema() doesn't mutate the dataset, we still need to go through attribute resolution to get
# resolved and unresolved columns. Then we run logical optimizations, logical -> physical, physical optimization
ds.materialize()

What are ur thoughts on this? (i have a prototype here https://github.com/ray-project/ray/pull/59117/changes#diff-91eaab60fc55ba17ab52a7498bf46e64ef8630689880fbf72288b1f7a9d3d28bR1)

Ok I will add this Analyzer as a follow up after expanding stars. Did some more reading on datafusion. The rules like TypeCoercion etc. make it clean as to what's being resolved in the planning phase.
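
The resolve-then-fallback flow described in this thread can be sketched as follows. This is a toy model under stated assumptions, not the prototype linked above nor the planned Analyzer: expressions are tuples, schemas are dicts, and the names `RULES`, `resolve`, and `schema_with_fallback` are hypothetical. One rule exists per expression kind, and any failure yields `None` so the caller can fall back to sampling.

```python
from typing import Callable, Dict, List, Optional, Tuple

Schema = Dict[str, str]            # toy schema: column name -> type name
Expr = Tuple[str, ...]             # ("star",) or ("col", name)
Fields = List[Tuple[str, str]]


def resolve_star(expr: Expr, schema: Schema) -> Optional[Fields]:
    # A star expands to every column of the child schema, in order.
    return list(schema.items())


def resolve_col(expr: Expr, schema: Schema) -> Optional[Fields]:
    name = expr[1]
    if name not in schema:
        return None  # unresolved attribute
    return [(name, schema[name])]


# One rule per expression kind, as suggested in the thread.
RULES: Dict[str, Callable[[Expr, Schema], Optional[Fields]]] = {
    "star": resolve_star,
    "col": resolve_col,
}


def resolve(exprs: List[Expr], schema: Optional[Schema]) -> Optional[Schema]:
    if schema is None:  # child schema unknown -> unresolved
        return None
    out: Schema = {}
    for expr in exprs:
        rule = RULES.get(expr[0])
        fields = rule(expr, schema) if rule is not None else None
        if fields is None:
            return None
        out.update(fields)
    return out


def schema_with_fallback(
    exprs: List[Expr],
    child_schema: Optional[Schema],
    sample: Callable[[], Schema],
) -> Schema:
    # Step 1: static resolution. Step 2: fall back (e.g. a limit(1) run).
    resolved = resolve(exprs, child_schema)
    return resolved if resolved is not None else sample()
```

The `sample` callable stands in for the `limit(1)` execution; it is only invoked when static resolution fails.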

Add ``LogicalOperator.infer_schema()`` overrides to every non-UDF operator so ``Dataset.schema()`` resolves typed pipelines without falling back to a ``limit(1)`` execution.

Expressions:
* New ``Expr.to_field(schema)`` / ``get_type(schema)`` / ``nullable(schema)`` API. Default delegates to ``data_type.to_arrow_dtype()`` (covers ``LiteralExpr``, ``UDFExpr``, ``DownloadExpr``, ``MonoIncId``, ``RandomExpr``, ``UUIDExpr``); schema-dependent subclasses override.
* ``BinaryExpr``/``UnaryExpr`` type promotion via PyArrow compute kernels on empty arrays (same kernels the runtime uses).
* ``exprlist_to_fields`` helper expands ``StarExpr`` inline.

Logical operators:
* ``LogicalOperatorPreservesSchema`` mixin: ``Filter``, ``Sort``, ``Limit``, ``Repartition``, ``RandomShuffle``, ``RandomizeBlocks``, ``StreamingRepartition``, ``StreamingSplit``, ``Write``.
* ``LogicalOperatorUnifiesInputSchemas`` mixin: ``Union``, ``Mix``.
* ``Project.infer_schema()`` via ``exprlist_to_fields``.
* ``Aggregate.infer_schema()`` via ``AggregateFn.output_field``; implemented on ``Count``, ``Sum``, ``Min``, ``Max``, ``Mean``, ``Std``, ``AbsMax``.
* ``Zip.infer_schema()`` reuses ``BlockAccessor.zip`` on empty tables.
* ``Join.infer_schema()`` reuses the new shared ``join_tables`` utility extracted from ``JoiningAggregation.finalize``.
* ``Count.infer_schema()``, ``Download.infer_schema()``.

Tests: 37 unit + 18 integration verifying ``ds.schema(fetch_if_missing=False)`` resolves typed chains without execution.

Co-authored-by: Cursor <cursoragent@cursor.com>

Signed-off-by: Goutam <goutam@anyscale.com>

iamjustinhsu · 2026-05-29T00:01:33Z

+        A list of ``pa.Field`` in projection order, or ``None`` if
+        any expression is unresolvable.
+    """
+    if input_schema is None:


from type annotation, seems like it can't be Optional?

map_operator.py has this check, I can nuke this line.

iamjustinhsu · 2026-05-29T02:32:58Z

+
+    # Any rename whose source isn't in ``input_schema`` falls through
+    # here and will fail resolution -> None, matching the runtime's
+    # "column not found" error.
+    for expr in (*rename_by_source_name.values(), *non_rename_exprs):
+        if not _resolve_and_upsert(expr):
+            return None


wait so this code will also run if has_star is True? I'm under the impression that the previous if block will handle that case

Yes but they handle 2 different cases. The 1st loop is resolving and upserting the input schema. The 2nd loop is handling the other expressions in the list.

Take this example:

[star(), (a+b).alias("sum")]:
- Loop 1 emits a, b (from input_schema), and never executes line 1890 (no renames).
- Loop 2 iterates [] + [(a+b).alias("sum")] → appends sum.
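
The two-loop structure described in this reply can be sketched with a toy model. This is a guess at the shape of the logic based only on the snippet and explanation above, not the PR's actual `exprlist_to_fields`: the `Star`, `Rename`, and `Add` classes, the dict-based schema, and the type-promotion rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Schema = Dict[str, str]  # toy schema: column name -> type name


@dataclass(frozen=True)
class Star:
    pass


@dataclass(frozen=True)
class Rename:
    source: str
    target: str


@dataclass(frozen=True)
class Add:
    left: str
    right: str
    alias: str


Expr = Union[Star, Rename, Add]


def project_fields(exprs: List[Expr], schema: Schema) -> Optional[List[Tuple[str, str]]]:
    out: Dict[str, str] = {}  # insertion-ordered; re-assigning a key upserts

    def resolve_and_upsert(expr: Expr) -> bool:
        if isinstance(expr, Rename):
            if expr.source not in schema:
                return False  # mirrors a "column not found" failure
            out[expr.target] = schema[expr.source]
            return True
        if isinstance(expr, Add):
            if expr.left not in schema or expr.right not in schema:
                return False
            # Toy promotion rule (assumption): any double operand wins.
            types = {schema[expr.left], schema[expr.right]}
            out[expr.alias] = "double" if "double" in types else "int64"
            return True
        return False

    renames = {e.source: e for e in exprs if isinstance(e, Rename)}
    others = [e for e in exprs if isinstance(e, Add)]

    # Loop 1: expand the star over the input schema, applying renames in place.
    if any(isinstance(e, Star) for e in exprs):
        for name, dtype in schema.items():
            rename = renames.pop(name, None)
            if rename is not None:
                resolve_and_upsert(rename)
            else:
                out[name] = dtype

    # Loop 2: leftover renames (not consumed by loop 1) plus all other expressions.
    for expr in (*renames.values(), *others):
        if not resolve_and_upsert(expr):
            return None
    return list(out.items())
```

With the input schema `{"a": "int64", "b": "double"}`, the list `[Star(), Add("a", "b", "sum")]` goes through both loops: the first emits `a` and `b`, the second appends `sum`.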

iamjustinhsu · 2026-05-29T02:34:06Z

+    op_fn = _ARROW_EXPR_OPS_MAP.get(op)
+    if op_fn is None:
+        return None


should this be an assertion?

Yea technically we can't evaluate if the operation isn't supported

iamjustinhsu · 2026-05-29T19:04:05Z

+            .with_column("s", col("a") + col("b"))
+            .groupby("k")
+            .aggregate(Sum("a"), Mean("b"))
+            .sort("k")


i forget -- does "s" get propagated?

No, because of the groupby().agg(), only k, sum(a), mean(b) will stay post-execution.
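
A toy sketch of why "s" disappears: an aggregate's output schema is built only from the group keys and the aggregation outputs, so any other input column is dropped. The function name `aggregate_schema`, the dict-based schema, and the type rules here are assumptions for illustration and do not reproduce `Aggregate.infer_schema()`.

```python
from typing import Dict, List, Optional, Tuple

Schema = Dict[str, str]  # toy schema: column name -> type name


def aggregate_schema(
    input_schema: Schema,
    keys: List[str],
    aggs: List[Tuple[str, str]],  # (function name, target column)
) -> Optional[Schema]:
    out: Schema = {}
    # Group keys come first and keep their input types.
    for key in keys:
        if key not in input_schema:
            return None
        out[key] = input_schema[key]
    # Each aggregation contributes exactly one output field.
    for fn, column in aggs:
        if column not in input_schema:
            return None
        if fn == "mean":
            dtype = "double"              # toy rule: mean is always floating point
        elif fn == "count":
            dtype = "int64"
        else:
            dtype = input_schema[column]  # toy rule: sum/min/max keep the input type
        out[f"{fn}({column})"] = dtype
    return out
```

Applied to an input with columns `k`, `a`, `b`, and `s`, grouping by `k` with `sum(a)` and `mean(b)` yields only `k`, `sum(a)`, and `mean(b)`. Passing several names in `keys` covers multi-key groupings in the same way.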

iamjustinhsu · 2026-05-29T19:06:05Z

+        ds_a = ray.data.read_parquet(str(parquet_path))
+        ds_b = ray.data.read_parquet(str(parquet_path))
+        ds = ds_a.union(ds_b)


u should probably do a select(a,b) for ds_a, and select(b,k) for ds_b, so that u see the union?

iamjustinhsu · 2026-05-29T19:09:58Z

@@ -332,6 +344,34 @@ def _validate(self, schema: Optional["Schema"]) -> None:
            SortKey(self._target_col_name).validate_schema(schema)


+def _agg_output_field(
+    name: str,


i think it would be good to add a docstring for name and target_col? Is this partition column?

expanded the doc string

iamjustinhsu · 2026-05-29T19:10:24Z

@@ -902,6 +960,16 @@ def combine(
    ) -> SupportsRichComparisonType:
        return max(current_accumulator, new)

+    def output_field(self, input_schema: "pa.Schema") -> Optional["pa.Field"]:


how come u can't use _agg_output_field here?

We can

return _agg_output_field(
    self.name,
    input_schema,
    self._target_col_name,
    lambda a: pc.max(pc.abs(a)),
)

Signed-off-by: Goutam <goutam@anyscale.com>


goutamvenkat-anyscale requested a review from a team as a code owner May 15, 2026 23:49

gemini-code-assist Bot reviewed May 15, 2026

Comment thread python/ray/data/_internal/logical/operators/all_to_all_operator.py Outdated

Comment thread python/ray/data/_internal/execution/operators/join.py Outdated

goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels May 16, 2026

goutamvenkat-anyscale changed the title ~~[Data] Phase 1 schema inference for non-UDF logical operators~~ May 16, 2026

gemini-code-assist Bot reviewed May 16, 2026

Comment thread python/ray/data/_internal/logical/operators/map_operator.py

cursor Bot reviewed May 17, 2026

Comment thread python/ray/data/expressions.py Outdated

cursor Bot reviewed May 17, 2026

Comment thread python/ray/data/_internal/logical/interfaces/logical_operator.py Outdated

goutamvenkat-anyscale commented May 20, 2026

goutamvenkat-anyscale force-pushed the goutam/schema_inference_phase_1 branch from 9514948 to ddedb4f Compare May 20, 2026 08:38

cursor Bot reviewed May 20, 2026

Comment thread python/ray/data/expressions.py Outdated

cursor Bot reviewed May 20, 2026

Comment thread python/ray/data/_internal/logical/interfaces/logical_operator.py

gemini-code-assist Bot reviewed May 20, 2026

Comment thread python/ray/data/_internal/logical/operators/map_operator.py

cursor Bot reviewed May 20, 2026

Comment thread python/ray/data/_internal/logical/operators/one_to_one_operator.py

iamjustinhsu reviewed May 21, 2026

goutamvenkat-anyscale force-pushed the goutam/schema_inference_phase_1 branch 2 times, most recently from 1585770 to 4b6491c Compare May 22, 2026 20:51

goutamvenkat-anyscale and others added 10 commits May 28, 2026 10:54

Address comments

68ab4c9

Signed-off-by: Goutam <goutam@anyscale.com>

Fix all tests

a63305f

Signed-off-by: Goutam <goutam@anyscale.com>

make pyrefly and annotations happy

99a23ef

Signed-off-by: Goutam <goutam@anyscale.com>

Fix pyrefly

508477e

Signed-off-by: Goutam <goutam@anyscale.com>

Address comments

168cbd0

Signed-off-by: Goutam <goutam@anyscale.com>

Add comment back

1c276b7

Signed-off-by: Goutam <goutam@anyscale.com>

Clean up

7ac6fb0

Signed-off-by: Goutam <goutam@anyscale.com>

Address comments

a473588

Signed-off-by: Goutam <goutam@anyscale.com>

Move pa_join method

b00c17f

Signed-off-by: Goutam <goutam@anyscale.com>

iamjustinhsu reviewed May 29, 2026

goutamvenkat-anyscale force-pushed the goutam/schema_inference_phase_1 branch from 4b6491c to b00c17f Compare May 30, 2026 00:13

Address comments

adf953f

Signed-off-by: Goutam <goutam@anyscale.com>

iamjustinhsu approved these changes Jun 1, 2026

goutamvenkat-anyscale merged commit 68041c4 into ray-project:master Jun 1, 2026
6 checks passed

goutamvenkat-anyscale deleted the goutam/schema_inference_phase_1 branch June 1, 2026 18:50

goutamvenkat-anyscale mentioned this pull request Jun 3, 2026

[Data] Fix hash-shuffle aggregator memory estimation: metadata propagation, node-size clamp, and column pruning #63809

Merged
