[Data] Convert `drop_columns` to a `Project` logical operator when input schema is known by goutamvenkat-anyscale · Pull Request #63813 · ray-project/ray

goutamvenkat-anyscale · 2026-06-03T05:11:59Z

Change

`drop_columns` reshapes into `self.select_columns(keep_cols)` when the input op's `infer_schema()` returns a `pa.Schema`, keeping the typed schema chain intact so `Dataset.schema()` resolves without a `limit(1)` execution. Missing columns raise `KeyError` eagerly at the call site on the typed path.

When the input schema is opaque (UDF chain, `PandasBlockSchema` source) or all columns are dropped, `drop_columns` falls back to the existing `MapBatches` path. The all-dropped fallback avoids the internal `__bsp_stub` placeholder that `select_columns([])` inserts.
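The routing described above can be sketched in plain Python. This is an illustration only, not the code in `python/ray/data/dataset.py`: `plan_drop_columns`, its return tags, and the use of `None` to stand in for an opaque schema are assumptions made for the example, and the error text only approximates the message visible in the diff.

```python
from typing import List, Optional, Tuple


def plan_drop_columns(
    schema_names: Optional[List[str]], cols: List[str]
) -> Tuple[str, Optional[List[str]]]:
    """Pick a plan for dropping `cols` (hypothetical helper, not Ray's code).

    `schema_names` stands in for the column names of a statically known
    schema, or None when the schema is opaque.
    """
    if schema_names is None:
        # Opaque schema: nothing can be validated before execution.
        return ("map_batches", None)

    available = set(schema_names)
    cols_set = set(cols)

    # Typed path: report unknown columns eagerly, at the call site.
    missing = [c for c in cols if c not in available]
    if missing:
        raise KeyError(
            f"Columns to drop not found: {missing}. "
            f"Available columns: {schema_names}"
        )

    # Surviving columns, in schema order.
    keep = [c for c in schema_names if c not in cols_set]
    if not keep:
        # Every column is dropped: avoid projecting onto an empty list.
        return ("map_batches", None)
    return ("project", keep)
```

Under these assumptions, `plan_drop_columns(["a", "b", "c"], ["b"])` selects the projection over `["a", "c"]`, while an opaque schema or a request that drops every column selects the fallback.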


gemini-code-assist

Code Review

This pull request updates Dataset.drop_columns to reshape the operation into a Project over the surviving columns when the input schema is statically known. This keeps the typed schema chain intact and allows missing columns to be reported eagerly. If the schema is unknown or all columns are dropped, it falls back to the MapBatches implementation. Unit tests were added to verify these behaviors. The reviewer suggested performance improvements, including an early return when cols is empty and using sets for O(1) membership lookups during column filtering.


Return early when cols is empty to skip schema inference and a redundant Project. Use sets for membership checks, reducing the missing/keep computation from O(M*N) to O(M+N). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Goutam <goutam@anyscale.com>
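A minimal sketch of the complexity change this commit describes, assuming column names are hashable strings. `split_quadratic` and `split_linear` are hypothetical names for illustration; neither is the function in the PR, and the early return here simply hands back the schema unchanged.

```python
from typing import List, Tuple


def split_quadratic(names: List[str], cols: List[str]) -> Tuple[List[str], List[str]]:
    """List membership: each `in` scans a list, so the cost is O(M*N)."""
    missing = [c for c in cols if c not in names]
    keep = [c for c in names if c not in cols]
    return missing, keep


def split_linear(names: List[str], cols: List[str]) -> Tuple[List[str], List[str]]:
    """Set membership: build two sets once, then O(1) lookups, O(M+N) total."""
    if not cols:
        # Nothing to drop: skip the set construction entirely.
        return [], list(names)
    names_set, cols_set = set(names), set(cols)
    missing = [c for c in cols if c not in names_set]
    keep = [c for c in names if c not in cols_set]
    return missing, keep
```

Both versions return the same `missing` and `keep` lists, in the same order. Only the cost of each membership check differs, which matters mainly for wide schemas.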

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 088f044.

cursor · 2026-06-03T08:37:42Z

+                    compute=compute,
+                    concurrency=concurrency,
+                    **ray_remote_args,
+                )


Typed path ignores compute

Medium Severity

When the input has a static PyArrow schema, drop_columns routes through select_columns, which always builds a TaskPoolStrategy from concurrency and never applies the compute argument. The map_batches fallback still resolves compute via get_compute_strategy, so execution settings can change depending on schema visibility.


goutamvenkat-anyscale (reply): I believe `compute` is deprecated for `drop_columns`...

iamjustinhsu · 2026-06-03T23:30:17Z

+                    f"{missing}. Available columns: {input_schema.names}"
+                )
+            keep = [c for c in input_schema.names if c not in cols_set]
+            if keep:


What if `keep` is empty?

…put schema is known (ray-project#63813) ## Change `drop_columns` reshapes into `self.select_columns(keep_cols)` when the input op's `infer_schema()` returns a `pa.Schema`, keeping the typed schema chain intact so `Dataset.schema()` resolves without a `limit(1)` execution. Missing columns raise `KeyError` eagerly at the call site on the typed path. When the input schema is opaque (UDF chain, `PandasBlockSchema` source) or all columns are dropped (avoids the internal __bsp_stub placeholder that `select_columns([])` inserts), falls back to the existing `MapBatches` path. --------- Signed-off-by: Goutam <goutam@anyscale.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

goutamvenkat-anyscale requested a review from a team as a code owner June 3, 2026 05:12

gemini-code-assist Bot reviewed Jun 3, 2026


Comment thread python/ray/data/dataset.py Outdated

goutamvenkat-anyscale added the labels `data` (Ray Data-related issues) and `go` (add ONLY when ready to merge, run all tests) Jun 3, 2026

cursor Bot reviewed Jun 3, 2026


iamjustinhsu approved these changes Jun 3, 2026


goutamvenkat-anyscale merged commit d8ea7ee into ray-project:master Jun 3, 2026
9 checks passed

goutamvenkat-anyscale deleted the goutam/schema_inference_phase_3 branch June 3, 2026 23:39

goutamvenkat-anyscale merged 2 commits into ray-project:master from goutamvenkat-anyscale:goutam/schema_inference_phase_3
