
[Data] Convert drop_columns to a Project logical operator when input schema is known #63813

Merged
goutamvenkat-anyscale merged 2 commits into ray-project:master from goutamvenkat-anyscale:goutam/schema_inference_phase_3
Jun 3, 2026


@goutamvenkat-anyscale

Contributor

Change

drop_columns reshapes into self.select_columns(keep_cols) when the input op's infer_schema() returns a pa.Schema, keeping the typed schema chain intact so Dataset.schema() resolves without a limit(1) execution. Missing columns raise KeyError eagerly at the call site on the typed path.

When the input schema is opaque (UDF chain, PandasBlockSchema source) or all columns are dropped, drop_columns falls back to the existing MapBatches path. Falling back in the all-columns case avoids the internal __bsp_stub placeholder that select_columns([]) inserts.
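
The dispatch described above can be sketched in plain Python. This is a minimal illustration, not the code in python/ray/data/dataset.py: plan_drop_columns is a hypothetical helper, the "noop" result for an empty cols list is an assumption about what the early return does, and the wording before "Available columns" in the error message is illustrative.

```python
from typing import List, Optional, Sequence, Tuple


def plan_drop_columns(
    schema_names: Optional[Sequence[str]], cols: Sequence[str]
) -> Tuple[str, List[str]]:
    """Hypothetical helper: pick the path a drop_columns call would take.

    schema_names is None when the input schema is opaque (not statically known).
    Returns ("project", columns to keep), ("map_batches", columns to drop),
    or ("noop", []) when there is nothing to drop (assumed behavior).
    """
    if not cols:
        # Nothing to drop: skip schema inference entirely.
        return ("noop", [])
    if schema_names is None:
        # Opaque schema: cannot validate here, so defer to the runtime path.
        return ("map_batches", list(cols))

    names_set = set(schema_names)
    cols_set = set(cols)
    missing = [c for c in cols if c not in names_set]
    if missing:
        # Typed path: report unknown columns eagerly at the call site.
        raise KeyError(
            f"Columns not found: {missing}. "
            f"Available columns: {list(schema_names)}"
        )

    # Preserve the schema's column order for the surviving columns.
    keep = [c for c in schema_names if c not in cols_set]
    if keep:
        return ("project", keep)
    # Every column is dropped: avoid an empty projection and fall back.
    return ("map_batches", list(cols))
```

With a known schema of a, b, c, dropping b yields a projection over a and c, while dropping every column or starting from an opaque schema selects the fallback.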


Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner June 3, 2026 05:12

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request updates Dataset.drop_columns to reshape the operation into a Project over the surviving columns when the input schema is statically known. This keeps the typed schema chain intact and allows missing columns to be reported eagerly. If the schema is unknown or all columns are dropped, it falls back to the MapBatches implementation. Unit tests were added to verify these behaviors. The reviewer suggested performance improvements, including an early return when cols is empty and using sets for O(1) membership lookups during column filtering.


Comment thread python/ray/data/dataset.py Outdated
Return early when cols is empty to skip schema inference and a redundant
Project. Use sets for membership checks, reducing the missing/keep
computation from O(M*N) to O(M+N).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Goutam <goutam@anyscale.com>
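
The complexity change in this commit can be illustrated with a small standalone comparison. This is a sketch under assumptions: split_with_lists and split_with_sets are hypothetical names, and neither function is taken from the PR. Both compute the same missing and keep lists; only the cost of each membership check differs.

```python
from typing import List, Sequence, Tuple


def split_with_lists(
    names: Sequence[str], cols: Sequence[str]
) -> Tuple[List[str], List[str]]:
    # Each `in` scans a sequence, so the total work is O(M*N).
    missing = [c for c in cols if c not in names]
    keep = [n for n in names if n not in cols]
    return missing, keep


def split_with_sets(
    names: Sequence[str], cols: Sequence[str]
) -> Tuple[List[str], List[str]]:
    # Build each set once; every lookup is then O(1) on average,
    # for O(M+N) total work.
    names_set = set(names)
    cols_set = set(cols)
    missing = [c for c in cols if c not in names_set]
    # Iterating over `names` (not the set) preserves column order.
    keep = [n for n in names if n not in cols_set]
    return missing, keep
```

The lists are still built by iterating over the original sequences, so the column order of the result does not depend on set ordering.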
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Jun 3, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 088f044.

compute=compute,
concurrency=concurrency,
**ray_remote_args,
)


Typed path ignores compute

Medium Severity

When the input has a static PyArrow schema, drop_columns routes through select_columns, which always builds a TaskPoolStrategy from concurrency and never applies the compute argument. The map_batches fallback still resolves compute via get_compute_strategy, so execution settings can change depending on schema visibility.


Contributor Author


I believe compute is deprecated for drop_columns...

f"{missing}. Available columns: {input_schema.names}"
)
keep = [c for c in input_schema.names if c not in cols_set]
if keep:

Contributor


what if keep is empty?

@goutamvenkat-anyscale goutamvenkat-anyscale merged commit d8ea7ee into ray-project:master Jun 3, 2026
9 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/schema_inference_phase_3 branch June 3, 2026 23:39
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…put schema is known (ray-project#63813)

limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…put schema is known (ray-project#63813)


Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

2 participants