[Data] - Iceberg support predicate & projection pushdown by goutamvenkat-anyscale · Pull Request #58286 · ray-project/ray

goutamvenkat-anyscale · 2025-10-29T20:21:31Z

Description

Predicate pushdown (#58150) in conjunction with this PR should speed up reads from Iceberg.

Once the above change lands, we can add pushdown interface support for IcebergDatasource.

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a converter from Ray Data expressions to PyIceberg expressions, enabling predicate pushdown for Iceberg data sources. The implementation is well-structured and includes comprehensive tests. My feedback focuses on improving code maintainability by reducing duplication, adding missing type hints for better code clarity and safety, and strengthening test assertions to ensure structural equality of the converted expressions.


goutamvenkat-anyscale · 2025-10-29T20:29:27Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a converter from Ray Data expressions to PyIceberg expressions, enabling predicate pushdown for Iceberg data sources. The implementation is clean and well-tested. I've added a few suggestions to improve performance and maintainability in the new _IcebergExpressionVisitor by making operation maps class-level constants and dynamically generating error messages.
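As an illustration of the suggestion above (class-level operation maps and dynamically generated error messages), here is a minimal, self-contained sketch. It is not the PR's _IcebergExpressionVisitor: the node classes `Col`, `Lit`, `BinOp` and the visitor `ToyPushdownVisitor` are hypothetical, and the output is a nested tuple whose tags are merely named after PyIceberg expression classes rather than real PyIceberg objects.

```python
from dataclasses import dataclass
from typing import Any, Union


# Hypothetical, simplified expression nodes; Ray Data's real classes differ.
@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class BinOp:
    op: str  # e.g. "eq", "gt", "and"
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]


class ToyPushdownVisitor:
    # Class-level constants: built once and shared by every instance.
    _COMPARISONS = {
        "eq": "EqualTo",
        "ne": "NotEqualTo",
        "gt": "GreaterThan",
        "lt": "LessThan",
    }
    _LOGICAL = {"and": "And", "or": "Or"}

    def visit(self, expr: Expr) -> tuple:
        if isinstance(expr, BinOp):
            if expr.op in self._LOGICAL:
                # Logical nodes recurse into both children.
                return (
                    self._LOGICAL[expr.op],
                    self.visit(expr.left),
                    self.visit(expr.right),
                )
            if expr.op in self._COMPARISONS:
                # This sketch only handles the `column <op> literal` shape.
                if not (isinstance(expr.left, Col) and isinstance(expr.right, Lit)):
                    raise ValueError("comparison must be column <op> literal")
                return (self._COMPARISONS[expr.op], expr.left.name, expr.right.value)
            # The error message is derived from the maps, so it cannot go stale.
            supported = sorted({**self._COMPARISONS, **self._LOGICAL})
            raise ValueError(
                f"Unsupported operation {expr.op!r}; supported: {supported}"
            )
        raise ValueError(f"Cannot convert {type(expr).__name__} at the top level")
```

Because the supported-operation list in the error is computed from the two maps, adding an operator to a map updates the message automatically.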


goutamvenkat-anyscale · 2025-10-31T21:00:11Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces predicate and projection pushdown support for the Iceberg datasource, which is a significant enhancement for read performance. The implementation is well-structured, introducing an _IcebergExpressionVisitor for converting Ray Data expressions and updating IcebergDatasource to support the necessary pushdown interfaces. The accompanying tests are exceptionally thorough, covering a wide range of scenarios and combinations of filters, projections, and column renames. I've identified a minor potential bug in the projection logic and have a couple of small suggestions for code simplification.


goutamvenkat-anyscale · 2025-11-01T00:20:19Z

/gemini review


alexeykudinkin · 2025-11-06T20:40:24Z

+        Returns:
+            List of column names to project, or None if all columns are selected.
+        """
+        return self._data_columns



goutamvenkat-anyscale · 2025-11-07T02:33:41Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces significant improvements by adding predicate and projection pushdown support for Iceberg datasources. The implementation is well-structured, leveraging a new _DatasourceProjectionPushdownMixin to provide a generic framework for projection pushdown, which is also adopted by the Parquet datasource for better code consistency and reuse. A new _IcebergExpressionVisitor is introduced to translate Ray Data expressions into PyIceberg expressions, enabling efficient filtering at the source. The test coverage for these new features is comprehensive and robust.

I've found two issues: a critical bug in ParquetDatasource where a super().__init__() call is missing, which would lead to an AttributeError during predicate pushdown, and a high-severity logic error in the projection combination logic that could cause all columns to be read instead of none. Addressing these will make the implementation solid.
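The first issue (a missing super().__init__() call leading to an AttributeError) can be reproduced in isolation. The sketch below is a simplified model, not the ParquetDatasource code: `PredicateStateMixin`, `BrokenSource`, and `FixedSource` are hypothetical names, and predicates are plain strings for brevity.

```python
class PredicateStateMixin:
    """Hypothetical stand-in for a pushdown mixin that keeps state on the instance."""

    def __init__(self) -> None:
        self._predicate = None

    def apply_predicate(self, predicate: str) -> None:
        # Reads self._predicate, so __init__ must have run first.
        if self._predicate is None:
            self._predicate = predicate
        else:
            self._predicate = f"({self._predicate}) AND ({predicate})"


class BrokenSource(PredicateStateMixin):
    def __init__(self, path: str) -> None:
        # Forgets super().__init__(), so _predicate is never created.
        self.path = path


class FixedSource(PredicateStateMixin):
    def __init__(self, path: str) -> None:
        super().__init__()  # sets up the mixin's state
        self.path = path
```

Calling `apply_predicate` on a `BrokenSource` fails on the first attribute read, while `FixedSource` accumulates predicates as expected.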


goutamvenkat-anyscale · 2025-11-07T02:45:05Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces significant improvements by adding predicate and projection pushdown support for Iceberg datasources. The refactoring to create a _DatasourceProjectionPushdownMixin is a great step towards centralizing projection logic and promoting code reuse, as demonstrated by its adoption in the Parquet datasource as well. The implementation of an _IcebergExpressionVisitor for translating Ray Data expressions is a solid approach for enabling predicate pushdown. Furthermore, the API enhancements in read_iceberg, deprecating older parameters in favor of a more fluent API, improve consistency across the library. The test coverage for these new features is comprehensive and well-executed. I've included a couple of minor suggestions to improve documentation clarity and type hinting.


bveeramani · 2025-11-07T21:12:18Z

-    def supports_predicate_pushdown(self) -> bool:
-        return True
-


Yes. In the prior implementation, the CSV datasource incorrectly applied predicate and projection pushdown, which makes no sense because CSV has no accompanying metadata.

bveeramani · 2025-11-07T22:11:01Z

+        # Store as projection_map (identity mapping if columns specified, None otherwise)
+        # Note: Empty list [] means no columns, None means all columns
+        if data_columns is None:
+            self._projection_map = None
+        else:
+            self._projection_map = {col: col for col in data_columns}


Also out-of-scope for this PR, but I think it's implicit that you need to set this specific _projection_map attribute that's defined in an ancestor class, and it's not part of the _DatasourceProjectionPushdownMixin interface.

Actually this will be deprecated in a few releases, because it's an anti-pattern to pass in the columns as part of the reads. But I see what you mean.
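The distinction in the diff comment above (an empty list means no columns, None means all columns) is easy to lose when projections are combined with truthiness checks. The following is a minimal sketch of that convention; `to_projection_map` and `combine` are hypothetical helpers written for illustration and are not the methods on the datasource or mixin.

```python
from typing import Dict, List, Optional

# Maps source column name -> output column name; None means "all columns".
ProjectionMap = Optional[Dict[str, str]]


def to_projection_map(columns: Optional[List[str]]) -> ProjectionMap:
    # Identity mapping when columns are given; None is passed through.
    if columns is None:
        return None
    return {col: col for col in columns}


def combine(current: ProjectionMap, requested: Optional[List[str]]) -> ProjectionMap:
    """Narrow `current` to the `requested` output column names."""
    # Compare against None explicitly: `if not requested` would treat []
    # (no columns) the same as None (all columns).
    if requested is None:
        return current
    if current is None:
        return to_projection_map(requested)
    wanted = set(requested)
    return {src: out for src, out in current.items() if out in wanted}
```

With this convention, `combine(None, [])` yields an empty mapping rather than None, so an empty projection is never widened back to a full read.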

bveeramani · 2025-11-07T22:15:38Z

                    current_project.exprs
                )


Is there ever a case where a logical operator subclasses LogicalOperatorSupportsProjectionPushdown but supports_projection_pushdown() is false?

Yes. The Read operator supports pushdown but not all readers support projection pushdown (CSV is an example.)
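To make the answer above concrete: an operator can implement the pushdown interface while delegating the actual capability check to its datasource. The sketch below assumes hypothetical classes (`ToyDatasource`, `ToyColumnarSource`, `ToyRead`) and a hypothetical rule helper `try_push_projection`; none of these are Ray Data's real classes.

```python
from typing import List, Optional


class ToyDatasource:
    """Hypothetical base reader: no projection pushdown by default (e.g. CSV)."""

    def supports_projection_pushdown(self) -> bool:
        return False


class ToyColumnarSource(ToyDatasource):
    """Hypothetical reader that can skip columns at the source."""

    def __init__(self) -> None:
        self.columns: Optional[List[str]] = None  # None means all columns

    def supports_projection_pushdown(self) -> bool:
        return True

    def apply_projection(self, columns: List[str]) -> None:
        self.columns = list(columns)


class ToyRead:
    """Hypothetical read operator: has the interface, delegates the answer."""

    def __init__(self, datasource: ToyDatasource) -> None:
        self.datasource = datasource

    def supports_projection_pushdown(self) -> bool:
        return self.datasource.supports_projection_pushdown()


def try_push_projection(read_op: ToyRead, columns: List[str]) -> bool:
    # The rule must check the flag even though the operator has the interface.
    if not read_op.supports_projection_pushdown():
        return False
    read_op.datasource.apply_projection(columns)
    return True
```

The rule returns False for the base reader and leaves the plan untouched, which mirrors the CSV case mentioned in the reply.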



…#58286) ## Description Predicate pushdown (ray-project#58150) in conjunction with this PR should speed up reads from Iceberg. Once the above change lands, we can add the pushdown interface support for IcebergDatasource --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>

alexeykudinkin · 2025-11-08T03:47:26Z

+        # Initialize parent class to set up predicate pushdown mixin
+        super().__init__()
+


Just make __init__ an abstract method in the mixin.
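One possible reading of this suggestion, sketched under assumptions: if the mixin derives from `abc.ABC` and marks `__init__` as abstract, a subclass that does not define its own `__init__` cannot be instantiated. `ToyPushdownMixin`, `ToySource`, and `ForgetfulSource` are hypothetical names and not the classes in this PR.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ToyPushdownMixin(ABC):
    """Hypothetical mixin; not the interface defined in the PR."""

    @abstractmethod
    def __init__(self) -> None:
        # Shared state lives here; subclasses reach it via super().__init__().
        self._predicate: Optional[str] = None


class ToySource(ToyPushdownMixin):
    def __init__(self, path: str) -> None:
        super().__init__()
        self.path = path


class ForgetfulSource(ToyPushdownMixin):
    pass  # no __init__ override, so instantiation raises TypeError
```

Note that this only forces subclasses to define `__init__`; it does not by itself guarantee that the override calls `super().__init__()`.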

alexeykudinkin · 2026-01-02T02:50:18Z

-                return input_op.apply_projection(
-                    required_columns, output_column_rename_map
-                )
+                # Determine columns to project


@goutamvenkat-anyscale please explain changes in this rule

[Data] - Ray Data Expr to Iceberg Expr converter

efc1a60

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner October 29, 2025 20:21

goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Oct 29, 2025

Merge branch 'master' into goutam/iceberg_expr

a818110

gemini-code-assist Bot reviewed Oct 29, 2025


Comment thread python/ray/data/_internal/planner/plan_expression/expression_visitors.py Outdated

Comment thread python/ray/data/_internal/planner/plan_expression/expression_visitors.py Outdated

Comment thread python/ray/data/tests/test_expressions.py Outdated

Cleanup

5aade3e

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist Bot reviewed Oct 29, 2025


goutamvenkat-anyscale added 3 commits October 29, 2025 13:43

More comments

d55cf61

Signed-off-by: Goutam <goutam@anyscale.com>

type_checking

adf7c3f

Signed-off-by: Goutam <goutam@anyscale.com>

Guard imports for pyiceberg

c97d29a

Signed-off-by: Goutam <goutam@anyscale.com>


goutamvenkat-anyscale added 3 commits October 29, 2025 15:18

One more cleanup

0c4d033

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/iceberg_expr

2a82a26

Predicate pushdown into Iceberg

45ccaaa

Signed-off-by: Goutam <goutam@anyscale.com>


goutamvenkat-anyscale added 2 commits October 31, 2025 13:33

go back to pyiceberg 0.9.0

19384ea

Signed-off-by: Goutam <goutam@anyscale.com>

Add projection pushdown support

17356ef

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale changed the title from ~~[Data] - Ray Data Expr to Iceberg Expr converter~~ to [Data] - Iceberg support predicate & projection pushdown Oct 31, 2025

Deprecation warnings at read_iceberg level

32a416b

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist Bot reviewed Oct 31, 2025


Comment thread python/ray/data/_internal/datasource/iceberg_datasource.py Outdated

alexeykudinkin reviewed Oct 31, 2025


Comment thread python/ray/data/expressions.py Outdated

Comment thread python/ray/data/_internal/datasource/iceberg_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/iceberg_datasource.py Outdated

Address comments + reduce duplication

d097ea0

Signed-off-by: Goutam <goutam@anyscale.com>


goutamvenkat-anyscale added 2 commits October 31, 2025 17:04

More unification

1c84e54

Signed-off-by: Goutam <goutam@anyscale.com>

Cleanup

eac91a7

Signed-off-by: Goutam <goutam@anyscale.com>


Remove prior change

3fcd4c3

Signed-off-by: Goutam <goutam@anyscale.com>

alexeykudinkin reviewed Nov 6, 2025


Comment thread python/ray/data/_internal/datasource/csv_datasource.py Outdated

alexeykudinkin reviewed Nov 6, 2025


goutamvenkat-anyscale added 4 commits November 6, 2025 17:12

Temp changes

4f57a05

Signed-off-by: Goutam <goutam@anyscale.com>

Addressed comments

adf5eca

Signed-off-by: Goutam <goutam@anyscale.com>

Cleanup

c0cab8b

Signed-off-by: Goutam <goutam@anyscale.com>

Some more cleanup

bc9134a

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist Bot reviewed Nov 7, 2025


Comment thread python/ray/data/_internal/datasource/parquet_datasource.py

Comment thread python/ray/data/datasource/datasource.py Outdated

thing

35830e2

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist Bot reviewed Nov 7, 2025


Comment thread python/ray/data/datasource/datasource.py

Comment thread python/ray/data/_internal/planner/plan_expression/expression_visitors.py Outdated

doctest

24d7210

Signed-off-by: Goutam <goutam@anyscale.com>

bveeramani reviewed Nov 7, 2025


goutamvenkat-anyscale added 2 commits November 7, 2025 16:01

Merge branch 'master' into goutam/iceberg_expr

8227428

Address comments

745a1be

Signed-off-by: Goutam <goutam@anyscale.com>

bveeramani approved these changes Nov 11, 2025


bveeramani merged commit 10983e8 into ray-project:master Nov 11, 2025
6 checks passed

goutamvenkat-anyscale deleted the goutam/iceberg_expr branch November 11, 2025 01:38

richardliaw mentioned this pull request Nov 15, 2025

Ray Data Q4 Roadmap + Wishlist #58665

Open

alexeykudinkin reviewed Jan 2, 2026
