Skip to content

Revert "[Data] - Iceberg support upsert tables + schema update + overwrite tables"#59185

Merged
alexeykudinkin merged 4 commits into
masterfrom
revert-58270-goutam/iceberg_upsert_overwrite_tables
Dec 5, 2025
Merged

Revert "[Data] - Iceberg support upsert tables + schema update + overwrite tables"#59185
alexeykudinkin merged 4 commits into
masterfrom
revert-58270-goutam/iceberg_upsert_overwrite_tables

Conversation

@goutamvenkat-anyscale

Copy link
Copy Markdown
Contributor

Reverts #58270

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner December 4, 2025 22:37
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) December 4, 2025 22:38
@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Dec 4, 2025
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Dec 4, 2025

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reverts the recent changes that added upsert, schema update, and overwrite support for Iceberg tables. The revert appears to be clean and complete, correctly rolling back the code changes, tests, and dependency versions.

I've identified a couple of minor issues in the reverted code that could be addressed to improve code quality. One is a potential side effect due to modifying a user-provided dictionary in-place, and the other is a suggestion to use keyword arguments for better readability and robustness. Please see my detailed comments.

f"overwrite_kwargs can only be specified when mode is SaveMode.OVERWRITE, "
f"but mode is {self._mode}"
)
self._catalog_kwargs = catalog_kwargs if catalog_kwargs is not None else {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The catalog_kwargs dictionary is assigned directly and later modified in-place with .pop() on line 60. This can lead to unexpected side effects for the caller if they reuse the dictionary. It's safer to work with a copy of the dictionary to avoid modifying the original object passed by the user.

Suggested change
self._catalog_kwargs = catalog_kwargs if catalog_kwargs is not None else {}
self._catalog_kwargs = (catalog_kwargs or {}).copy()
Comment on lines 4082 to 4084
datasink = IcebergDatasink(
table_identifier=table_identifier,
catalog_kwargs=catalog_kwargs,
snapshot_properties=snapshot_properties,
mode=mode,
overwrite_filter=overwrite_filter,
upsert_kwargs=upsert_kwargs,
overwrite_kwargs=overwrite_kwargs,
table_identifier, catalog_kwargs, snapshot_properties
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and robustness against future changes in the IcebergDatasink constructor's signature, it's recommended to use keyword arguments when instantiating the IcebergDatasink.

        datasink = IcebergDatasink(
            table_identifier=table_identifier,
            catalog_kwargs=catalog_kwargs,
            snapshot_properties=snapshot_properties,
        )
@github-actions github-actions Bot disabled auto-merge December 4, 2025 23:55
self._catalog_kwargs = catalog_kwargs if catalog_kwargs is not None else {}
self._snapshot_properties = (
snapshot_properties if snapshot_properties is not None else {}
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing dictionary copy causes caller's dict mutation

The reverted code removes the .copy() call when assigning catalog_kwargs. The pre-revert code used (catalog_kwargs or {}).copy() but the reverted code directly assigns the reference with catalog_kwargs if catalog_kwargs is not None else {}. Since line 60 calls self._catalog_kwargs.pop("name"), this mutates the caller's original dictionary, causing unexpected side effects where the user's catalog_kwargs dictionary loses its name key after calling write_iceberg().

Fix in Cursor Fix in Web

from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE
from pyiceberg.utils.config import Config

data_files_list: WriteResult[List["DataFile"]] = []

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Incorrect return type annotation in write method

The write method's return type annotation -> WriteResult[List["DataFile"]] is incorrect. According to the Datasink base class, the write method should return WriteReturnType (which is List["DataFile"] for this class), not WriteResult[WriteReturnType]. The WriteResult wrapper is created by the framework and passed to on_write_complete. Additionally, line 130 declares data_files_list with type WriteResult[List["DataFile"]] but assigns it an empty list []. The runtime behavior is correct (returning a list), but the type annotations are misleading.

Fix in Cursor Fix in Web

@alexeykudinkin alexeykudinkin merged commit afa578c into master Dec 5, 2025
6 checks passed
@alexeykudinkin alexeykudinkin deleted the revert-58270-goutam/iceberg_upsert_overwrite_tables branch December 5, 2025 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants