Skip to content

[Data] - Iceberg support upsert tables + schema update + overwrite tables#58270

Merged
raulchen merged 13 commits into
ray-project:masterfrom
goutamvenkat-anyscale:goutam/iceberg_upsert_overwrite_tables
Nov 13, 2025
Merged

[Data] - Iceberg support upsert tables + schema update + overwrite tables#58270
raulchen merged 13 commits into
ray-project:masterfrom
goutamvenkat-anyscale:goutam/iceberg_upsert_overwrite_tables

Conversation

@goutamvenkat-anyscale

@goutamvenkat-anyscale goutamvenkat-anyscale commented Oct 29, 2025

Copy link
Copy Markdown
Contributor

Description

  • Support upserting iceberg tables for IcebergDatasink
  • Update schema on APPEND and UPSERT
  • Enable overwriting the entire table

Upgrades to pyicberg 0.10.0 because it now supports upsert and overwrite functionality. Also for append, the library now handles the transaction logic implicitly so that burden can be lifted from Ray Data.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

…bles

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner October 29, 2025 01:38

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements to the Iceberg datasink, adding support for UPSERT and OVERWRITE save modes, along with automatic schema evolution. The implementation is well-structured, cleanly separating the logic for different write modes and leveraging PyIceberg's high-level APIs where appropriate. The accompanying documentation updates and comprehensive test suite are excellent, clearly explaining the new features and ensuring their correctness. Overall, this is a high-quality contribution that greatly improves the functionality of the Iceberg integration. I've left a couple of minor suggestions for code refinement.

Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale

Copy link
Copy Markdown
Contributor Author

/gemini review

Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds significant new functionality to the Iceberg datasink, including support for UPSERT and OVERWRITE modes, as well as automatic schema evolution. The code is well-structured, with good separation of concerns into different helper methods. The addition of to_iceberg() on expressions to support filter pushdown for overwrites is a great feature. The tests are comprehensive and cover the new modes and schema evolution scenarios well.

I have a few comments, including a high-severity typing issue in IcebergDatasink that violates the Datasink generic contract, a point about reliance on a private pyiceberg API, a minor clarification needed in the write_iceberg docstring, and a misleading test name that should be corrected for clarity. Overall, this is a great contribution that significantly enhances Ray Data's Iceberg integration.

Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
Comment thread python/ray/data/dataset.py Outdated
Comment thread python/ray/data/tests/test_iceberg.py Outdated
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Oct 29, 2025
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
@goutamvenkat-anyscale

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements to the Iceberg datasink, adding support for UPSERT and OVERWRITE save modes, as well as automatic schema evolution. The implementation is well-structured, separating logic for different modes into distinct methods, which improves readability and maintainability. The accompanying tests are comprehensive and cover the new functionality well.

My review includes a few suggestions to improve documentation accuracy and code robustness by avoiding the use of private members from the pyiceberg library. These changes will make the code more maintainable and easier for users to understand.

Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
Comment thread python/ray/data/dataset.py
Comment thread python/ray/data/dataset.py Outdated
Comment thread python/ray/data/tests/test_iceberg.py Outdated
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py Outdated
Signed-off-by: Goutam <goutam@anyscale.com>
Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Nov 11, 2025
Signed-off-by: Goutam <goutam@anyscale.com>
Comment thread python/ray/data/_internal/datasource/iceberg_datasink.py
Signed-off-by: Goutam <goutam@anyscale.com>
Comment thread python/ray/data/_internal/savemode.py Outdated
ERROR: Raise an error if data already exists.
UPSERT: Update existing rows that match on key fields, or insert new rows.
Requires identifier/key fields to be specified.
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, use triple-quote-strings for each item.
This would work better with IDEs - you can hover an item to show its doc.

self._mode = mode
self._overwrite_filter = overwrite_filter
self._upsert_kwargs = (upsert_kwargs or {}).copy()
self._overwrite_kwargs = (overwrite_kwargs or {}).copy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, check the kwargs are only set when mode is relevant.

@raulchen raulchen merged commit 2f55d07 into ray-project:master Nov 13, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/iceberg_upsert_overwrite_tables branch November 13, 2025 00:38
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…bles (ray-project#58270)

## Description
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table

Upgrades to pyicberg 0.10.0 because it now supports upsert and overwrite
functionality. Also for append, the library now handles the transaction
logic implicitly so that burden can be lifted from Ray Data.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…bles (ray-project#58270)

## Description
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table

Upgrades to pyicberg 0.10.0 because it now supports upsert and overwrite
functionality. Also for append, the library now handles the transaction
logic implicitly so that burden can be lifted from Ray Data.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…bles (ray-project#58270)

## Description
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table

Upgrades to pyicberg 0.10.0 because it now supports upsert and overwrite
functionality. Also for append, the library now handles the transaction
logic implicitly so that burden can be lifted from Ray Data.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants