Skip to content

[Data] Remove tfx-bsl support from read_tfrecords#63245

Merged
goutamvenkat-anyscale merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:data/remove-tfx-bsl
May 11, 2026
Merged

[Data] Remove tfx-bsl support from read_tfrecords#63245
goutamvenkat-anyscale merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:data/remove-tfx-bsl

Conversation

@goutamvenkat-anyscale

@goutamvenkat-anyscale goutamvenkat-anyscale commented May 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Ray Data now pins pyarrow>=17, but tfx-bsl caps pyarrow<11, so the optional tfx-bsl-backed reader for read_tfrecords can no longer install alongside Ray. Remove the tfx-bsl path entirely; TFRecord files continue to read via the default pure-pyarrow + tensorflow path.

Changes

  • Drop TFXReadOptions and the tfx_read_options kwarg on ray.data.read_tfrecords; drop the public re-export.
  • Remove _tfx_read_stream, _resolve_full_path, and the schema-inference helpers used only by the tfx-bsl path.
  • Delete the dedicated CI: datatfxbslbuild docker image, data_tfxbsl_depset / relaxed_data_tfxbsl_ci_depset depset entries and lockfiles, the "TFRecords (tfx-bsl) tests" Buildkite step, python/requirements/data/tfxbsl.txt, and the tfxbsl test tag (so test_tfrecords now runs in the standard arrow v17/v23 data jobs).
  • Drop the tfx_bsl.* pyrefly ignore, the TFXReadOptions doc entry, and the corresponding pydoclint baseline line.

read_tfrecords and TFXReadOptions were both @PublicAPI(stability=\"alpha\"), so per Ray's API stability policy breaking changes are permitted without a deprecation cycle.

Ray Data now requires pyarrow>=17, but tfx-bsl pins pyarrow<11, so the two
cannot coexist. Drop the tfx-bsl read path along with TFXReadOptions and the
dedicated CI infra; TFRecord files continue to be read via the default
pure-pyarrow + tensorflow path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested review from a team as code owners May 9, 2026 00:36

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the tfx-bsl dependency and all associated logic for reading TFRecord files, including the TFXReadOptions class and specialized schema inference. The read_tfrecords API is simplified to rely solely on the standard TensorFlow-based implementation. Review feedback identifies that the new function signature for read_tfrecords and the instantiation of TFRecordDatasource are missing **kwargs, which prevents users from passing file-based configuration options such as compression. Furthermore, it is suggested to retain the GZIP test cases to verify that compression support continues to work correctly in the remaining implementation.

Comment on lines 2209 to 2212
file_extensions: Optional[List[str]] = None,
concurrency: Optional[int] = None,
override_num_blocks: Optional[int] = None,
tfx_read_options: Optional["TFXReadOptions"] = None,
) -> Dataset:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The read_tfrecords function signature is missing **kwargs. This prevents users from passing additional file-based datasource options like arrow_open_stream_args (used for compression) to the underlying TFRecordDatasource. Since the tests in test_tfrecords.py attempt to use this argument, it should be added to the signature to maintain consistency with other read_* APIs and support file-based options.

    file_extensions: Optional[List[str]] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
    **kwargs,
) -> Dataset:
Comment on lines 2297 to 2300
shuffle=shuffle,
include_paths=include_paths,
file_extensions=file_extensions,
tfx_read_options=tfx_read_options,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The TFRecordDatasource constructor should receive the **kwargs from read_tfrecords to ensure that file-based options (like compression settings) are correctly passed to the datasource.

        shuffle=shuffle,
        include_paths=include_paths,
        file_extensions=file_extensions,
        **kwargs,
    )
Comment on lines +385 to 389
"with_tf_schema,compression",
[
(True, True, None),
(True, True, "GZIP"),
(True, False, None),
(False, True, None),
(False, True, "GZIP"),
(False, False, None),
(True, None),
(False, None),
],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The parametrization for test_read_tfrecords has been reduced to only test with compression=None. Since the default path (pure-pyarrow + tensorflow) supports reading compressed TFRecord files via arrow_open_stream_args, we should retain the GZIP test cases to ensure continued support and avoid regressions in the remaining implementation.

    "with_tf_schema,compression",
    [
        (True, None),
        (True, "GZIP"),
        (False, None),
        (False, "GZIP"),
    ],
@goutamvenkat-anyscale goutamvenkat-anyscale added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels May 9, 2026
@aslonnie aslonnie requested a review from elliot-barn May 11, 2026 15:37
@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) May 11, 2026 16:09
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 09ab4fe into ray-project:master May 11, 2026
8 of 9 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the data/remove-tfx-bsl branch May 11, 2026 16:33
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
## Summary

Ray Data now pins `pyarrow>=17`, but `tfx-bsl` caps `pyarrow<11`, so the
optional tfx-bsl-backed reader for `read_tfrecords` can no longer
install alongside Ray. Remove the tfx-bsl path entirely; TFRecord files
continue to read via the default pure-pyarrow + tensorflow path.

## Changes

- Drop `TFXReadOptions` and the `tfx_read_options` kwarg on
`ray.data.read_tfrecords`; drop the public re-export.
- Remove `_tfx_read_stream`, `_resolve_full_path`, and the
schema-inference helpers used only by the tfx-bsl path.
- Delete the dedicated CI: `datatfxbslbuild` docker image,
`data_tfxbsl_depset` / `relaxed_data_tfxbsl_ci_depset` depset entries
and lockfiles, the \"TFRecords (tfx-bsl) tests\" Buildkite step,
`python/requirements/data/tfxbsl.txt`, and the `tfxbsl` test tag (so
`test_tfrecords` now runs in the standard arrow v17/v23 data jobs).
- Drop the `tfx_bsl.*` pyrefly ignore, the `TFXReadOptions` doc entry,
and the corresponding pydoclint baseline line.

`read_tfrecords` and `TFXReadOptions` were both
`@PublicAPI(stability=\"alpha\")`, so per Ray's [API stability
policy](https://docs.ray.io/en/latest/ray-contribute/stability.html)
breaking changes are permitted without a deprecation cycle.

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants