[Data] Remove tfx-bsl support from read_tfrecords by goutamvenkat-anyscale · Pull Request #63245 · ray-project/ray

goutamvenkat-anyscale · 2026-05-09T00:36:27Z

Summary

Ray Data now pins pyarrow>=17, but tfx-bsl caps pyarrow<11, so the optional tfx-bsl-backed reader for read_tfrecords can no longer install alongside Ray. Remove the tfx-bsl path entirely; TFRecord files continue to read via the default pure-pyarrow + tensorflow path.

Changes

Drop TFXReadOptions and the tfx_read_options kwarg on ray.data.read_tfrecords; drop the public re-export.
Remove _tfx_read_stream, _resolve_full_path, and the schema-inference helpers used only by the tfx-bsl path.
Delete the dedicated CI: datatfxbslbuild docker image, data_tfxbsl_depset / relaxed_data_tfxbsl_ci_depset depset entries and lockfiles, the "TFRecords (tfx-bsl) tests" Buildkite step, python/requirements/data/tfxbsl.txt, and the tfxbsl test tag (so test_tfrecords now runs in the standard arrow v17/v23 data jobs).
Drop the tfx_bsl.* pyrefly ignore, the TFXReadOptions doc entry, and the corresponding pydoclint baseline line.

read_tfrecords and TFXReadOptions were both @PublicAPI(stability=\"alpha\"), so per Ray's API stability policy breaking changes are permitted without a deprecation cycle.

Ray Data now requires pyarrow>=17, but tfx-bsl pins pyarrow<11, so the two cannot coexist. Drop the tfx-bsl read path along with TFXReadOptions and the dedicated CI infra; TFRecord files continue to be read via the default pure-pyarrow + tensorflow path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request removes the tfx-bsl dependency and all associated logic for reading TFRecord files, including the TFXReadOptions class and specialized schema inference. The read_tfrecords API is simplified to rely solely on the standard TensorFlow-based implementation. Review feedback identifies that the new function signature for read_tfrecords and the instantiation of TFRecordDatasource are missing **kwargs, which prevents users from passing file-based configuration options such as compression. Furthermore, it is suggested to retain the GZIP test cases to verify that compression support continues to work correctly in the remaining implementation.

gemini-code-assist · 2026-05-09T00:39:04Z

    file_extensions: Optional[List[str]] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
-    tfx_read_options: Optional["TFXReadOptions"] = None,
 ) -> Dataset:


The read_tfrecords function signature is missing **kwargs. This prevents users from passing additional file-based datasource options like arrow_open_stream_args (used for compression) to the underlying TFRecordDatasource. Since the tests in test_tfrecords.py attempt to use this argument, it should be added to the signature to maintain consistency with other read_* APIs and support file-based options.

file_extensions: Optional[List[str]] = None, concurrency: Optional[int] = None, override_num_blocks: Optional[int] = None, **kwargs, ) -> Dataset:

gemini-code-assist · 2026-05-09T00:39:04Z

        shuffle=shuffle,
        include_paths=include_paths,
        file_extensions=file_extensions,
-        tfx_read_options=tfx_read_options,
    )


The TFRecordDatasource constructor should receive the **kwargs from read_tfrecords to ensure that file-based options (like compression settings) are correctly passed to the datasource.

shuffle=shuffle, include_paths=include_paths, file_extensions=file_extensions, **kwargs, )

gemini-code-assist · 2026-05-09T00:39:04Z

+    "with_tf_schema,compression",
    [
-        (True, True, None),
-        (True, True, "GZIP"),
-        (True, False, None),
-        (False, True, None),
-        (False, True, "GZIP"),
-        (False, False, None),
+        (True, None),
+        (False, None),
    ],


The parametrization for test_read_tfrecords has been reduced to only test with compression=None. Since the default path (pure-pyarrow + tensorflow) supports reading compressed TFRecord files via arrow_open_stream_args, we should retain the GZIP test cases to ensure continued support and avoid regressions in the remaining implementation.

"with_tf_schema,compression", [ (True, None), (True, "GZIP"), (False, None), (False, "GZIP"), ],

## Summary Ray Data now pins `pyarrow>=17`, but `tfx-bsl` caps `pyarrow<11`, so the optional tfx-bsl-backed reader for `read_tfrecords` can no longer install alongside Ray. Remove the tfx-bsl path entirely; TFRecord files continue to read via the default pure-pyarrow + tensorflow path. ## Changes - Drop `TFXReadOptions` and the `tfx_read_options` kwarg on `ray.data.read_tfrecords`; drop the public re-export. - Remove `_tfx_read_stream`, `_resolve_full_path`, and the schema-inference helpers used only by the tfx-bsl path. - Delete the dedicated CI: `datatfxbslbuild` docker image, `data_tfxbsl_depset` / `relaxed_data_tfxbsl_ci_depset` depset entries and lockfiles, the \"TFRecords (tfx-bsl) tests\" Buildkite step, `python/requirements/data/tfxbsl.txt`, and the `tfxbsl` test tag (so `test_tfrecords` now runs in the standard arrow v17/v23 data jobs). - Drop the `tfx_bsl.*` pyrefly ignore, the `TFXReadOptions` doc entry, and the corresponding pydoclint baseline line. `read_tfrecords` and `TFXReadOptions` were both `@PublicAPI(stability=\"alpha\")`, so per Ray's [API stability policy](https://docs.ray.io/en/latest/ray-contribute/stability.html) breaking changes are permitted without a deprecation cycle. Signed-off-by: Goutam <goutam@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

goutamvenkat-anyscale requested review from a team as code owners May 9, 2026 00:36

gemini-code-assist Bot reviewed May 9, 2026

View reviewed changes

goutamvenkat-anyscale added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels May 9, 2026

richardliaw approved these changes May 9, 2026

View reviewed changes

aslonnie approved these changes May 11, 2026

View reviewed changes

aslonnie requested a review from elliot-barn May 11, 2026 15:37

goutamvenkat-anyscale enabled auto-merge (squash) May 11, 2026 16:09

goutamvenkat-anyscale merged commit 09ab4fe into ray-project:master May 11, 2026
8 of 9 checks passed

goutamvenkat-anyscale deleted the data/remove-tfx-bsl branch May 11, 2026 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Data] Remove tfx-bsl support from read_tfrecords#63245

[Data] Remove tfx-bsl support from read_tfrecords#63245
goutamvenkat-anyscale merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:data/remove-tfx-bsl

goutamvenkat-anyscale commented May 9, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

gemini-code-assist Bot May 9, 2026

gemini-code-assist Bot May 9, 2026

gemini-code-assist Bot May 9, 2026

Uh oh!

Labels

3 participants

Uh oh!

Conversation

goutamvenkat-anyscale commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist Bot May 9, 2026

Choose a reason for hiding this comment

gemini-code-assist Bot May 9, 2026

Choose a reason for hiding this comment

gemini-code-assist Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

Labels

3 participants

goutamvenkat-anyscale commented May 9, 2026 •

edited

Loading