[Data] Remove tfx-bsl support from read_tfrecords#63245
Conversation
Ray Data now requires pyarrow>=17, but tfx-bsl pins pyarrow<11, so the two cannot coexist. Drop the tfx-bsl read path along with TFXReadOptions and the dedicated CI infra; TFRecord files continue to be read via the default pure-pyarrow + tensorflow path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request removes the tfx-bsl dependency and all associated logic for reading TFRecord files, including the TFXReadOptions class and specialized schema inference. The read_tfrecords API is simplified to rely solely on the standard TensorFlow-based implementation. Review feedback identifies that the new function signature for read_tfrecords and the instantiation of TFRecordDatasource are missing **kwargs, which prevents users from passing file-based configuration options such as compression. Furthermore, it is suggested to retain the GZIP test cases to verify that compression support continues to work correctly in the remaining implementation.
| file_extensions: Optional[List[str]] = None, | ||
| concurrency: Optional[int] = None, | ||
| override_num_blocks: Optional[int] = None, | ||
| tfx_read_options: Optional["TFXReadOptions"] = None, | ||
| ) -> Dataset: |
There was a problem hiding this comment.
The read_tfrecords function signature is missing **kwargs. This prevents users from passing additional file-based datasource options like arrow_open_stream_args (used for compression) to the underlying TFRecordDatasource. Since the tests in test_tfrecords.py attempt to use this argument, it should be added to the signature to maintain consistency with other read_* APIs and support file-based options.
file_extensions: Optional[List[str]] = None,
concurrency: Optional[int] = None,
override_num_blocks: Optional[int] = None,
**kwargs,
) -> Dataset:| shuffle=shuffle, | ||
| include_paths=include_paths, | ||
| file_extensions=file_extensions, | ||
| tfx_read_options=tfx_read_options, | ||
| ) |
There was a problem hiding this comment.
| "with_tf_schema,compression", | ||
| [ | ||
| (True, True, None), | ||
| (True, True, "GZIP"), | ||
| (True, False, None), | ||
| (False, True, None), | ||
| (False, True, "GZIP"), | ||
| (False, False, None), | ||
| (True, None), | ||
| (False, None), | ||
| ], |
There was a problem hiding this comment.
The parametrization for test_read_tfrecords has been reduced to only test with compression=None. Since the default path (pure-pyarrow + tensorflow) supports reading compressed TFRecord files via arrow_open_stream_args, we should retain the GZIP test cases to ensure continued support and avoid regressions in the remaining implementation.
"with_tf_schema,compression",
[
(True, None),
(True, "GZIP"),
(False, None),
(False, "GZIP"),
],09ab4fe
into
ray-project:master
## Summary Ray Data now pins `pyarrow>=17`, but `tfx-bsl` caps `pyarrow<11`, so the optional tfx-bsl-backed reader for `read_tfrecords` can no longer install alongside Ray. Remove the tfx-bsl path entirely; TFRecord files continue to read via the default pure-pyarrow + tensorflow path. ## Changes - Drop `TFXReadOptions` and the `tfx_read_options` kwarg on `ray.data.read_tfrecords`; drop the public re-export. - Remove `_tfx_read_stream`, `_resolve_full_path`, and the schema-inference helpers used only by the tfx-bsl path. - Delete the dedicated CI: `datatfxbslbuild` docker image, `data_tfxbsl_depset` / `relaxed_data_tfxbsl_ci_depset` depset entries and lockfiles, the \"TFRecords (tfx-bsl) tests\" Buildkite step, `python/requirements/data/tfxbsl.txt`, and the `tfxbsl` test tag (so `test_tfrecords` now runs in the standard arrow v17/v23 data jobs). - Drop the `tfx_bsl.*` pyrefly ignore, the `TFXReadOptions` doc entry, and the corresponding pydoclint baseline line. `read_tfrecords` and `TFXReadOptions` were both `@PublicAPI(stability=\"alpha\")`, so per Ray's [API stability policy](https://docs.ray.io/en/latest/ray-contribute/stability.html) breaking changes are permitted without a deprecation cycle. Signed-off-by: Goutam <goutam@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Ray Data now pins
pyarrow>=17, buttfx-bslcapspyarrow<11, so the optional tfx-bsl-backed reader forread_tfrecordscan no longer install alongside Ray. Remove the tfx-bsl path entirely; TFRecord files continue to read via the default pure-pyarrow + tensorflow path.Changes
TFXReadOptionsand thetfx_read_optionskwarg onray.data.read_tfrecords; drop the public re-export._tfx_read_stream,_resolve_full_path, and the schema-inference helpers used only by the tfx-bsl path.datatfxbslbuilddocker image,data_tfxbsl_depset/relaxed_data_tfxbsl_ci_depsetdepset entries and lockfiles, the "TFRecords (tfx-bsl) tests" Buildkite step,python/requirements/data/tfxbsl.txt, and thetfxbsltest tag (sotest_tfrecordsnow runs in the standard arrow v17/v23 data jobs).tfx_bsl.*pyrefly ignore, theTFXReadOptionsdoc entry, and the corresponding pydoclint baseline line.read_tfrecordsandTFXReadOptionswere both@PublicAPI(stability=\"alpha\"), so per Ray's API stability policy breaking changes are permitted without a deprecation cycle.