[Data] Make All Preprocessors Implement SerializablePreprocessorBase #61341

Merged
bveeramani merged 55 commits into ray-project:master from rayhhome:make-prepro-serializable on Mar 4, 2026
@rayhhome

Contributor

Description

The SerializablePreprocessorBase abstract class declares the methods for saving and loading preprocessor state, and every preprocessor should implement it. This PR implements the abstract methods for all preprocessors that do not yet inherit from this base class.
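
For readers unfamiliar with the pattern, here is a minimal sketch of what "implementing the abstract methods" could look like. It does not import Ray: `SerializableBaseSketch`, `NormalizerSketch`, `serialize`, and `deserialize` are hypothetical stand-ins, and the real `SerializablePreprocessorBase` signatures may differ (for example, by taking extra arguments). Only the two method names `_get_serializable_fields` and `_set_serializable_fields` come from this PR.

```python
import pickle
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SerializableBaseSketch(ABC):
    """Hypothetical stand-in for the abstract base; not Ray's actual class."""

    @abstractmethod
    def _get_serializable_fields(self) -> Dict[str, Any]:
        """Return the configuration needed to rebuild this object."""

    @abstractmethod
    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        """Restore the object's configuration from ``fields``."""

    def serialize(self) -> bytes:
        # Only the declared fields are pickled, not the whole instance.
        return pickle.dumps(self._get_serializable_fields())

    @classmethod
    def deserialize(cls, data: bytes) -> "SerializableBaseSketch":
        # Skip __init__ on purpose: all state comes from the saved fields.
        obj = cls.__new__(cls)
        obj._set_serializable_fields(pickle.loads(data))
        return obj


class NormalizerSketch(SerializableBaseSketch):
    """Toy preprocessor with two configuration fields (assumed for the example)."""

    def __init__(self, columns: List[str], norm: str = "l2") -> None:
        self.columns = list(columns)
        self.norm = norm

    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "norm": self.norm}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.norm = fields["norm"]


original = NormalizerSketch(["a", "b"], norm="l1")
restored = NormalizerSketch.deserialize(original.serialize())
```

Because the subclass lists its fields explicitly, adding a new constructor argument without also adding it to both methods would show up as a missing attribute after a round trip, which is the kind of regression a field-level test can catch.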

Related issues

Related to #61028, which implemented backwards compatibility for the legacy pickling layer. The __setstate__ methods should be removed along with the deprecated Predictor; going forward, _get_serializable_fields and _set_serializable_fields should be used for saving and loading preprocessor state.
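
To make the distinction between the two paths concrete, the sketch below shows one plausible shape of a legacy `__setstate__` shim sitting next to the field-based methods. This is an assumption for illustration only: the actual logic in #61028 is not shown in this thread, and `TokenizerSketch` and its `output_columns` default are hypothetical.

```python
from typing import Any, Dict, List, Optional


class TokenizerSketch:
    """Toy class; not Ray's Tokenizer."""

    def __init__(self, columns: List[str],
                 output_columns: Optional[List[str]] = None) -> None:
        self.columns = list(columns)
        self.output_columns = list(output_columns or columns)

    # New path: explicit, named fields.
    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "output_columns": self.output_columns}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.output_columns = fields["output_columns"]

    # Legacy path: pickle calls this with the old instance __dict__.
    # Old pickles may predate attributes that were added later, so the shim
    # fills in a default instead of leaving the attribute missing.
    def __setstate__(self, state: Dict[str, Any]) -> None:
        self.__dict__.update(state)
        if "output_columns" not in state:
            self.output_columns = list(state["columns"])


# Simulate what pickle does when loading an old payload: allocate the object
# without running __init__, then hand the saved __dict__ to __setstate__.
legacy_state = {"columns": ["text"]}
from_legacy = TokenizerSketch.__new__(TokenizerSketch)
from_legacy.__setstate__(legacy_state)

# The new path round-trips through the explicit field dictionary instead.
from_fields = TokenizerSketch.__new__(TokenizerSketch)
from_fields._set_serializable_fields(
    TokenizerSketch(["text"], ["tokens"])._get_serializable_fields()
)
```

Once the legacy path is no longer needed, deleting `__setstate__` leaves the field-based methods as the single source of truth for what gets saved.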

Additional information

Accompanied by new field-serialization tests for all preprocessors involved.

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Copilot AI review requested due to automatic review settings February 26, 2026 03:06
@rayhhome rayhhome requested a review from a team as a code owner February 26, 2026 03:06
Copilot AI left a comment

Pull request overview

This PR implements serialization support for preprocessors that were not yet inheriting from SerializablePreprocessorBase. The changes add the required abstract methods _get_serializable_fields() and _set_serializable_fields() to enable CloudPickle-based serialization for all preprocessors.

Changes:

  • Migrated the preprocessor classes in 9 modules (11 classes, per the file summary) to inherit from SerializablePreprocessorBase and implement serialization methods
  • Added comprehensive serialization tests for all migrated preprocessors
  • Updated imports and added @SerializablePreprocessor decorator with version and identifier metadata
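
As a rough illustration of why a decorator would carry version and identifier metadata, the sketch below registers a class under a stable identifier and stamps each payload with a version so a loader can find the right class and reject payloads it does not understand. Everything here is hypothetical (`serializable_sketch`, `_REGISTRY`, `dump`, `load`, the identifier string, and the payload layout); it is not the behavior of Ray's `@SerializablePreprocessor`, whose details are not shown in this thread.

```python
from typing import Any, Callable, Dict, List, Type

# Hypothetical registry mapping a stable identifier to a class.
_REGISTRY: Dict[str, Type] = {}


def serializable_sketch(identifier: str, version: int) -> Callable[[Type], Type]:
    """Attach metadata to a class and register it under ``identifier``."""

    def wrap(cls: Type) -> Type:
        cls.SERIALIZER_ID = identifier
        cls.SERIALIZER_VERSION = version
        _REGISTRY[identifier] = cls
        return cls

    return wrap


@serializable_sketch(identifier="example.feature_hasher", version=1)
class FeatureHasherSketch:
    """Toy class; not Ray's FeatureHasher."""

    def __init__(self, columns: List[str], num_features: int) -> None:
        self.columns = list(columns)
        self.num_features = num_features

    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "num_features": self.num_features}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.num_features = fields["num_features"]


def dump(obj: Any) -> Dict[str, Any]:
    # The identifier survives class renames; the version guards the layout.
    return {
        "id": type(obj).SERIALIZER_ID,
        "version": type(obj).SERIALIZER_VERSION,
        "fields": obj._get_serializable_fields(),
    }


def load(payload: Dict[str, Any]) -> Any:
    cls = _REGISTRY[payload["id"]]
    if payload["version"] > cls.SERIALIZER_VERSION:
        raise ValueError("payload was written by a newer version of the class")
    obj = cls.__new__(cls)  # state comes from the payload, not __init__
    obj._set_serializable_fields(payload["fields"])
    return obj


payload = dump(FeatureHasherSketch(["token_a", "token_b"], num_features=8))
loaded = load(payload)
```

Keying the registry on an identifier rather than the import path is a design choice that lets a class move between modules without invalidating previously saved payloads.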

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| python/ray/data/preprocessors/vectorizer.py | Added serialization support for HashingVectorizer and CountVectorizer |
| python/ray/data/preprocessors/transformer.py | Added serialization support for PowerTransformer |
| python/ray/data/preprocessors/torch.py | Added serialization support for TorchVisionPreprocessor |
| python/ray/data/preprocessors/tokenizer.py | Added serialization support for Tokenizer |
| python/ray/data/preprocessors/normalizer.py | Added serialization support for Normalizer |
| python/ray/data/preprocessors/hasher.py | Added serialization support for FeatureHasher |
| python/ray/data/preprocessors/discretizer.py | Added serialization support for CustomKBinsDiscretizer and UniformKBinsDiscretizer |
| python/ray/data/preprocessors/concatenator.py | Added serialization support for Concatenator |
| python/ray/data/preprocessors/chain.py | Added serialization support for Chain preprocessor |
| python/ray/data/tests/preprocessors/test_vectorizer.py | Added serialization tests for vectorizers |
| python/ray/data/tests/preprocessors/test_transformer.py | Added serialization test for PowerTransformer |
| python/ray/data/tests/preprocessors/test_torch.py | Added serialization test for TorchVisionPreprocessor |
| python/ray/data/tests/preprocessors/test_tokenizer.py | Added serialization test for Tokenizer |
| python/ray/data/tests/preprocessors/test_normalizer.py | Added serialization test for Normalizer |
| python/ray/data/tests/preprocessors/test_hasher.py | Added serialization test for FeatureHasher |
| python/ray/data/tests/preprocessors/test_discretizer.py | Added serialization tests for discretizers |
| python/ray/data/tests/preprocessors/test_concatenator.py | Added serialization test for Concatenator |
| python/ray/data/tests/preprocessors/test_chain.py | Added serialization test for Chain |

Comment thread python/ray/data/preprocessors/chain.py Outdated
Comment thread python/ray/data/preprocessors/hasher.py Outdated
Comment thread python/ray/data/preprocessors/vectorizer.py Outdated
Comment thread python/ray/data/preprocessors/discretizer.py Outdated
Comment thread python/ray/data/preprocessors/discretizer.py Outdated
Comment thread python/ray/data/preprocessors/tokenizer.py Outdated
Comment thread python/ray/data/preprocessors/vectorizer.py Outdated
Comment thread python/ray/data/preprocessors/hasher.py Outdated
Comment thread python/ray/data/preprocessors/discretizer.py Outdated
Comment thread python/ray/data/preprocessors/chain.py Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Comment thread python/ray/data/preprocessors/discretizer.py Outdated
Comment thread python/ray/data/tests/preprocessors/test_torch.py Outdated
@ray-gardener ray-gardener Bot added the community-contribution (Contributed by the community) label Feb 26, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Feb 27, 2026
@rayhhome rayhhome self-assigned this Feb 28, 2026
Comment thread python/ray/data/preprocessors/tokenizer.py
@bveeramani bveeramani merged commit b9cbe1f into ray-project:master Mar 4, 2026
6 checks passed
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
…ay-project#61341)

Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
@rayhhome rayhhome deleted the make-prepro-serializable branch July 2, 2026 21:10
Labels

community-contribution (Contributed by the community), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

4 participants