Skip to content

[Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype#58225

Merged
raulchen merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:goutam/1_n_datatype
Nov 13, 2025
Merged

[Data] - [1/n] Add Temporal, list, tensor, struct datatype support to RD Datatype#58225
raulchen merged 1 commit into
ray-project:masterfrom
goutamvenkat-anyscale:goutam/1_n_datatype

Conversation

@goutamvenkat-anyscale

Copy link
Copy Markdown
Contributor

Description

As title suggests

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner October 27, 2025 20:59

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds significant new functionality to DataType by introducing support for list, struct, map, tensor, and temporal types, along with a pattern-matching system using DataType.ANY. The implementation is well-structured and includes a comprehensive set of factory methods and type predicates. The accompanying tests cover most of the new functionality thoroughly.

My review focuses on a few areas for improvement:

  1. Doctest Corrections: The __repr__ output for pattern-matching DataTypes in several new factory method docstrings is incorrect and will cause doctests to fail.
  2. Implementation Consistency: The is_numerical_type method has an inconsistency between its documentation and implementation regarding boolean types for Arrow and NumPy.
  3. Test Coverage: The tests for is_numerical_type should be expanded to cover the boolean cases for Arrow and NumPy types.

Overall, this is a great addition. Addressing these points will improve the correctness and robustness of the new features.

Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py
Comment thread python/ray/data/tests/test_datatype.py
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the goutam/1_n_datatype branch 2 times, most recently from 6841413 to e101fec Compare October 27, 2025 23:03
cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Oct 28, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Oct 28, 2025

@iamjustinhsu iamjustinhsu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a first pass, didn't look at tests, but good job 👍

Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py
Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py Outdated
>>> DataType.fixed_size_list(DataType.float32(), 3)
DataType(arrow:fixed_size_list<item: float>[3])
"""
value_arrow_type = value_type.to_arrow_dtype()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come this doesn't include the checks for value_type is None or isinstance(value_type, _LogicalDataType)? Also, can you elaborate on the meaning of this check and why it's necessary? Since I see it a lot, I suggest pushing it into a method

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So fixed_size_list requires both element type and list_size. For instance, I wanted to prevented cases like this: DataType.fixed_size_list(DataType.ANY, 3)

Comment thread python/ray/data/datatype.py Outdated
Comment thread python/ray/data/datatype.py
Comment thread python/ray/data/datatype.py
Comment thread python/ray/data/datatype.py
Comment thread python/ray/data/datatype.py
Comment thread python/ray/data/datatype.py
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the goutam/1_n_datatype branch 2 times, most recently from 67674c2 to 296efc2 Compare November 10, 2025 21:09
Comment thread python/ray/data/datatype.py
… RD Datatype

Signed-off-by: Goutam <goutam@anyscale.com>
assert isinstance(result, DataType)
assert result.is_arrow_type()
assert result._internal_type == pa_type
assert result._physical_dtype == pa_type

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good coverage of unit test.
Can you also add some e2e tests that use these types from the user-facing API?

@goutamvenkat-anyscale goutamvenkat-anyscale Nov 13, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we don't actually rely on the datatype for anything yet. We use it UDFExpr but there's no validation or schema updates based off that. The e2e tests will make sense once we plumb in schema inference & also use it for the new dataset.summary API.

@raulchen raulchen merged commit c3ba35e into ray-project:master Nov 13, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/1_n_datatype branch November 13, 2025 21:04
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
… RD Datatype (ray-project#58225)

## Description
As title suggests

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
… RD Datatype (ray-project#58225)

## Description
As title suggests

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
… RD Datatype (ray-project#58225)

## Description
As title suggests

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants