[data][3/n] DataSourceV2: ParquetDatasourceV2 + read_parquet V2 dispatch by goutamvenkat-anyscale · Pull Request #63113 · ray-project/ray

goutamvenkat-anyscale · 2026-05-04T20:46:49Z

Wires the V2 Parquet read path end-to-end. Behind the
DataContext.use_datasource_v2 opt-in flag (default False);
read_parquet continues to take the V1 path until pr-Z flips the
default.

- datasource_v2/datasource_v2.py: resolve_partitioning base
  method (default None) so subclasses can derive partitioning
  field names from a sample without mutating instance state.
- datasource_v2/parquet_datasource_v2.py (new): ParquetDatasourceV2
  with file indexer, scanner factory, schema inference (parallelized
  footer reads + thread pool), resolve_partitioning override
  populating field_names from path discovery, and user-supplied
  schema overrides.
- read_api.py: _read_datasource_v2 driver entry that wires the
  ListFiles → ReadFiles op pair and threads through the file pruner
  list. read_parquet dispatches to V2 when
  use_datasource_v2=True. Raises NotImplementedError for
  _block_udf, tensor_column_schema, dataset_kwargs,
  and columns= (each gets its own follow-up: pr-C, pr-G).
- context.py: use_datasource_v2: bool flag, default False.
- BUILD.bazel: register the new V2 unit-test packages.
- doc/source/data/performance-tips.rst: brief mention.
- tests: parquet datasource unit tests, ListFiles-op fixture tests,
  read_files logical-op tests, V2 read_parquet end-to-end tests.

Signed-off-by: Goutam <goutam@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces the DataSourceV2 pipeline for Parquet files, establishing a logical chain from ListFiles to ReadFiles. It adds the ParquetDatasourceV2 class, integrates a configuration flag in DataContext to enable this new path, and provides extensive testing for schema inference and operator logic. Feedback is provided to optimize performance by eliminating unnecessary list conversions when indexing or iterating over file paths.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 1909034.

cursor · 2026-05-04T23:33:29Z

+            ray_remote_args=ray_remote_args,
+            concurrency=concurrency,
+            partition_filter=partition_filter,
+        )


V2 path silently drops include_row_hash parameter

Medium Severity

The V2 dispatch block raises NotImplementedError for unsupported parameters (_block_udf, tensor_column_schema, dataset_kwargs, columns) but silently ignores include_row_hash. When a user passes include_row_hash=True with use_datasource_v2=True, no row_hash column is produced and no error is raised, leading to silent data loss. The V1 path correctly passes include_row_hash to ParquetDatasource.

Resolved this in a follow up PR
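
One way to close this kind of gap is to run include_row_hash through the same check as the other unsupported arguments. The sketch below is only illustrative: `_reject_unsupported_v2_args` is a hypothetical helper name, and the actual fix landed in a follow-up PR that is not shown in this thread.

```python
from typing import Any, Dict, Optional


def _reject_unsupported_v2_args(
    unsupported: Dict[str, Any],
    defaults: Optional[Dict[str, Any]] = None,
) -> None:
    """Raise if any not-yet-supported argument is set to a non-default value.

    Hypothetical helper for illustration; the PR's actual checks may differ.
    """
    defaults = defaults or {}
    for name, value in unsupported.items():
        # Arguments without an explicit default are expected to be None.
        if value != defaults.get(name):
            raise NotImplementedError(
                f"`{name}` is not supported when use_datasource_v2=True"
            )


# Default-valued arguments pass silently.
_reject_unsupported_v2_args(
    {"columns": None, "include_row_hash": False},
    defaults={"include_row_hash": False},
)

# An explicit include_row_hash=True fails loudly instead of being dropped.
try:
    _reject_unsupported_v2_args(
        {"columns": None, "include_row_hash": True},
        defaults={"include_row_hash": False},
    )
except NotImplementedError as exc:
    message = str(exc)
```

Keeping every unsupported argument in one mapping makes it harder to forget a parameter when a new one is added to read_parquet.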

iamjustinhsu

some questions + comments, but overall nice. didn't look at tests

iamjustinhsu · 2026-05-05T21:20:51Z

+        File-based partitioned datasources override this to populate
+        path-discovered field names (e.g. hive ``Partitioning`` ships
+        with ``field_names=None`` and needs to read a sample path to
+        learn the keys). Keeping the discovery on a dedicated method


I'm not really sure I understand the "Keeping the discovery ..." last bit of the paragraph. Can you elaborate?

Basically after grabbing the FileManifests from the sample, this computes the Hive Partitioning scheme from that sample. Then this Partitioning is propagated to the scanner class.

I can make this clearer
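
To make the flow concrete, here is a rough sketch of discovering hive keys from one sample path and returning a new partitioning object rather than mutating the existing one. `HivePartitioning`, `discover_field_names`, and the standalone `resolve_partitioning` function are simplified stand-ins written for this comment, not the classes in the PR.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HivePartitioning:
    """Hypothetical stand-in for a hive-style partitioning config."""

    base_dir: str = ""
    field_names: Optional[Tuple[str, ...]] = None


def discover_field_names(path: str, base_dir: str = "") -> List[str]:
    """Collect the key of every `key=value` directory segment in `path`."""
    if base_dir and path.startswith(base_dir):
        path = path[len(base_dir):]
    names = []
    for segment in path.strip("/").split("/")[:-1]:  # skip the file name
        key, sep, _value = segment.partition("=")
        if sep and key:
            names.append(key)
    return names


def resolve_partitioning(
    partitioning: HivePartitioning, sample_paths: List[str]
) -> HivePartitioning:
    """Return a copy with field_names filled in; the input is left untouched."""
    if partitioning.field_names is not None or not sample_paths:
        return partitioning
    names = discover_field_names(sample_paths[0], partitioning.base_dir)
    return replace(partitioning, field_names=tuple(names))


unresolved = HivePartitioning(base_dir="s3://bucket/table")
resolved = resolve_partitioning(
    unresolved, ["s3://bucket/table/year=2024/month=05/part-0.parquet"]
)
```

The resolved copy is what would be handed to the scanner, while the datasource's own partitioning object keeps field_names=None.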

iamjustinhsu · 2026-05-05T21:24:24Z

+        # Parquet footer reads against high-latency object stores
+        # (S3, GCS) are ~50-100 ms each. Reading the sample's footers in
+        # parallel keeps driver-side schema inference bounded by the
+        # slowest single read rather than the sum. Order is preserved


Why does order matter in schema unification?

Two reasons (neither is type promotion; my original comment was wrong and is fixed in the latest push):

1. Field order anchors on the first schema. Verified on pyarrow 23.0.1:

    s1 = pa.schema([('a', pa.int64()), ('b', pa.string())])
    s2 = pa.schema([('c', pa.float64()), ('a', pa.int64())])
    pa.unify_schemas([s1, s2]).names  # ['a', 'b', 'c']
    pa.unify_schemas([s2, s1]).names  # ['c', 'a', 'b']

Without preserved input order, the unified schema's column order is non-deterministic.

2. sample_paths[0] drives partition discovery below: PathPartitionParser runs against the first path to extract hive keys, so the schema and path lists need to stay aligned.

Type promotion itself is order-independent under permissive mode (null + int → int regardless of position).

oh are u saying that the FileManifests underlying block schema is ordered?
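
For reference, a minimal sketch of parallel footer reads that keep input order, using only the standard library. `read_footer_schema` is a placeholder that sleeps to mimic object-store latency, and `infer_sample_schemas` and its `max_workers` default are made-up names and values, not the PR's.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def read_footer_schema(path: str) -> Dict[str, str]:
    """Placeholder for a Parquet footer read; sleeps to mimic latency."""
    time.sleep(0.05)
    return {"source": path}


def infer_sample_schemas(
    paths: List[str], max_workers: int = 16
) -> List[Dict[str, str]]:
    """Read all footers concurrently, returning results in input order."""
    if not paths:
        return []
    workers = min(max_workers, len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of `paths`, regardless
        # of which read finishes first.
        return list(pool.map(read_footer_schema, paths))


sample_paths = [f"part-{i}.parquet" for i in range(8)]
start = time.perf_counter()
schemas = infer_sample_schemas(sample_paths)
elapsed = time.perf_counter() - start
```

With eight simulated 50 ms reads, `elapsed` should land near the cost of one read rather than eight, and `schemas[i]` still corresponds to `sample_paths[i]`.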

iamjustinhsu · 2026-05-05T21:28:03Z

+            partition_pa_schema = _partition_field_types_to_pa_schema(
+                list(partition_kv.keys()), resolved_partitioning.field_types or {}
+            )
+            for field_name in partition_kv.keys():


no unify_schemas_with_validation?

It breaks test_read_file_with_partition_values for instance.

The loop only adds partition fields that are missing from the file schema, so on a name collision the file's type silently wins. unify_schemas_with_validation instead tries to merge both types, and PyArrow has no promotion path between unrelated primitives (e.g. int64 <-> string), so any collision where the file column's type doesn't match the partition's declared/default type raises ArrowTypeError and the read crashes.

wait should that be an error tho? like if the column's types don't match? Or are u saying that the inference of column types might be wrong? not sure what test_read_file_with_partition_values is supposed to test.

Yea the inference could be incorrect.
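
The difference between the two behaviors can be sketched without PyArrow. Below, a schema is simplified to a list of (name, type name) pairs; `append_missing_partition_fields` mirrors the loop described above, and `strict_merge` is a hypothetical stand-in for a validating merge that rejects mismatched types.

```python
from typing import List, Optional, Tuple

# (column name, type name); a simplified stand-in for a PyArrow field.
Field = Tuple[str, str]


def append_missing_partition_fields(
    file_fields: List[Field], partition_fields: List[Field]
) -> List[Field]:
    """Append partition columns that the file schema does not already define."""
    existing = {name for name, _ in file_fields}
    merged = list(file_fields)
    for name, type_name in partition_fields:
        if name not in existing:  # on a collision the file's type wins
            merged.append((name, type_name))
            existing.add(name)
    return merged


def strict_merge(
    file_fields: List[Field], partition_fields: List[Field]
) -> List[Field]:
    """Stricter variant: raise when a shared name has mismatched types."""
    file_types = dict(file_fields)
    for name, type_name in partition_fields:
        if name in file_types and file_types[name] != type_name:
            raise TypeError(
                f"column {name!r}: file has {file_types[name]}, "
                f"partition declares {type_name}"
            )
    return append_missing_partition_fields(file_fields, partition_fields)


file_fields = [("id", "int64"), ("year", "int64")]
# "year" was inferred as a string from the path, which may be wrong.
partition_fields = [("year", "string"), ("month", "string")]

merged = append_missing_partition_fields(file_fields, partition_fields)

collision_error: Optional[TypeError] = None
try:
    strict_merge(file_fields, partition_fields)
except TypeError as exc:
    collision_error = exc
```

Here `merged` keeps `year` as int64 from the file, whereas the strict variant fails the read because the path-inferred type disagrees.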

iamjustinhsu · 2026-05-05T21:30:54Z

+        min_bucket_size=min_bucket_size,
+        max_bucket_size=max_bucket_size,


The size here is referring to size of the file, not length of file name, right?

You mean the get_size_estimator()?

That's just the encoding ratio * uncompressed file sizes for parquet. It's not pertaining to the length of the file name.
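
As a rough illustration, the estimate depends only on a byte count, never on the path string. `FileEntry`, `estimate_in_memory_bytes`, and the ratio of 5.0 are hypothetical; which size figure the real get_size_estimator() multiplies is an assumption here.

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    path: str
    size_bytes: int  # size reported by the file listing (assumed input)


def estimate_in_memory_bytes(entry: FileEntry, encoding_ratio: float) -> int:
    """Scale the file's byte size by the encoding ratio."""
    return int(entry.size_bytes * encoding_ratio)


short_name = FileEntry("a.parquet", 100)
long_name = FileEntry("year=2024/month=05/a-very-long-file-name.parquet", 100)

# Both files report the same byte size, so both get the same estimate.
short_estimate = estimate_in_memory_bytes(short_name, encoding_ratio=5.0)
long_estimate = estimate_in_memory_bytes(long_name, encoding_ratio=5.0)
```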

iamjustinhsu · 2026-05-05T21:32:24Z

+
+    read_op = ReadFiles(
+        input_op=list_files_op,
+        datasource=datasource,


can you remind me why datasource needs to be passed in for ReadFiles? I thought we just needed scanner?

Oh lol it's actually just being used for retrieving the name of the datasource for the logical operator naming

I can change ReadFiles to just take in name instead.

iamjustinhsu · 2026-05-05T21:33:28Z

+            f"no files found under {datasource.paths!r}. Check the path and any "
+            "configured `partition_filter` or `file_extensions` filters."
+        )
+    schema = datasource.infer_schema(sample)


Hmm with 1000s of files, will this be slow? Should delay schema inference until we start streaming execution?

Discussed offline. The sample size is ~16 files. Delaying schema inference until execution won't be possible without rearchitecting the logical plan to have its own IR

iamjustinhsu · 2026-05-06T19:26:36Z

        """
        ...
+
+    def resolve_partitioning(self, sample: InputSplit) -> Optional[Any]:


Can the type annotation be Optional[Partitioning]

Partitioning seems to be catering only to file based datasources. So I can't use that across the board
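
A small sketch of why the base signature stays loose: a file-based source returns a partitioning object, while a source with no paths inherits the default None. All three class names are invented for illustration, and the dict return value stands in for a real Partitioning.

```python
from typing import Any, List, Optional


class DataSourceV2Sketch:
    """Hypothetical base class; the default means no partitioning."""

    def resolve_partitioning(self, sample: List[str]) -> Optional[Any]:
        return None


class FileSourceSketch(DataSourceV2Sketch):
    """File-based source: derives hive keys from the first sample path."""

    def resolve_partitioning(self, sample: List[str]) -> Optional[Any]:
        if not sample:
            return None
        keys = [
            segment.split("=", 1)[0]
            for segment in sample[0].split("/")
            if "=" in segment
        ]
        return {"field_names": keys}


class SqlSourceSketch(DataSourceV2Sketch):
    """Non-file source: inherits the default and returns None."""


file_result = FileSourceSketch().resolve_partitioning(
    ["data/year=2024/part-0.parquet"]
)
sql_result = SqlSourceSketch().resolve_partitioning(["SELECT 1"])
```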

goutamvenkat-anyscale requested a review from a team as a code owner May 4, 2026 20:46

goutamvenkat-anyscale changed the title from "[data] DataSourceV2: ParquetDatasourceV2 + read_parquet V2 dispatch" to "[data][3/n] DataSourceV2: ParquetDatasourceV2 + read_parquet V2 dispatch" May 4, 2026

gemini-code-assist Bot reviewed May 4, 2026

cursor Bot reviewed May 4, 2026

goutamvenkat-anyscale force-pushed the pr63113 branch from b6c3da1 to 97c4779 Compare May 4, 2026 21:22

goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels May 4, 2026

goutamvenkat-anyscale force-pushed the pr63113 branch from 97c4779 to 1909034 Compare May 4, 2026 23:28

cursor Bot reviewed May 4, 2026

goutamvenkat-anyscale force-pushed the pr63113 branch 2 times, most recently from 2598656 to 92aef3b Compare May 5, 2026 03:06

iamjustinhsu reviewed May 5, 2026

goutamvenkat-anyscale force-pushed the pr63113 branch from 92aef3b to 93f7970 Compare May 5, 2026 23:23

iamjustinhsu reviewed May 6, 2026

iamjustinhsu approved these changes May 6, 2026

goutamvenkat-anyscale merged commit abf0d0a into ray-project:master May 6, 2026
7 checks passed

goutamvenkat-anyscale deleted the pr63113 branch May 6, 2026 20:22
