[Data] Add include_row_hash to read_parquet by wingkitlee0 · Pull Request #61408 · ray-project/ray

wingkitlee0 · 2026-03-01T14:45:46Z

Description

This PR adds an include_row_hash option to read_parquet, which adds a new row_hash column to the output. The row hash is computed from the file path, each row's index after filtering, and a mixing step (so values are spread across the uint64 range rather than clustering in a few buckets).

Row hashes are unique across the rows you actually read for a given read configuration (same files, same filter, same ordering). They are reproducible under that same configuration, which supports checkpointing for Ray Data and Ray Train.

The column type is unsigned 64-bit integer (uint64).

Row hash semantics (filters and checkpointing)

Each row_hash is deterministic for a given read: it uses the file path and the row's position after filtering (0-based—the first row that survives the filter is 0, the next is 1, and so on). It is not the row's index in the raw Parquet file before filtering.

If you change the filter, the columns you read, or the files you read, then the set of rows that appear (and their positions after filtering) can change, so hashes can change too.
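To make the dependence on the filter concrete, here is a minimal sketch in plain Python (no Ray involved). The row values, the two predicates, and the helper `positions_after_filter` are all hypothetical and exist only for illustration; the sketch shows how the same row can land at a different post-filter position, and therefore get a different key, when the filter changes.

```python
# Illustrative only: plain Python, not Ray's implementation.
rows = [10, 25, 30, 45, 50]  # made-up values from one hypothetical file


def positions_after_filter(values, predicate):
    """Map each surviving value to its 0-based position in the filtered output."""
    survivors = [v for v in values if predicate(v)]
    return {v: i for i, v in enumerate(survivors)}


pos_a = positions_after_filter(rows, lambda v: v >= 25)
pos_b = positions_after_filter(rows, lambda v: v >= 30)

print(pos_a)  # {25: 0, 30: 1, 45: 2, 50: 3}
print(pos_b)  # {30: 0, 45: 1, 50: 2}
```

The value 30 sits at position 1 under the first filter and position 0 under the second, so its key (path seed plus position) differs between the two reads.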

For checkpointing and resume, we assume you keep the same read setup, including the same filter, across runs. Rows that were filtered out are not part of the pipeline anyway, so identifying rows after filtering is enough; we do not rely on pre-filter physical row positions for that use case.
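As a rough illustration of the resume idea, the sketch below skips rows whose hash was already recorded. This is not how Ray Data or Ray Train checkpointing is implemented; the `resume` helper, the row dictionaries, and the hash values are hypothetical, and the sketch assumes the read configuration is identical across runs so that hashes match.

```python
def resume(rows_with_hashes, processed_hashes):
    """Yield only rows whose row_hash has not been recorded as processed."""
    for row in rows_with_hashes:
        if row["row_hash"] not in processed_hashes:
            yield row


# Hypothetical rows, shaped like the output of a read that includes row_hash.
rows = [
    {"row_hash": 111, "value": "a"},
    {"row_hash": 222, "value": "b"},
    {"row_hash": 333, "value": "c"},
]
checkpoint = {111, 222}  # hashes recorded before the interruption

remaining = list(resume(rows, checkpoint))
```

Because filtered-out rows never enter the pipeline, they never need an entry in the checkpoint set.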

Related issues

Closes #61410

Additional information

How it works:

1. Path seed: For each Parquet file, MD5-hash its file path and take the first 8 bytes as a uint64 seed. Identical data in different files still gets different hashes because paths differ.
2. Row keys: After filtering, add each row's 0-based index in the filtered output for that file (tracked across batches) to the path seed: key = path_seed + row_index.
3. Mix: Apply the splitmix64 finalizer (a bijective 64-bit integer mixing function) to scatter nearby keys across the full uint64 range:

  keys ^= keys >> 30
  keys *= 0xBF58476D1CE4E5B9
  keys ^= keys >> 27
  keys *= 0x94D049BB133111EB
  keys ^= keys >> 31

All operations are vectorized with NumPy—no Python loops.
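The three steps above can be sketched end to end with NumPy. This is a minimal sketch under stated assumptions, not the code from this PR: the function names are hypothetical, and the byte order used to turn the first 8 MD5 bytes into a uint64 (little-endian here) is an assumption, since the description does not specify it.

```python
import hashlib

import numpy as np


def path_seed(path: str) -> np.uint64:
    """First 8 bytes of MD5(path) as a uint64 (little-endian is an assumption)."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return np.uint64(int.from_bytes(digest[:8], "little"))


def splitmix64_mix(keys: np.ndarray) -> np.ndarray:
    """Apply the splitmix64 finalizer; uint64 array math wraps modulo 2**64."""
    keys = keys.astype(np.uint64, copy=True)
    keys ^= keys >> np.uint64(30)
    keys *= np.uint64(0xBF58476D1CE4E5B9)
    keys ^= keys >> np.uint64(27)
    keys *= np.uint64(0x94D049BB133111EB)
    keys ^= keys >> np.uint64(31)
    return keys


def row_hashes(path: str, start_index: int, num_rows: int) -> np.ndarray:
    """Hash one batch; start_index is the file's running post-filter row count."""
    idx = np.arange(start_index, start_index + num_rows, dtype=np.uint64)
    keys = idx + path_seed(path)  # key = path_seed + row_index
    return splitmix64_mix(keys)


# Two batches from the same (hypothetical) file, with the offset carried over.
first = row_hashes("data/part-0.parquet", 0, 3)
second = row_hashes("data/part-0.parquet", 3, 2)
whole = row_hashes("data/part-0.parquet", 0, 5)
```

Carrying the per-file offset across batches means the hashes do not depend on how the file happens to be split into batches: `first` and `second` concatenated equal `whole`.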

Properties:

- Reproducible: Same file path + same filter + same position after filtering → same hash.
- Unique: Different files get different seeds (via MD5 of path); different rows in the filtered output get different indices. The splitmix64 step is bijective, so distinct keys never produce the same hash. A cross-file collision would require two path seeds to fall within one file's row count of each other, which is very unlikely for 64-bit seeds.
- Fast: One MD5 call per file, then pure NumPy vectorized arithmetic per batch.
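One way to check the bijectivity claim is to write the inverse of the mixing function and confirm that it round-trips. The sketch below is plain Python with explicit 64-bit masking; `mix` and `unmix` are hypothetical names, and the inverse is derived here from the five steps listed earlier rather than taken from the PR.

```python
MASK = (1 << 64) - 1
M1 = 0xBF58476D1CE4E5B9
M2 = 0x94D049BB133111EB


def mix(x: int) -> int:
    """splitmix64 finalizer on a Python int, masked to 64 bits."""
    x ^= x >> 30
    x = (x * M1) & MASK
    x ^= x >> 27
    x = (x * M2) & MASK
    x ^= x >> 31
    return x


def unmix(y: int) -> int:
    """Undo mix() by inverting each step in reverse order."""
    y ^= (y >> 31) ^ (y >> 62)             # invert x ^= x >> 31
    y = (y * pow(M2, -1, 1 << 64)) & MASK  # multiply by the modular inverse of M2
    y ^= (y >> 27) ^ (y >> 54)             # invert x ^= x >> 27
    y = (y * pow(M1, -1, 1 << 64)) & MASK  # multiply by the modular inverse of M1
    y ^= (y >> 30) ^ (y >> 60)             # invert x ^= x >> 30
    return y


samples = [0, 1, 2, 123456789, MASK]
round_trip_ok = all(unmix(mix(x)) == x for x in samples)
```

Both multipliers are odd, so they have inverses modulo 2**64, and each xor-shift can be undone by repeating the shift until it exceeds 64 bits. An inverse existing for every step is what makes the whole function a bijection on uint64 values. Note that `pow(M, -1, mod)` requires Python 3.8 or later.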

gemini-code-assist

Code Review

This pull request introduces a useful include_row_hash option to read_parquet, which is valuable for checkpointing and data versioning. The implementation is generally solid and consistent with existing features like include_paths. However, I've identified a critical bug that can cause a crash when include_row_hash=True is used on a file that already contains a row_hash column, particularly when no specific columns are selected for reading. I've provided details and a suggested fix for this issue. Additionally, I've included a few medium-severity suggestions to improve user experience by adding a warning for column name conflicts, updating the documentation to clarify this behavior, and enhancing test coverage for this edge case.

iamjustinhsu · 2026-04-14T19:07:21Z

+            logger.warning(
+                "The Parquet file(s) already contain a column named 'row_hash'. "
+                "It will be overwritten by the generated row hash column."
+            )


I don't think the warning is necessary, since:

1. We don't emit one for include_paths
2. We explicitly say so in the documentation

That said, if you add a warning here, you probably want to add one for include_paths too, to keep it consistent.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cb74257.

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>


gemini-code-assist Bot reviewed Mar 1, 2026


Comment thread python/ray/data/_internal/datasource/parquet_datasource.py

Comment thread python/ray/data/_internal/datasource/parquet_datasource.py

Comment thread python/ray/data/read_api.py Outdated

Comment thread python/ray/data/tests/datasource/test_parquet.py

wingkitlee0 force-pushed the kit/read-row-hash branch from e1514fa to 2a97a98 Compare March 2, 2026 04:19

wingkitlee0 force-pushed the kit/read-row-hash branch 2 times, most recently from 5baca95 to a403b17 Compare March 15, 2026 14:20

wingkitlee0 added the go add ONLY when ready to merge, run all tests label Mar 15, 2026

wingkitlee0 marked this pull request as ready for review March 18, 2026 12:24

wingkitlee0 requested a review from a team as a code owner March 18, 2026 12:24

cursor Bot reviewed Mar 18, 2026


Comment thread python/ray/data/_internal/datasource/parquet_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/parquet_datasource.py Outdated

ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Mar 18, 2026

wingkitlee0 marked this pull request as draft March 19, 2026 02:32

wingkitlee0 force-pushed the kit/read-row-hash branch 4 times, most recently from f49769f to 05965d0 Compare March 22, 2026 00:41

wingkitlee0 marked this pull request as ready for review March 22, 2026 03:39

wingkitlee0 force-pushed the kit/read-row-hash branch from 05965d0 to a24e8a4 Compare March 29, 2026 00:41

wingkitlee0 force-pushed the kit/read-row-hash branch from a24e8a4 to e9c89a4 Compare April 5, 2026 19:03

wingkitlee0 marked this pull request as draft April 5, 2026 19:05

wingkitlee0 force-pushed the kit/read-row-hash branch 2 times, most recently from b4305f3 to 09edc82 Compare April 11, 2026 12:48

wingkitlee0 marked this pull request as ready for review April 11, 2026 12:50

iamjustinhsu self-assigned this Apr 14, 2026

iamjustinhsu approved these changes Apr 14, 2026


wingkitlee0 force-pushed the kit/read-row-hash branch 2 times, most recently from b2474d7 to fc7a149 Compare April 18, 2026 19:46

cursor Bot reviewed Apr 18, 2026


Comment thread python/ray/data/_internal/datasource/parquet_datasource.py

wingkitlee0 force-pushed the kit/read-row-hash branch from cb74257 to 77e18b9 Compare April 22, 2026 01:13

[Data] Add include_row_hash to read_parquet

e3eb186

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

wingkitlee0 force-pushed the kit/read-row-hash branch from 77e18b9 to e3eb186 Compare April 27, 2026 00:03

richardliaw merged commit e1fe22f into ray-project:master Apr 28, 2026
6 checks passed
