[data] Optimize concat tables further for happy path by iamjustinhsu · Pull Request #61315 · ray-project/ray

iamjustinhsu · 2026-02-25T18:40:13Z

Description

Currently, we concat tables together in every map_task. In the worst case, blocks have different schemas, so the schemas must be unified (and hence the blocks' columns too). However, most cases hit the happy path, where all blocks have the same column types.

Main change

This PR creates a happy path optimization to use the built-in pa.concat_tables when all blocks share the same schema.

If all tables have the same type for a column, we use pa.concat_tables.
Otherwise, we concat the tables that do share the unified schema type; call that result A (fast). Then we concat all the differing tables manually; call that result B (slow). Then we concat A and B together.

Other changes

Allows extension types (Python objects, tensors) to be concat'ed via the fast path pa.concat_tables, so long as all the extension types are the same. Essentially, promote_types=permissive does not work for tensor or object extensions, so we can only take the fast path when the column types across the blocks are equal.
Adds more docstrings.

Benchmarks

For 3000 individual tables with the same tensor schema, the new code is about 1.5x slower than calling pa.concat_tables directly, down from roughly 25x slower in my original testing.

script: https://gist.github.com/iamjustinhsu/2f35c7802101876598ef51325f31f772

Related issues

None

Additional information

None

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a significant optimization to table concatenation by adding a fast path for blocks with identical schemas, which is a great improvement. The refactoring to allow extension types to use this fast path when their types are uniform across blocks is also a solid enhancement. I appreciate the improved docstrings and examples, which increase code clarity. I've left a couple of minor suggestions regarding type hints to align them with the new, more memory-efficient generator-based calls. Overall, these are excellent changes that improve both performance and maintainability.

iamjustinhsu · 2026-02-25T18:52:33Z

bugbot run

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>


[data] Optimize concat tables further for fast path

368ee51

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu requested a review from a team as a code owner February 25, 2026 18:40

gemini-code-assist Bot reviewed Feb 25, 2026


Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py

Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated

small adjustments

97026e2

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Feb 25, 2026

cursor Bot reviewed Feb 25, 2026


Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py

iamjustinhsu added 3 commits February 25, 2026 11:47

add back null calculation

9eb9ebb

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

fix

21773ce

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

handle mismatched blocks only

c51a6c4

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor Bot reviewed Feb 25, 2026


Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py

iamjustinhsu force-pushed the jhsu/fast-path-pyarrow-concat branch from 1f8f2ba to 09ce535 Compare February 25, 2026 23:16

add a preserve_order with mismatched blocks

1b63642

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu force-pushed the jhsu/fast-path-pyarrow-concat branch from 09ce535 to 1b63642 Compare February 25, 2026 23:50

iamjustinhsu added 3 commits February 25, 2026 15:57

rebase

35d6a27

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

fix

dcaa925

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

fix

38f56dd

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

ray-gardener Bot added the "data" label (Ray Data-related issues) Feb 27, 2026

reorder code to make it cleaner

a05ca15

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu changed the title from ~~[data] Optimize concat tables further for fast path~~ to [data] Optimize concat tables further for happy path Mar 4, 2026

goutamvenkat-anyscale reviewed Mar 4, 2026


Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py

goutamvenkat-anyscale reviewed Mar 4, 2026


Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py

goutamvenkat-anyscale reviewed Mar 4, 2026


Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated

iamjustinhsu added 2 commits March 5, 2026 10:27

preserve comments

0820316

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

lint

ccc8d66

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

goutamvenkat-anyscale approved these changes Mar 5, 2026


fix docstring

afea96d

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

goutamvenkat-anyscale merged commit 5659909 into ray-project:master Mar 10, 2026
6 checks passed

iamjustinhsu deleted the jhsu/fast-path-pyarrow-concat branch March 10, 2026 00:07

goutamvenkat-anyscale merged 13 commits into ray-project:master from iamjustinhsu:jhsu/fast-path-pyarrow-concat
