Skip to content

[data] Optimize concat tables further for happy path#61315

Merged
goutamvenkat-anyscale merged 13 commits into
ray-project:masterfrom
iamjustinhsu:jhsu/fast-path-pyarrow-concat
Mar 10, 2026
Merged

[data] Optimize concat tables further for happy path#61315
goutamvenkat-anyscale merged 13 commits into
ray-project:masterfrom
iamjustinhsu:jhsu/fast-path-pyarrow-concat

Conversation

@iamjustinhsu

@iamjustinhsu iamjustinhsu commented Feb 25, 2026

Copy link
Copy Markdown
Contributor

Description

Currently, we concat tables together in every map_task. In the worst case, blocks have different schemas, so their schemas must be unified (and hence the block's columns too). However, most cases will encounter the happy path, where all blocks have the same column type.

Main change

This PR creates a happy path optimization to use the built-in pa.concat_tables when all blocks share the same schema.

  • if all table's type are the same for one column, we use pa.concat_tables
  • otherwise, we concat the tables that do share the same unified schema type. Call that result A (fast). Then we concat all different tables manually (slow). Call that result B. That we concat A and B together

Other changes

  • Allows extension types (Python objects, tensors) to be concat'ed together via fast path pa.concat_tables so long as the all the extension types are the same. Essentially, promote_types=permissive does not work for tensor or object extensions, so we can only do that if the column types across the blocks are equal
  • Adds more docstring

Benchmarks

for 3000 individual tables with the same tensor schema, it is about 1.5x slower than pa.concat_tables, which is faster from like 25x slower in my original testing

script: https://gist.github.com/iamjustinhsu/2f35c7802101876598ef51325f31f772

Related issues

None

Additional information

None

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner February 25, 2026 18:40

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant optimization to table concatenation by adding a fast path for blocks with identical schemas, which is a great improvement. The refactoring to allow extension types to use this fast path when their types are uniform across blocks is also a solid enhancement. I appreciate the improved docstrings and examples, which increase code clarity. I've left a couple of minor suggestions regarding type hints to align them with the new, more memory-efficient generator-based calls. Overall, these are excellent changes that improve both performance and maintainability.

Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
@iamjustinhsu

Copy link
Copy Markdown
Contributor Author

bugbot run

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Feb 25, 2026
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py
@iamjustinhsu iamjustinhsu force-pushed the jhsu/fast-path-pyarrow-concat branch from 1f8f2ba to 09ce535 Compare February 25, 2026 23:16
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/fast-path-pyarrow-concat branch from 09ce535 to 1b63642 Compare February 25, 2026 23:50
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Feb 27, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu changed the title [data] Optimize concat tables further for fast path Mar 4, 2026
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 5659909 into ray-project:master Mar 10, 2026
6 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/fast-path-pyarrow-concat branch March 10, 2026 00:07
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
## Description
Currently, we concat tables together in every map_task. In the worst
case, blocks have different schemas, so their schemas must be unified
(and hence the block's columns too). However, most cases will encounter
the happy path, where all blocks have the same column type.

### Main change
This PR creates a happy path optimization to use the built-in
`pa.concat_tables` when all blocks share the same schema.

- if all table's type are the same for one column, we use
`pa.concat_tables`
- otherwise, we concat the tables that do share the same unified schema
type. Call that result A (fast). Then we concat all different tables
manually (slow). Call that result B. That we concat A and B together

### Other changes
- Allows extension types (Python objects, tensors) to be concat'ed
together via fast path `pa.concat_tables` so long as the all the
extension types are the same. Essentially, `promote_types=permissive`
does not work for tensor or object extensions, so we can only do that if
the column types across the blocks are equal
- Adds more docstring

### Benchmarks
for 3000 individual tables with the same tensor schema, it is about 1.5x
slower than `pa.concat_tables`, which is faster from like 25x slower in
my original testing

script:
https://gist.github.com/iamjustinhsu/2f35c7802101876598ef51325f31f772

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
abrarsheikh pushed a commit that referenced this pull request Mar 11, 2026
## Description
Currently, we concat tables together in every map_task. In the worst
case, blocks have different schemas, so their schemas must be unified
(and hence the block's columns too). However, most cases will encounter
the happy path, where all blocks have the same column type.

### Main change
This PR creates a happy path optimization to use the built-in
`pa.concat_tables` when all blocks share the same schema.

- if all table's type are the same for one column, we use
`pa.concat_tables`
- otherwise, we concat the tables that do share the same unified schema
type. Call that result A (fast). Then we concat all different tables
manually (slow). Call that result B. That we concat A and B together

### Other changes
- Allows extension types (Python objects, tensors) to be concat'ed
together via fast path `pa.concat_tables` so long as the all the
extension types are the same. Essentially, `promote_types=permissive`
does not work for tensor or object extensions, so we can only do that if
the column types across the blocks are equal
- Adds more docstring 

### Benchmarks
for 3000 individual tables with the same tensor schema, it is about 1.5x
slower than `pa.concat_tables`, which is faster from like 25x slower in
my original testing

script:
https://gist.github.com/iamjustinhsu/2f35c7802101876598ef51325f31f772


## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants