[data] Optimize concat tables further for happy path #61315
Merged
goutamvenkat-anyscale merged 13 commits on Mar 10, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Contributor
Code Review
This pull request introduces a significant optimization to table concatenation by adding a fast path for blocks with identical schemas, which is a great improvement. The refactoring to allow extension types to use this fast path when their types are uniform across blocks is also a solid enhancement. I appreciate the improved docstrings and examples, which increase code clarity. I've left a couple of minor suggestions regarding type hints to align them with the new, more memory-efficient generator-based calls. Overall, these are excellent changes that improve both performance and maintainability.
Contributor
Author
bugbot run
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
1f8f2ba to 09ce535
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
09ce535 to 1b63642
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
goutamvenkat-anyscale approved these changes on Mar 5, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request on Mar 10, 2026

## Description

Currently, we concat tables together in every map_task. In the worst case, blocks have different schemas, so their schemas must be unified (and hence the blocks' columns too). However, most cases will hit the happy path, where all blocks have the same column types.

### Main change

This PR adds a happy-path optimization that uses the built-in `pa.concat_tables` when all blocks share the same schema.

- If all tables have the same type for a column, we use `pa.concat_tables`.
- Otherwise, we concat the tables that do share the unified schema type. Call that result A (fast). Then we concat all the differing tables manually (slow). Call that result B. Then we concat A and B together.

### Other changes

- Allows extension types (Python objects, tensors) to be concat'ed together via the fast path `pa.concat_tables`, so long as all the extension types are the same. Essentially, `promote_types=permissive` does not work for tensor or object extensions, so we can only do that if the column types across the blocks are equal.
- Adds more docstrings.

### Benchmarks

For 3000 individual tables with the same tensor schema, it is about 1.5x slower than `pa.concat_tables`, down from about 25x slower in my original testing (script: https://gist.github.com/iamjustinhsu/2f35c7802101876598ef51325f31f772).

## Related issues

None

## Additional information

None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
abrarsheikh pushed a commit that referenced this pull request on Mar 11, 2026
Description
Currently, we concat tables together in every map_task. In the worst case, blocks have different schemas, so their schemas must be unified (and hence the blocks' columns too). However, most cases will hit the happy path, where all blocks have the same column types.
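Main change
This PR adds a happy-path optimization that uses the built-in `pa.concat_tables` when all blocks share the same schema.
- If all tables have the same type for a column, we use `pa.concat_tables`.
- Otherwise, we concat the tables that do share the unified schema type. Call that result A (fast). Then we concat all the differing tables manually (slow). Call that result B. Then we concat A and B together.
Other changes
- Allows extension types (Python objects, tensors) to be concat'ed together via the fast path `pa.concat_tables`, so long as all the extension types are the same. Essentially, `promote_types=permissive` does not work for tensor or object extensions, so we can only do that if the column types across the blocks are equal.
- Adds more docstrings.
Benchmarks
For 3000 individual tables with the same tensor schema, it is about 1.5x slower than `pa.concat_tables`, down from about 25x slower in my original testing (script: https://gist.github.com/iamjustinhsu/2f35c7802101876598ef51325f31f772).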
Related issues
None
Additional information
None