[data] Include column name and target type in ArrowConversionError #62407

Merged
goutamvenkat-anyscale merged 4 commits into ray-project:master from goutamvenkat-anyscale:goutam/improve-arrow-conversion-error-message on Apr 7, 2026

@goutamvenkat-anyscale goutamvenkat-anyscale commented Apr 7, 2026

Why are these changes needed?

ArrowConversionError previously only showed the data that failed to convert, making it hard to identify which column caused the issue. For example, the raw Arrow error looks like:

File "pyarrow/array.pxi", line 405, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 375, in pyarrow.lib.array
File "pyarrow/array.pxi", line 46, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'numpy.float32' object

This gives no indication of which column or what type conversion was attempted. With this change, the wrapped error now includes the column name and inferred target type:

Before:

Error converting data to Arrow: [b'hello', 2.0]

After:

Error converting column 'my_column' (target type: binary) to Arrow: [b'hello', 2.0]
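
For illustration, the new message format could be assembled roughly as follows. This is a minimal pure-Python sketch based on the diff fragments quoted in this PR, not the actual Ray implementation: the class name `ArrowConversionErrorSketch`, the constructor signature, and the `MAX_DATA_STR_LEN` value of 200 are assumptions.

```python
from typing import Any, Optional


class ArrowConversionErrorSketch(Exception):
    """Hypothetical stand-in for Ray's ArrowConversionError."""

    # Assumed cap; the real constant's value is not shown in this PR.
    MAX_DATA_STR_LEN = 200

    def __init__(
        self,
        data_str: str,
        column_name: Optional[str] = None,
        pa_type: Optional[Any] = None,
    ):
        # Truncate long payloads so the message stays readable.
        if len(data_str) > self.MAX_DATA_STR_LEN:
            data_str = data_str[: self.MAX_DATA_STR_LEN] + "..."

        if column_name is None:
            # Old-style message: no column context available.
            message = f"Error converting data to Arrow: {data_str}"
        else:
            # Only mention the target type when inference produced one.
            type_info = f" (target type: {pa_type})" if pa_type is not None else ""
            message = (
                f"Error converting column '{column_name}'{type_info}"
                f" to Arrow: {data_str}"
            )
        super().__init__(message)


before = ArrowConversionErrorSketch(str([b"hello", 2.0]))
after = ArrowConversionErrorSketch(
    str([b"hello", 2.0]), column_name="my_column", pa_type="binary"
)
print(before)
print(after)
```

The two printed lines correspond to the "Before" and "After" messages above.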

Repro script

import ray
import numpy as np

ray.init()

ds = ray.data.range(2)

def mix_types(batch):
    return {"my_column": [b"hello", np.float32(2.0)]}

ds.map_batches(mix_types, batch_size=2).write_parquet("/tmp/out")

Related issue number

N/A

Checks

  • [x] I've signed off every commit.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've added any new APIs to the API Reference.
  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
…ssage

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner April 7, 2026 21:09
@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Apr 7, 2026

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 93cb773.

Comment thread python/ray/data/_internal/tensor_extensions/arrow.py
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Apr 7, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the ArrowConversionError class to include optional column name and target type information, providing more context in error messages. The review feedback identifies a critical issue where a NameError could occur in _convert_to_pyarrow_native_array if an exception is raised before pa_type is defined. Additionally, there is a suggestion to refactor the error message construction logic for improved readability.
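
The NameError risk mentioned above can be illustrated with a small standalone sketch. The helper names (`infer_type`, `convert_column`) and the use of `RuntimeError` are hypothetical and only mimic the control flow; the point is that binding `pa_type = None` before the `try` keeps the `except` handler safe when inference itself raises.

```python
from typing import Optional


def infer_type(values: list) -> str:
    """Hypothetical type inference that rejects empty input."""
    if not values:
        raise ValueError("cannot infer a type from no values")
    return type(values[0]).__name__


def convert_column(column_name: str, values: list) -> list:
    # Bind pa_type up front so the handler can always reference it,
    # even if infer_type() raises before assigning a real value.
    pa_type: Optional[str] = None
    try:
        pa_type = infer_type(values)
        # Stand-in for the real conversion: require a uniform type.
        if any(type(v).__name__ != pa_type for v in values):
            raise TypeError(f"Expected {pa_type}")
        return values
    except Exception as e:
        type_info = f" (target type: {pa_type})" if pa_type is not None else ""
        raise RuntimeError(
            f"Error converting column '{column_name}'{type_info} to Arrow: {values}"
        ) from e
```

Without the initial `pa_type = None`, a failure inside `infer_type` would surface as a `NameError` from the handler instead of the intended wrapped error.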

Comment thread python/ray/data/_internal/tensor_extensions/arrow.py
Comment on lines +249 to +256
        if column_name is not None:
            type_info = f" (target type: {pa_type})" if pa_type is not None else ""
            message = (
                f"Error converting column '{column_name}'{type_info}"
                f" to Arrow: {data_str}"
            )
        else:
            message = f"Error converting data to Arrow: {data_str}"

medium

The message construction logic can be made more concise and arguably more readable by handling the column_name is None case first, and combining the f-string for the other case.

        if column_name is None:
            message = f"Error converting data to Arrow: {data_str}"
        else:
            type_info = f" (target type: {pa_type})" if pa_type is not None else ""
            message = f"Error converting column '{column_name}'{type_info} to Arrow: {data_str}"

            data_str = data_str[: self.MAX_DATA_STR_LEN] + "..."
        message = f"Error converting data to Arrow: {data_str}"
        if column_name is not None:
            type_info = f" (target type: {pa_type})" if pa_type is not None else ""

when can column_name be present but pa_type None?

also, what are the implications if you did this instead?

type_info = f" (target type: {pa_type})"

regardless of pa_type check?

Contributor Author

pa_type can be None if it fails before _infer_pyarrow_type.
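
To make the trade-off in this thread concrete: formatting `type_info` unconditionally would not raise, since an f-string can render `None`, but the message would read `(target type: None)` whenever the failure happened before type inference. A small standalone sketch (the function names are made up for illustration):

```python
from typing import Optional


def type_info_unconditional(pa_type: Optional[str]) -> str:
    # Always emits the suffix, even when no type was inferred.
    return f" (target type: {pa_type})"


def type_info_guarded(pa_type: Optional[str]) -> str:
    # Emits the suffix only when a type is actually known.
    return f" (target type: {pa_type})" if pa_type is not None else ""


print(repr(type_info_unconditional(None)))
print(repr(type_info_guarded(None)))
print(repr(type_info_guarded("binary")))
```

The guarded variant drops the suffix entirely when the type is unknown, which avoids a misleading "None" in the error text.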

@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) April 7, 2026 21:34
Signed-off-by: Goutam <goutam@anyscale.com>
@github-actions github-actions Bot disabled auto-merge April 7, 2026 22:38
@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) April 7, 2026 22:38
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 6f6aa90 into ray-project:master Apr 7, 2026
7 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/improve-arrow-conversion-error-message branch April 21, 2026 17:37
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ay-project#62407)

