[Train] Handle Arrow-backed pandas dtypes in LightGBM examples #63427
Signed-off-by: Mark Towers <mark@anyscale.com>
Code Review
This pull request introduces the use of convert_dtypes() across LightGBM trainer implementations, documentation examples, and tests to ensure compatibility with Arrow-backed pandas DataFrames from Ray Data. It also re-enables the test_trainer_restore test. The review feedback recommends simplifying the code by removing redundant materialize() calls and omitting the explicit dtype_backend argument to maintain compatibility with pandas versions prior to 2.0.0. Additionally, the reviewer suggests applying these fixes to the legacy LightGBM trainer and warns about potential performance overhead when using convert_dtypes() on very large datasets.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 17abd91.
resulting frame to ``lightgbm.Dataset`` must normalize first.

This helper is a faster alternative to
``df.convert_dtypes(dtype_backend="numpy_nullable")``:
Do we have benchmarks comparing these two functions?
I tested both options with the lightgbm_train_batch_inference_benchmark_100G release test and saw almost no difference, so I suspect we don't currently have a good test for checking this.
normalize_pandas_for_lightgbm
real 4m35.544s
user 0m22.508s
sys 0m4.320s
df.convert_dtypes(dtype_backend="numpy_nullable")
real 4m35.533s
user 0m22.264s
sys 0m4.343s
If that's the case, should we either:
- Just tell users to use df.convert_dtypes(dtype_backend="numpy_nullable") instead of adding a new API? That seems more stable than this API, e.g. what if LightGBM starts accepting timestamp in the future?
- If we still want this API, should we still call out that it's faster than df.convert_dtypes(dtype_backend="numpy_nullable") when the benchmark shows a negligible difference?
TimothySeah left a comment
Approving to unblock for now - main question is why not just use df.convert_dtypes(dtype_backend="numpy_nullable")?
if not isinstance(dtype, pd.ArrowDtype):
    continue
arrow_dtype = dtype.pyarrow_dtype
if pa.types.is_signed_integer(arrow_dtype):
nit: does pyarrow.types.is_integer (https://arrow.apache.org/docs/python/generated/pyarrow.types.is_integer.html) work here, or do we need to explicitly check both is_signed_integer and is_unsigned_integer?
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>

Description
#63017 updated Ray Data's Arrow-to-pandas conversion to preserve Arrow-backed pandas dtypes, such as int64[pyarrow], so dtypes can roundtrip more faithfully.
This exposed an incompatibility with LightGBM's pandas input path. Ray Train's LightGBM examples and legacy trainer code convert Ray Data shards to pandas before constructing lightgbm.Dataset. With Arrow-backed Ray Data inputs, those pandas DataFrames can now contain Arrow-backed dtypes, and LightGBM rejects them during pandas dtype validation even when the logical column type is numeric.
This PR updates the LightGBM paths to normalize pandas DataFrames to NumPy-nullable pandas dtypes before passing them to LightGBM. It also updates documentation and examples to show the same conversion for user-authored LightGBM training loops.
Changes
- LightGBMTrainer path before constructing lightgbm.Dataset.
- convert_dtypes(dtype_backend="numpy_nullable") after to_pandas().
- ray.data.from_items(...).