[Train] Handle Arrow-backed pandas dtypes in LightGBM examples by pseudo-rnd-thoughts · Pull Request #63427 · ray-project/ray

pseudo-rnd-thoughts · 2026-05-18T10:04:15Z

Description

#63017 updated Ray Data's Arrow-to-pandas conversion to preserve Arrow-backed pandas dtypes, such as int64[pyarrow], so dtypes can roundtrip more faithfully.

This exposed an incompatibility with LightGBM's pandas input path. Ray Train's LightGBM examples and legacy trainer code convert Ray Data shards to pandas before constructing lightgbm.Dataset. With Arrow-backed Ray Data inputs, those pandas DataFrames can now contain Arrow-backed dtypes, and LightGBM rejects them during pandas dtype validation even when the logical column type is numeric.

This PR updates the LightGBM paths to normalize pandas DataFrames to NumPy-nullable pandas dtypes before passing them to LightGBM. It also updates documentation and examples to show the same conversion for user-authored LightGBM training loops.

Changes

- Normalize pandas DataFrames in the legacy LightGBMTrainer path before constructing lightgbm.Dataset.
- Update LightGBM docs and docstring examples to call convert_dtypes(dtype_backend="numpy_nullable") after to_pandas().
- Add a regression test covering Arrow-backed Ray Data input from ray.data.from_items(...).
- Keep the restore test focused on trainer restore behavior by using a pandas-backed test dataset.
- Update the LightGBM release benchmark to avoid passing Arrow-backed pandas dtypes to LightGBM.

Signed-off-by: Mark Towers <mark@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces the use of convert_dtypes() across LightGBM trainer implementations, documentation examples, and tests to ensure compatibility with Arrow-backed pandas DataFrames from Ray Data. It also re-enables the test_trainer_restore test. The review feedback recommends simplifying the code by removing redundant materialize() calls and omitting the explicit dtype_backend argument to maintain compatibility with pandas versions prior to 2.0.0. Additionally, the reviewer suggests applying these fixes to the legacy LightGBM trainer and warns about potential performance overhead when using convert_dtypes() on very large datasets.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 17abd91.

goutamvenkat-anyscale · 2026-05-20T07:51:30Z

+    resulting frame to ``lightgbm.Dataset`` must normalize first.
+
+    This helper is a faster alternative to
+    ``df.convert_dtypes(dtype_backend="numpy_nullable")``:


Do we have benchmarks comparing these two functions?

pseudo-rnd-thoughts · May 20, 2026

I tested both options using the lightgbm_train_batch_inference_benchmark_100G release test. The results show almost no difference, so I suspect we don't currently have a good test for checking this.

normalize_pandas_for_lightgbm:
real 4m35.544s, user 0m22.508s, sys 0m4.320s

df.convert_dtypes(dtype_backend="numpy_nullable"):
real 4m35.533s, user 0m22.264s, sys 0m4.343s

TimothySeah

Approving to unblock for now - main question is why not just use df.convert_dtypes(dtype_backend="numpy_nullable")?

TimothySeah · 2026-05-22T19:27:37Z

+    resulting frame to ``lightgbm.Dataset`` must normalize first.
+
+    This helper is a faster alternative to
+    ``df.convert_dtypes(dtype_backend="numpy_nullable")``:


If that's the case, should we either:

1. Just tell users to use df.convert_dtypes(dtype_backend="numpy_nullable") instead of adding a new API? That seems more stable than this API. For example, what if LightGBM starts accepting timestamps in the future?

2. If we still want this API, should we still call out that it's faster than df.convert_dtypes(dtype_backend="numpy_nullable") when the benchmark shows a negligible difference?

TimothySeah · 2026-05-22T19:29:40Z

+        if not isinstance(dtype, pd.ArrowDtype):
+            continue
+        arrow_dtype = dtype.pyarrow_dtype
+        if pa.types.is_signed_integer(arrow_dtype):


nit: does https://arrow.apache.org/docs/python/generated/pyarrow.types.is_integer.html work or do we need to explicitly check both is_signed and is_unsigned?

[Train] Update LightGBM trainer to support PyArrow Dtypes

f50e0b4

Signed-off-by: Mark Towers <mark@anyscale.com>

pseudo-rnd-thoughts requested a review from a team as a code owner May 18, 2026 10:04

pseudo-rnd-thoughts added the train label (Ray Train Related Issue) May 18, 2026

gemini-code-assist Bot reviewed May 18, 2026

Mark Towers added 4 commits May 18, 2026 13:44

code review

e82c7ad

Signed-off-by: Mark Towers <mark@anyscale.com>

Merge branch 'master' into fix-lightgbm-pyarrow-types

d2bd881

code review

995edfb

Signed-off-by: Mark Towers <mark@anyscale.com>

code review

17abd91

Signed-off-by: Mark Towers <mark@anyscale.com>

cursor Bot reviewed May 19, 2026

Comment thread python/ray/train/lightgbm/_lightgbm_utils.py

Mark Towers added 3 commits May 19, 2026 13:45

Merge branch 'master' into fix-lightgbm-pyarrow-types

c7fb0df

Fix test

50f255f

Signed-off-by: Mark Towers <mark@anyscale.com>

Add normalize_pandas_for_lightgbm to predictor

b178132

Signed-off-by: Mark Towers <mark@anyscale.com>

pseudo-rnd-thoughts added the go label (add ONLY when ready to merge, run all tests) May 19, 2026

goutamvenkat-anyscale reviewed May 20, 2026

pseudo-rnd-thoughts and others added 2 commits May 20, 2026 14:25

Merge branch 'master' into fix-lightgbm-pyarrow-types

f3269a9

Add normalize_pandas_for_lightgbm to api.rst

2f24b79

Signed-off-by: Mark Towers <mark@anyscale.com>

goutamvenkat-anyscale approved these changes May 20, 2026

TimothySeah approved these changes May 22, 2026

matthewdeng reviewed May 22, 2026

Comment thread python/ray/train/lightgbm/_lightgbm_utils.py Outdated

Apply suggestion from @matthewdeng

c49a1ea

Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>

matthewdeng merged commit b5e065f into ray-project:master May 22, 2026
6 checks passed

This was referenced Jun 1, 2026

Release test lightgbm_train_batch_inference_benchmark_10G failed anyscale/ray#1675

Closed

Release test lightgbm_train_batch_inference_benchmark_100G failed anyscale/ray#1676

Closed

matthewdeng merged 11 commits into ray-project:master from pseudo-rnd-thoughts:fix-lightgbm-pyarrow-types


4 participants
