[Data] Avoid importing cudf in _is_cudf_dataframe when cudf is not loaded#62302

Merged: bveeramani merged 8 commits into ray-project:master from rayhhome:cudf-optional-import on Apr 6, 2026
Conversation

@rayhhome rayhhome commented Apr 2, 2026

Description

`_is_cudf_dataframe()` is called on every batch in the `map_batches` hot path (validation + type dispatch). Previously it ran `try: import cudf` unconditionally, which in environments with cudf installed (e.g. the ray-ml BYOD image) loads the full CUDA runtime, adding ~1.5 GiB RSS per worker even when no GPU is used.

This adds a `sys.modules` guard so cudf is only imported when something else in the process has already loaded it. If cudf isn't in `sys.modules`, no object can be a `cudf.DataFrame`, so we return `False` immediately.
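A minimal sketch of the guard pattern described above. The function name `is_cudf_dataframe_sketch` is a hypothetical stand-in, not the actual code in `python/ray/data/block.py`; only the `sys.modules` check and the `try: import cudf` fallback come from this PR.

```python
import sys
from typing import Any


def is_cudf_dataframe_sketch(obj: Any) -> bool:
    """Hypothetical stand-in for _is_cudf_dataframe (not the Ray source)."""
    # If nothing in this process has imported cudf, obj cannot be a
    # cudf.DataFrame, so skip the import (and the CUDA runtime load).
    if "cudf" not in sys.modules:
        return False
    try:
        # cudf is already loaded, so this resolves from the module cache.
        import cudf
    except ImportError:
        return False
    return isinstance(obj, cudf.DataFrame)
```

The check is a single dict lookup, so it stays cheap on the per-batch hot path.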

This eliminates OOM kills on CPU-only benchmarks running on the ray-ml image, where 8 workers × 1.5 GiB of unnecessary cudf overhead was pushing 30 GiB nodes past the 95% memory threshold.
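For reference, the arithmetic behind those figures, as a small sketch. The variable names are illustrative, and the headroom figure is simply derived from the numbers quoted above rather than measured.

```python
# Figures quoted in the PR description.
NUM_WORKERS = 8
CUDF_RSS_GIB_PER_WORKER = 1.5
NODE_MEMORY_GIB = 30
MEMORY_THRESHOLD = 0.95

# Memory spent on cudf imports that no worker actually needed.
cudf_overhead_gib = NUM_WORKERS * CUDF_RSS_GIB_PER_WORKER

# Usage level at which the node crosses the memory threshold.
oom_limit_gib = NODE_MEMORY_GIB * MEMORY_THRESHOLD

# What is left under the threshold for everything else on the node.
headroom_gib = oom_limit_gib - cudf_overhead_gib

print(f"cudf overhead: {cudf_overhead_gib:.1f} GiB")
print(f"threshold:     {oom_limit_gib:.1f} GiB")
print(f"headroom:      {headroom_gib:.1f} GiB")
```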

Related issues

Related to the `map_batches_fixed_size_tasks_numpy_once` nightly benchmark OOM failures.

Additional information

The benchmark inherits `type: gpu` (ray-ml image) from the data test DEFAULTS in `release_data_tests.yaml`, which includes `cudf-cu12` via `dl-gpu-requirements.txt`. The actual cluster uses CPU instances (m5.2xlarge). Every worker was importing cudf through `_validate_batch_output` -> `_is_cudf_dataframe` (line 519 in `plan_udf_map_op.py`), which runs on every UDF output batch regardless of batch format.

rayhhome added 2 commits April 2, 2026 09:15
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@rayhhome rayhhome self-assigned this Apr 2, 2026
@rayhhome rayhhome added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Apr 2, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the _is_cudf_dataframe function in python/ray/data/block.py by checking sys.modules before performing a lazy import of cudf. This change prevents the unnecessary loading of CUDA and the associated memory overhead when cudf has not been previously imported. I have no feedback to provide.

@rayhhome rayhhome marked this pull request as ready for review April 2, 2026 23:39
@rayhhome rayhhome requested a review from a team as a code owner April 2, 2026 23:39
Copilot AI review requested due to automatic review settings April 2, 2026 23:39

Copilot AI left a comment

Pull request overview

This PR prevents unintended CUDA/cuDF initialization in Ray Data’s map_batches hot path by avoiding an import cudf unless cuDF has already been imported in the current process.

Changes:

  • Add a sys.modules guard in _is_cudf_dataframe() to return early when cudf hasn’t been imported yet.
  • Update _is_cudf_dataframe() docstring to document the rationale and memory impact.

Comment thread python/ray/data/block.py
Comment on lines +139 to 143
    if "cudf" not in sys.modules:
        return False
    try:
        import cudf

Copilot AI Apr 2, 2026

Consider adding a unit test to prevent regressions of this optimization: when "cudf" is absent from sys.modules, _is_cudf_dataframe() should return False without attempting to import cudf (e.g., by patching sys.modules and asserting the import hook isn’t invoked for cudf). This is a hot-path check and the memory-impact regression described in the PR would be hard to notice without an explicit test.
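One way such a regression test could look, as a rough sketch. It assumes a local stand-in `is_cudf_dataframe_sketch` with the same guard; in the Ray repo the test would exercise `_is_cudf_dataframe` from `python/ray/data/block.py` instead. The test name and the `tracking_import` helper are hypothetical.

```python
import builtins
import sys
from unittest import mock


def is_cudf_dataframe_sketch(obj):
    """Hypothetical stand-in with the same guard as the PR."""
    if "cudf" not in sys.modules:
        return False
    try:
        import cudf
    except ImportError:
        return False
    return isinstance(obj, cudf.DataFrame)


def test_no_import_when_cudf_not_loaded():
    real_import = builtins.__import__
    attempted = []

    def tracking_import(name, *args, **kwargs):
        # Record any attempt to import cudf or one of its submodules.
        if name == "cudf" or name.startswith("cudf."):
            attempted.append(name)
        return real_import(name, *args, **kwargs)

    # patch.dict restores sys.modules to its original contents on exit.
    with mock.patch.dict(sys.modules):
        sys.modules.pop("cudf", None)
        with mock.patch.object(builtins, "__import__", tracking_import):
            assert is_cudf_dataframe_sketch(object()) is False

    # The guard must short-circuit before any import statement runs.
    assert attempted == []
```

Patching `builtins.__import__` catches the `import cudf` statement whether or not cudf is installed, so the test behaves the same on CPU-only and GPU images.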

@bveeramani bveeramani enabled auto-merge (squash) April 3, 2026 23:12
@github-actions github-actions Bot disabled auto-merge April 3, 2026 23:55
@bveeramani bveeramani enabled auto-merge (squash) April 6, 2026 16:38
@bveeramani bveeramani merged commit 5300731 into ray-project:master Apr 6, 2026
7 checks passed
@rayhhome rayhhome deleted the cudf-optional-import branch April 7, 2026 18:42
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…aded (ray-project#62302)

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
bveeramani added a commit that referenced this pull request May 19, 2026
## Description

The map_batches release benchmark had `RAYTEST_FAIL_ON_WORKER_OOM=0`.
After we landed some changes to minimize memory bloat like
#62302, the test no longer OOMs,
so I'm re-enabling the flag.

## Related issues

None

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

4 participants