[Data] Avoid importing cudf in _is_cudf_dataframe when cudf is not loaded#62302
Conversation
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request optimizes the _is_cudf_dataframe function in python/ray/data/block.py by checking sys.modules before performing a lazy import of cudf. This change prevents the unnecessary loading of CUDA and the associated memory overhead when cudf has not been previously imported. I have no feedback to provide.
There was a problem hiding this comment.
Pull request overview
This PR prevents unintended CUDA/cuDF initialization in Ray Data’s map_batches hot path by avoiding an import cudf unless cuDF has already been imported in the current process.
Changes:
- Add a
sys.modulesguard in_is_cudf_dataframe()to return early whencudfhasn’t been imported yet. - Update
_is_cudf_dataframe()docstring to document the rationale and memory impact.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if "cudf" not in sys.modules: | ||
| return False | ||
| try: | ||
| import cudf | ||
|
|
There was a problem hiding this comment.
Consider adding a unit test to prevent regressions of this optimization: when "cudf" is absent from sys.modules, _is_cudf_dataframe() should return False without attempting to import cudf (e.g., by patching sys.modules and asserting the import hook isn’t invoked for cudf). This is a hot-path check and the memory-impact regression described in the PR would be hard to notice without an explicit test.
…aded (ray-project#62302) ## Description `_is_cudf_dataframe()` is called on every batch in the map_batches hot path (validation + type dispatch). Previously it did try: import cudf unconditionally, which on environments with cudf installed (e.g. the ray-ml BYOD image) loads the full CUDA runtime — adding ~1.5 GiB RSS per worker even when no GPU is used. This adds a `sys.modules` guard so cudf is only imported when it has already been loaded by someone else in the process. If cudf isn't in `sys.modules`, no object can be a `cudf.DataFrame`, so we return False immediately. This eliminates OOM kills on CPU-only benchmarks running on the ray-ml image, where 8 workers × 1.5 GiB of unnecessary cudf overhead was pushing 30 GiB nodes past the 95% memory threshold. ## Related issues Related to the `map_batches_fixed_size_tasks_numpy_once` nightly benchmark OOM failures. ## Additional information The benchmark inherits type: gpu (ray-ml image) from the data test DEFAULTS in `release_data_tests.yaml`, which includes `cudf-cu12` via `dl-gpu-requirements.txt`. The actual cluster uses CPU instances (m5.2xlarge). Every worker was importing cudf through `_validate_batch_output` -> `_is_cudf_dataframe` (line 519 in plan_udf_map_op.py), which runs on every UDF output batch regardless of batch format. --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
## Description The map_batches release benchmark had `RAYTEST_FAIL_ON_WORKER_OOM=0`. After we landed some changes to minimize memory bloat like #62302, the test no longer OOMs, so I'm re-enabling the flag. ## Related issues None Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description The map_batches release benchmark had `RAYTEST_FAIL_ON_WORKER_OOM=0`. After we landed some changes to minimize memory bloat like ray-project#62302, the test no longer OOMs, so I'm re-enabling the flag. ## Related issues None Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
Description
_is_cudf_dataframe()is called on every batch in the map_batches hot path (validation + type dispatch). Previously it did try: import cudf unconditionally, which on environments with cudf installed (e.g. the ray-ml BYOD image) loads the full CUDA runtime — adding ~1.5 GiB RSS per worker even when no GPU is used.This adds a
sys.modulesguard so cudf is only imported when it has already been loaded by someone else in the process. If cudf isn't insys.modules, no object can be acudf.DataFrame, so we return False immediately.This eliminates OOM kills on CPU-only benchmarks running on the ray-ml image, where 8 workers × 1.5 GiB of unnecessary cudf overhead was pushing 30 GiB nodes past the 95% memory threshold.
Related issues
Related to the
map_batches_fixed_size_tasks_numpy_oncenightly benchmark OOM failures.Additional information
The benchmark inherits type: gpu (ray-ml image) from the data test DEFAULTS in
release_data_tests.yaml, which includescudf-cu12viadl-gpu-requirements.txt. The actual cluster uses CPU instances (m5.2xlarge). Every worker was importing cudf through_validate_batch_output->_is_cudf_dataframe(line 519 in plan_udf_map_op.py), which runs on every UDF output batch regardless of batch format.