Skip to content

[Core] Consider cgroup limit when fetching cpu#63685

Merged
edoakes merged 12 commits into
ray-project:masterfrom
Kunchd:get_thread_consider_cgroup
Jun 3, 2026
Merged

[Core] Consider cgroup limit when fetching cpu#63685
edoakes merged 12 commits into
ray-project:masterfrom
Kunchd:get_thread_consider_cgroup

Conversation

@Kunchd

@Kunchd Kunchd commented May 28, 2026

Copy link
Copy Markdown
Contributor

Description

Ray determines the number of background threads available for handling grpc requests via checking the amount of cpu available on the hardware and selecting a proportional amount (e.g. std::thread::hardware_concurrency() / 4U on the gcs). However, within containers, it's possible that the amount of cores available to a specific ray node is significantly lower than the actual hardware amount available due to cgroup constraints. If we continue to only rely on the hardware core count to determine the number of threads, we may end up over-allocating background threads.

This PR introduces logic to consider the cgroup cpu limit when determining the number of cpus actually available to a specific ray node, preventing the overallocation of background threads when the ray node has less cpu available compared to the actual number of cores on the host.

Related issues

Closes #63476

Additional information

Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from a team as a code owner May 28, 2026 00:42
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label May 28, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces CpuMonitorUtils to dynamically query the effective CPU limit of a cgroup (supporting both cgroup v1 and v2) instead of relying solely on std::thread::hardware_concurrency(). This utility is integrated into ray_config_def.h to dynamically size thread pools based on the cgroup's CPU limits. The feedback highlights a critical issue where std::filesystem::exists can throw exceptions and crash the process in restricted environments, suggesting the use of the noexcept overload. Additionally, it recommends passing root_cgroup_path by const std::string & to avoid unnecessary copies and explicitly including ray/util/logging.h to prevent fragile transitive dependency issues.

Comment thread src/ray/common/monitors/cpu_monitor_utils.cc Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.cc Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.h Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.cc
@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label May 28, 2026
protected:
void SetUp() override {
StatusOr<std::unique_ptr<TempDirectory>> root_or = TempDirectory::Create();
ASSERT_TRUE(root_or.ok()) << root_or.ToString();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
ASSERT_TRUE(root_or.ok()) << root_or.ToString();
RAY_ASSIGN_OF_CHECK(root_, TempDirectory::Create());

so we could assign and throw exception with detailed error in one line.
Same below, should be easy work for agent.

Comment thread src/ray/common/monitors/tests/cpu_monitor_utils_test.cc Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.h Outdated
Comment thread src/ray/common/ray_config_def.h Outdated
Comment thread src/ray/common/ray_config_def.h Outdated
davik and others added 2 commits May 28, 2026 22:00
Comment thread src/ray/common/monitors/cpu_monitor_utils.cc
davik added 4 commits May 28, 2026 22:45
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from dentiny May 29, 2026 04:02

@dentiny dentiny left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread src/ray/common/monitors/cpu_monitor_utils.cc Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.cc Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.cc Outdated
Comment thread src/ray/common/monitors/cpu_monitor_utils.h
* either file cannot be read, or StatusT::Invalid if a file contains
* invalid values.
*/
static CpuCountOr GetCpuCountV1(const std::string &cfs_quota_path,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's necessary to expose GetCpuCountV1 and GetCpuCountV2 as public interface

  • It's not use anywhere outside of the class
  • From ray user and developer's perspective, having a unified API which handles V1 and V2 seems better

Actually I'm thinking if it's worthy to provide a util function which gets CPU limit with default root path (GetCpuLimit with no argument)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions should already be private. I will make the default /sys/fs/cgroup path the default argument to the common function

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally it's preferred to put them in the anonymous namespace, for example, symbols are visible outside of the translation unit.

Comment thread src/ray/common/ray_config_def.h Outdated
Signed-off-by: davik <davik@anyscale.com>
Comment thread src/ray/common/ray_config_def.h Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 6912a2f. Configure here.

Comment thread src/ray/common/monitors/cpu_monitor_utils.cc Outdated
@Kunchd Kunchd requested a review from dentiny June 1, 2026 21:30
@Kunchd

Kunchd commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

@dentiny Just wanted to double check, does the current changes address the issue you were experiencing in #63476?

@dentiny dentiny left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Comment thread src/ray/common/monitors/cpu_monitor_utils.cc

@rueian rueian left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@edoakes edoakes merged commit daffcdc into ray-project:master Jun 3, 2026
6 checks passed
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
## Description
Ray determines the number of background threads available for handling
grpc requests via checking the amount of cpu available on the hardware
and selecting a proportional amount (e.g.
`std::thread::hardware_concurrency() / 4U` on the gcs). However, within
containers, it's possible that the amount of cores available to a
specific ray node is significantly lower than the actual hardware amount
available due to cgroup constraints. If we continue to only rely on the
hardware core count to determine the number of threads, we may end up
over-allocating background threads.

This PR introduces logic to consider the cgroup cpu limit when
determining the number of cpus actually available to a specific ray
node, preventing the overallocation of background threads when the ray
node has less cpu available compared to the actual number of cores on
the host.

## Related issues
Closes ray-project#63476

## Additional information

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
## Description
Ray determines the number of background threads available for handling
grpc requests via checking the amount of cpu available on the hardware
and selecting a proportional amount (e.g.
`std::thread::hardware_concurrency() / 4U` on the gcs). However, within
containers, it's possible that the amount of cores available to a
specific ray node is significantly lower than the actual hardware amount
available due to cgroup constraints. If we continue to only rely on the
hardware core count to determine the number of threads, we may end up
over-allocating background threads.

This PR introduces logic to consider the cgroup cpu limit when
determining the number of cpus actually available to a specific ray
node, preventing the overallocation of background threads when the ray
node has less cpu available compared to the actual number of cores on
the host.

## Related issues
Closes ray-project#63476

## Additional information

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
@testinfected

Copy link
Copy Markdown

Hey guys, thank you for this. Do you think it fixes this #48700?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

5 participants