[Core] Consider cgroup limit when fetching cpu#63685
Conversation
Signed-off-by: davik <davik@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces CpuMonitorUtils to dynamically query the effective CPU limit of a cgroup (supporting both cgroup v1 and v2) instead of relying solely on std::thread::hardware_concurrency(). This utility is integrated into ray_config_def.h to dynamically size thread pools based on the cgroup's CPU limits. The feedback highlights a critical issue where std::filesystem::exists can throw exceptions and crash the process in restricted environments, suggesting the use of the noexcept overload. Additionally, it recommends passing root_cgroup_path by const std::string & to avoid unnecessary copies and explicitly including ray/util/logging.h to prevent fragile transitive dependency issues.
| protected: | ||
| void SetUp() override { | ||
| StatusOr<std::unique_ptr<TempDirectory>> root_or = TempDirectory::Create(); | ||
| ASSERT_TRUE(root_or.ok()) << root_or.ToString(); |
There was a problem hiding this comment.
nit
| ASSERT_TRUE(root_or.ok()) << root_or.ToString(); | |
| RAY_ASSIGN_OF_CHECK(root_, TempDirectory::Create()); |
so we could assign and throw exception with detailed error in one line.
Same below, should be easy work for agent.
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
| * either file cannot be read, or StatusT::Invalid if a file contains | ||
| * invalid values. | ||
| */ | ||
| static CpuCountOr GetCpuCountV1(const std::string &cfs_quota_path, |
There was a problem hiding this comment.
I'm not sure if it's necessary to expose GetCpuCountV1 and GetCpuCountV2 as public interface
- It's not use anywhere outside of the class
- From ray user and developer's perspective, having a unified API which handles V1 and V2 seems better
Actually I'm thinking if it's worthy to provide a util function which gets CPU limit with default root path (GetCpuLimit with no argument)
There was a problem hiding this comment.
These two functions should already be private. I will make the default /sys/fs/cgroup path the default argument to the common function
There was a problem hiding this comment.
generally it's preferred to put them in the anonymous namespace, for example, symbols are visible outside of the translation unit.
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6912a2f. Configure here.
Signed-off-by: davik <davik@anyscale.com>
## Description Ray determines the number of background threads available for handling grpc requests via checking the amount of cpu available on the hardware and selecting a proportional amount (e.g. `std::thread::hardware_concurrency() / 4U` on the gcs). However, within containers, it's possible that the amount of cores available to a specific ray node is significantly lower than the actual hardware amount available due to cgroup constraints. If we continue to only rely on the hardware core count to determine the number of threads, we may end up over-allocating background threads. This PR introduces logic to consider the cgroup cpu limit when determining the number of cpus actually available to a specific ray node, preventing the overallocation of background threads when the ray node has less cpu available compared to the actual number of cores on the host. ## Related issues Closes ray-project#63476 ## Additional information --------- Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com>
## Description Ray determines the number of background threads available for handling grpc requests via checking the amount of cpu available on the hardware and selecting a proportional amount (e.g. `std::thread::hardware_concurrency() / 4U` on the gcs). However, within containers, it's possible that the amount of cores available to a specific ray node is significantly lower than the actual hardware amount available due to cgroup constraints. If we continue to only rely on the hardware core count to determine the number of threads, we may end up over-allocating background threads. This PR introduces logic to consider the cgroup cpu limit when determining the number of cpus actually available to a specific ray node, preventing the overallocation of background threads when the ray node has less cpu available compared to the actual number of cores on the host. ## Related issues Closes ray-project#63476 ## Additional information --------- Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com>
|
Hey guys, thank you for this. Do you think it fixes this #48700? |

Description
Ray determines the number of background threads available for handling grpc requests via checking the amount of cpu available on the hardware and selecting a proportional amount (e.g.
std::thread::hardware_concurrency() / 4Uon the gcs). However, within containers, it's possible that the amount of cores available to a specific ray node is significantly lower than the actual hardware amount available due to cgroup constraints. If we continue to only rely on the hardware core count to determine the number of threads, we may end up over-allocating background threads.This PR introduces logic to consider the cgroup cpu limit when determining the number of cpus actually available to a specific ray node, preventing the overallocation of background threads when the ray node has less cpu available compared to the actual number of cores on the host.
Related issues
Closes #63476
Additional information