[Core] (Resource Isolation 10/n) Add event based memory monitor #62060
Signed-off-by: davik <davik@anyscale.com>
if (ret < 0) {
  if (errno == EINTR) {
    continue;
Here we continue upon receiving EINTR because the event monitor is the main memory monitor and should not be interrupted. In the case where EINTR signals process termination or kill, the parent node manager will signal the event monitor to tear down instead.
Code Review
This pull request introduces a new EventMemoryMonitor component designed to detect high memory pressure using cgroupv2's memory.events file and inotify. The monitor operates in a dedicated thread, triggering a KillWorkersCallback when the high memory event count increases. The changes include the implementation, header, build system integration, and comprehensive unit tests for this new monitor. Additionally, the KillWorkersCallback signature was updated across various related components to pass SystemMemorySnapshot by value, improving move semantics. A review comment highlighted critical robustness issues in the EventMemoryMonitor's parsing logic for memory.events, including imprecise string matching, potential integer overflow, and a lack of exception handling during string-to-integer conversion.
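The parsing concerns raised in the review (exact key matching, overflow, and conversion failures) can be illustrated with a minimal sketch. It assumes `memory.events` holds one `key value` pair per line, as described in the kernel cgroup v2 documentation. The function name `ParseHighEventCount` is hypothetical and is not taken from the PR.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>

// Hypothetical helper, not the PR's implementation.
// Returns the value of the "high" entry in memory.events contents, or
// std::nullopt if the entry is missing, malformed, or does not fit in uint64_t.
std::optional<uint64_t> ParseHighEventCount(const std::string &contents) {
  std::istringstream stream(contents);
  std::string line;
  while (std::getline(stream, line)) {
    std::istringstream fields(line);
    std::string key, value, extra;
    // Expect exactly two whitespace-separated fields per line.
    if (!(fields >> key >> value) || (fields >> extra)) {
      continue;
    }
    // Exact match, so a key such as "high_extra" is never mistaken for "high".
    if (key != "high") {
      continue;
    }
    uint64_t count = 0;
    const char *begin = value.data();
    const char *end = value.data() + value.size();
    // std::from_chars reports errors through a code instead of throwing, and
    // flags out-of-range input rather than silently wrapping around.
    auto [ptr, ec] = std::from_chars(begin, end, count);
    if (ec != std::errc() || ptr != end) {
      return std::nullopt;
    }
    return count;
  }
  return std::nullopt;
}
```

A caller could keep the last observed count and invoke the kill callback only when a newly parsed count is larger, treating `std::nullopt` as "no update".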
      << "Callback should not be called when irrelevant event value changes";
}

TEST_F(EventMemoryMonitorTest, TestMultipleCallbacksOnMultipleChanges) {
nit: Is there a point in using multiple latches here instead of just one? If not would prob just combine it with TestCallbackCalledWhenHighEventChanges and just have the latch count start at 3
The goal here is to test that the monitor can be triggered repeatedly. A latch wait only unblocks when the count reaches 0, so if the latch were set to 3, we would be unable to decrement the events file multiple times.
// only return on next new event.
DrainResult drain_result = DrainInotifyBuffer(inotify_fd_);
if (drain_result == DrainResult::kInterrupted) {
  // Re-enter poll loop if interrupt was signaled in case a terminate
This comment doesn't really make sense to me, didn't you mention above that in the case of a terminate signal the node manager handles it? Isn't this just because the memory monitor shouldn't be interrupted?
I've updated the comment to clarify this. Basically, if an interrupt is signaled, one of three cases we care about can happen:
- The signal is SIGKILL. In this case, the thread will be killed and we don't need any special handling.
- The signal is SIGTERM. In this case, we would like the thread to shut down gracefully via the destructor. Thus, we return to the polling loop in anticipation of the shutdown event fd firing.
- For other signals, we do not want the monitor to be interrupted.
Sparks0219
left a comment
LGTM, just super minor nits that can be done if you are in an agreeable mood
@edoakes Could you help me merge this?
MengjinYan
left a comment
Trusting Josh's review.
…ry to create new memory monitoring system (#62705)

## Description

This PR creates the wiring needed to support a memory monitoring system with multiple memory monitors and sets up the cgroup constraints needed for the resource isolation design described below. Specifically, this PR creates the `multi_monitor_factory` responsible for creating the correct combination of memory monitors depending on user configuration. Additionally, it modifies the `cgroup_manager` to set up the `memory.high` constraint needed for providing resource isolation. For more detail on the expected configuration, please see the descriptions of resource isolation below.

**Note**: This PR only introduces the wiring needed for the system below for ease of review.

### Ray's Memory Model

Before we discuss the problem resource isolation is attempting to resolve, let's start with an overview of Ray's memory model. At a high level, Ray's memory usage on each node can be broken down into three parts.

<img width="872" height="284" alt="image" src="https://github.com/user-attachments/assets/1e786a02-d2e7-442e-acf0-83f6ace18359" />

* System memory: the memory usage of Ray system processes. This includes the raylet used to manage all Ray processes running on the node and the agents responsible for emitting observability metrics (and more...).
* Object store memory: the shared memory used for storing the objects produced by the user function. This includes the objects put into the object store via `ray.put`, and the objects you return from a Ray function.
* User memory: the heap memory used by all the workers running user defined tasks (including actor tasks) on the host.

The following sections will focus on isolating the user memory segment from impacting the system processes while enhancing user slice performance even under memory oversubscription.

### The Problem

Ray currently lacks a means to isolate user application processes from system processes that are critical to cluster health. This results in the following problems:

* Under significant resource contention caused by workload oversubscription, critical processes such as the raylet can become starved for resources, which snowballs into raylet stalls and ultimately leads to node deaths.
* When the host itself is under memory contention, the kernel OOM killer will trigger, killing arbitrary processes. As the kernel OOM killer is not workload aware, this may result in significant work lost.

### Why is the existing solution insufficient

Our goal is to provide two guarantees when users run workloads on Ray.

* The system can continue to make progress regardless of the resource usage of the user tasks.
* User workloads should continue to make progress even when under resource contention and OOMs.

To address the first issue, the existing system introduces the `ThresholdMemoryMonitor`. This monitor works by periodically polling the host system's memory usage information to determine the current state of memory utilization, and it will kick off Ray's OOM killing policy when the utilization exceeds a certain threshold. The hope is that we always reserve some amount of free memory (`total_memory - threshold`) for the system processes on the host to make progress, and that the Ray OOM killer is triggered to kill off workers if the threshold is exceeded.

To address the second issue, the existing Ray OOM kill policy will select a single worker to kill each time the threshold is exceeded. This selection is based on the time of the start of execution and attempts to preserve workers that have run longer.

However, we have observed that the poll based memory monitor alone is insufficient for enforcing the memory threshold. This is due to the following issues:

* The poll based model can potentially miss memory burst events between intervals.
* The killing policy may fail to kill aggressively enough to put us back under the threshold.

Overall, the existing solution fails to guarantee that the workload memory usage won't impact the system processes.

### Solution/What we introduce

Cgroups to the rescue! Unlike our existing memory monitoring system, which needs to constantly poll the host system in hopes that we don't miss a memory hungry process, cgroups provide us with tools that enforce memory usage limits for groups of processes.

<img width="1084" height="848" alt="image" src="https://github.com/user-attachments/assets/1a83ffae-b1c1-4ef6-8ccc-17000d0c80b6" />

With this tool, let's first tackle the problem of protecting critical system processes from memory hungry workers. Let's return to our previously described memory model. The system memory slice will remain relatively consistent as Ray is responsible for the system processes, so setting aside a fixed amount of memory for it should be sufficient. The object store and user application memory usage are both dynamic and dependent on user workloads, so it is natural to put both under an upper bound memory constraint that prevents them from eating into system reserved memory.

This all seems great, perhaps a little too good to be true. And unfortunately, cgroup's memory model decided to throw us a [curve ball](https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-ownership). Since the object store is memory shared between the raylet in the system slice and user applications, it can belong to either the system or user slice. So, we address this issue with two separate memory monitors.

* In the top diagram, we consider the case where both object store memory and user application memory remain within the user cgroup. In this scenario, we set a `memory.high` upper bound constraint that prevents the two from exceeding the memory limit, and kick in the event memory monitor to select workers to kill when it is met.
* In the bottom diagram, we consider the case where a portion of the object store memory may have escaped the user cgroup. Since this usage is no longer visible to the user cgroup enforcing the `memory.high`, we introduce the threshold memory monitor to catch this by monitoring the system wide object store usage and the user application usage. This way, we can still catch the scenario where object store memory has escaped our user cgroup protection.

So, the issue of protecting the system processes is resolved. What about ensuring workloads can continue to make progress even under resource contention? This is accomplished with our design as well, as it ensures that the Ray OOM killer will trigger before the kernel OOM killer. This is particularly useful as the Ray OOM killer is workload aware and selects workers to kill based on time since start of execution to approximate killing the worker with the least amount of work done.

### What will change

At the completion of this project, when resource isolation is enabled, the above discussed memory monitoring system will be enabled. When resource isolation is disabled, we will maintain the same behavior as before. However, the new killing policy will be applied to both the existing memory monitoring system with resource isolation disabled, and our new memory monitoring system.

### Performance

So how well do the changes actually protect Ray from kernel OOMs, which are detrimental for performance? Here we show our experiments across simulated (first 4) and real world (last 3) memory heavy workloads.

<img width="1053" height="606" alt="image" src="https://github.com/user-attachments/assets/d89971fe-5a84-4f03-abef-3413ad1e0c1b" />

Additionally, we have also observed while running the workloads above that memory throttling mode successfully eliminates node failures (caused by memory starvation of critical Ray system processes) compared to the existing monitoring system without resource isolation, where we typically observe node failures throughout the video object detection workload.

## Additional information

* PR which introduced the new worker killing policy: #61323
* PR which introduced the pressure memory monitor: #61361
* PR which introduced the event memory monitor on memory throttling mode: #62060

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>

Description
Note: This PR is a no-op. It does not switch out the existing memory monitor with the event memory monitor. The event memory monitor will be integrated into the system when all components of the new memory monitoring system are in place.
The existing threshold based memory monitor works by periodically polling the host's meminfo files for memory usage and triggers the killing policy when the configured memory threshold is crossed. However, this polling scheme is susceptible to missing fast memory burst events that occur between polls and, depending on the configuration, may not kill aggressively enough to enforce the memory threshold, resulting in the kernel OOM killer triggering instead. This is undesirable as kernel OOM kills do not attempt to preserve as much work done as possible.
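To make the missed-burst problem concrete, here is a toy model rather than Ray's monitor. It assumes memory usage is recorded once per time tick and that a poller only looks at every `poll_interval_ticks`-th sample. The function name and all numbers are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (hypothetical): returns true if a poller that samples every
// `poll_interval_ticks` ticks ever observes usage above `threshold`.
bool PollerObservesBreach(const std::vector<uint64_t> &usage_per_tick,
                          size_t poll_interval_ticks,
                          uint64_t threshold) {
  if (poll_interval_ticks == 0) {
    return false;  // A zero interval is treated as "never polls" in this sketch.
  }
  for (size_t tick = 0; tick < usage_per_tick.size(); tick += poll_interval_ticks) {
    if (usage_per_tick[tick] > threshold) {
      return true;
    }
  }
  return false;
}
```

With usage `{10, 10, 95, 10, 10, 10}` and a threshold of 90, polling every tick sees the spike at tick 2, while polling every 3 ticks samples only ticks 0 and 3 and misses it entirely.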
This PR introduces the event based memory monitor, which is designed to work in conjunction with cgroupv2 constraints to enforce a memory limit that can't be surpassed. Specifically, the cgroupv2 memory.high constraint will prevent worker processes from grabbing more than the limit amount of memory by heavily throttling them when the limit is reached. This event based memory monitor will trigger when the memory limit is reached and kill processes to free up resources while trying to preserve as much work done as possible, guaranteeing that the worker processes will not be live-locked on resources.
Related issues
Additional information
For more information on cgroupv2: https://docs.kernel.org/admin-guide/cgroup-v2.html