[Core] (Resource Isolation 10/n) Add event based memory monitor #62060

Merged
MengjinYan merged 13 commits into ray-project:master from Kunchd:event_monitor
Apr 4, 2026

Conversation

Kunchd commented Mar 25, 2026

Contributor

Description

Note: This PR is a no-op. It does not switch out the existing memory monitor with the event memory monitor. The event memory monitor will be integrated into the system when all components of the new memory monitoring system are in place.

The existing threshold based memory monitor works by periodically polling the host's meminfo files for memory usage and triggers the killing policy when the configured memory threshold is crossed. However, this polling scheme is susceptible to missing fast memory bursts that occur between polls and, depending on the configuration, may not kill aggressively enough to enforce the memory threshold, resulting in the kernel OOM killer triggering instead. This is undesirable, as the kernel OOM killer does not attempt to preserve as much completed work as possible.

This PR introduces the event based memory monitor, which is designed to work in conjunction with cgroupv2 constraints to enforce a memory limit that can't be surpassed. Specifically, the cgroupv2 memory.high constraint prevents worker processes from grabbing more than the limit amount of memory by heavily throttling them when the limit is reached. The event based memory monitor triggers when the memory limit is reached and kills processes to free up resources while trying to preserve as much completed work as possible, guaranteeing that the worker processes will not be live-locked on resources.

Related issues

Additional information

For more information on cgroupv2: https://docs.kernel.org/admin-guide/cgroup-v2.html

Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from a team as a code owner March 25, 2026 20:29

if (ret < 0) {
  if (errno == EINTR) {
    continue;

Contributor Author

Here we continue upon receiving EINTR because the event monitor is the main memory monitor and should not be interrupted. In the case where EINTR signals process termination or kill, the parent node manager will signal the event monitor to tear down instead.

Signed-off-by: davik <davik@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new EventMemoryMonitor component designed to detect high memory pressure using cgroupv2's memory.events file and inotify. The monitor operates in a dedicated thread, triggering a KillWorkersCallback when the high memory event count increases. The changes include the implementation, header, build system integration, and comprehensive unit tests for this new monitor. Additionally, the KillWorkersCallback signature was updated across various related components to pass SystemMemorySnapshot by value, improving move semantics. A review comment highlighted critical robustness issues in the EventMemoryMonitor's parsing logic for memory.events, including imprecise string matching, potential integer overflow, and a lack of exception handling during string-to-integer conversion.
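The parsing concerns raised in this review (exact key matching, overflow, and non-throwing conversion) can be illustrated with a minimal sketch. It assumes the cgroup v2 flat-keyed layout of `memory.events` (one `<key> <value>` pair per line). The function name `ParseFlatKeyedCounter` is hypothetical and is not taken from the PR.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper, not the PR's implementation.
// Returns the value stored under the exact key `key` in flat-keyed content
// ("<key> <value>" per line), or std::nullopt if the key is missing or the
// value is not a valid unsigned 64-bit integer.
std::optional<uint64_t> ParseFlatKeyedCounter(std::string_view content,
                                              std::string_view key) {
  while (!content.empty()) {
    // Peel off one line.
    size_t eol = content.find('\n');
    std::string_view line = content.substr(0, eol);
    content = (eol == std::string_view::npos) ? std::string_view{}
                                              : content.substr(eol + 1);

    // Compare the whole key so that "oom" does not match "oom_kill".
    size_t space = line.find(' ');
    if (space == std::string_view::npos || line.substr(0, space) != key) {
      continue;
    }
    std::string_view value = line.substr(space + 1);

    // std::from_chars never throws and reports overflow via the error code.
    uint64_t parsed = 0;
    const char *end = value.data() + value.size();
    auto [ptr, ec] = std::from_chars(value.data(), end, parsed);
    if (ec != std::errc{} || ptr != end) {
      return std::nullopt;  // Empty, malformed, or out-of-range value.
    }
    return parsed;
  }
  return std::nullopt;
}
```

A monitor built on such a helper would compare the returned `high` count against the previously observed count and invoke the kill callback only when it increases.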

Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc
Signed-off-by: davik <davik@anyscale.com>
Comment thread src/ray/common/event_memory_monitor.cc Outdated
Signed-off-by: davik <davik@anyscale.com>
Comment thread src/ray/common/event_memory_monitor.cc
Comment thread src/ray/common/event_memory_monitor.cc
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label Mar 25, 2026
@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Mar 26, 2026
Comment thread src/ray/common/tests/event_memory_monitor_test.cc Outdated
Comment thread src/ray/common/tests/event_memory_monitor_test.cc Outdated
Comment thread src/ray/common/tests/event_memory_monitor_test.cc
<< "Callback should not be called when irrelevant event value changes";
}

TEST_F(EventMemoryMonitorTest, TestMultipleCallbacksOnMultipleChanges) {

Contributor

nit: Is there a point in using multiple latches here instead of just one? If not would prob just combine it with TestCallbackCalledWhenHighEventChanges and just have the latch count start at 3

Contributor Author

The goal here is to test that the monitor can be triggered repeatedly. Latch wait will only unblock when the count reaches 0. So if the latch were set to 3, we would be unable to decrement the events file multiple times.

Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc Outdated
// only return on next new event.
DrainResult drain_result = DrainInotifyBuffer(inotify_fd_);
if (drain_result == DrainResult::kInterrupted) {
// Re-enter poll loop if interrupt was signaled in case a terminate

Contributor

This comment doesn't really make sense to me, didn't you mention above that in the case of a terminate signal the node manager handles it? Isn't this just because the memory monitor shouldn't be interrupted?

Contributor Author

I've updated the comment to clarify this confusion. Basically, if an interrupt is signaled, there are three cases we care about:

  • The signal is SIGKILL. In this case, the thread will be killed and we don't need any special handling.
  • The signal is SIGTERM. In this case, we would like the thread to shut down gracefully via the destructor. Thus, we return to the polling loop in anticipation of the shutdown event fd firing.
  • For other signals, we do not want the monitor to be interrupted.
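The behavior described in these three cases can be sketched as a poll loop that retries on EINTR and only exits when a dedicated shutdown descriptor becomes readable. This is an illustrative sketch, not the code from the PR: `WaitForEventOrShutdown` and `WaitResult` are hypothetical names, and any readable descriptor (for example a pipe) stands in for the inotify fd and the shutdown event fd.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <unistd.h>

enum class WaitResult { kEvent, kShutdown, kError };

// Hypothetical helper: blocks until `event_fd` or `shutdown_fd` is readable.
// A signal that interrupts poll() does not end the wait; the loop re-enters
// poll() so that only the shutdown descriptor can request teardown.
WaitResult WaitForEventOrShutdown(int event_fd, int shutdown_fd) {
  pollfd fds[2] = {{event_fd, POLLIN, 0}, {shutdown_fd, POLLIN, 0}};
  while (true) {
    int ret = poll(fds, 2, /*timeout=*/-1);
    if (ret < 0) {
      if (errno == EINTR) {
        continue;  // Interrupted by a signal; go back to polling.
      }
      return WaitResult::kError;
    }
    // Shutdown takes priority over any pending memory event.
    if (fds[1].revents & POLLIN) {
      return WaitResult::kShutdown;
    }
    if (fds[0].revents & POLLIN) {
      return WaitResult::kEvent;
    }
    // Neither descriptor is readable: treat hangup or error as a failure
    // instead of spinning.
    if ((fds[0].revents | fds[1].revents) & (POLLERR | POLLHUP | POLLNVAL)) {
      return WaitResult::kError;
    }
  }
}
```

With this shape, a graceful shutdown amounts to the owner writing to the shutdown descriptor and then joining the monitor thread.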
Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/pressure_memory_monitor.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread src/ray/common/event_memory_monitor.cc
Comment thread src/ray/common/tests/event_memory_monitor_test.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc

@Sparks0219 Sparks0219 left a comment

Contributor

LGTM, just super minor nits that can be done if you are in an agreeable mood

Kunchd commented Apr 2, 2026

Contributor Author

@edoakes Could you help me merge this.

@MengjinYan MengjinYan left a comment

Contributor

Trusting Josh's review.

@MengjinYan MengjinYan merged commit 8558194 into ray-project:master Apr 4, 2026
6 checks passed
MengjinYan pushed a commit that referenced this pull request Apr 29, 2026
…ry to create new memory monitoring system (#62705)

## Description
This PR creates the wiring needed to support a memory monitoring system
with multiple memory monitors and sets up the cgroup constraints needed
for the resource isolation design described below. Specifically this PR
creates the `multi_monitor_factory` responsible for creating the correct
combination of memory monitors depending on user configuration.
Additionally it modifies the `cgroup_manager` to set up the
`memory.high` constraint needed for providing resource isolation. For
more detail on the expected configuration, please see the descriptions
of resource isolation below.

**Note**: This PR only introduces the wiring needed for the system below
for ease of review.

### Ray's Memory Model
Before we discuss the problem resource isolation is attempting to
resolve, let's start with an overview of Ray's memory model. At a high
level, Ray's memory usage on each node can be broken down into three
parts.
<img width="872" height="284" alt="image"
src="https://github.com/user-attachments/assets/1e786a02-d2e7-442e-acf0-83f6ace18359"
/>
* System memory: the memory usage of ray system processes. This includes
the raylet used to manage all ray processes running on the node and the
agents responsible for emitting observability metrics (and more...).
* Object store memory: the shared memory used for storing the objects
produced by the user function. This includes the objects put into the
object store via `ray.put`, and the objects you return from a ray
function.
* User memory: the heap memory used by all the workers running user
defined tasks (including actor tasks) on the host.

The following sections will focus on isolating the user memory segment
from impacting the system processes while enhancing user slice
performance even under memory oversubscription.

### The Problem
Ray currently lacks a means to isolate user application processes from
system processes that are critical to cluster health. This results in
the following problems:
* Under significant resource contention caused by workload
oversubscription, critical processes such as the raylet can become
starved for resources, which snowballs into raylet stalls and
ultimately node deaths.
* When the host itself is under memory contention, the kernel OOM killer
will trigger, killing arbitrary processes. As the kernel OOM killer is
not workload aware, this may result in significant lost work.

### Why is the existing solution insufficient
Our goal is to provide two guarantees when users run workloads on Ray.
* The system can continue to make progress regardless of the resource
usage of the user tasks.
* User workloads should continue to make progress even when under
resource contention and OOMs.

To address the first issue, the existing system introduces the
`ThresholdMemoryMonitor`. This monitor works by periodically polling the
host system's memory usage information to determine the current state of
memory utilization, and it will kick off Ray's OOM killing policy when
the utilization exceeds a certain threshold. The hope is that we always
reserve some amount of free memory (`total_memory - threshold`) for the
system processes on the host to make progress, and that the Ray OOM
killer kills off workers if the threshold is exceeded.

To address the second issue, the existing Ray OOM kill policy will
select a single worker to kill each time the threshold is exceeded. This
selection is based on the time of the start of execution and attempts to
preserve workers that have run longer.
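
As a rough illustration of the start-time criterion only, the sketch below picks the worker that started most recently, so that longer-running workers are preserved. `WorkerInfo` and `SelectWorkerToKill` are hypothetical names, and the actual Ray policy may weigh additional factors that are not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical record; not Ray's worker type.
struct WorkerInfo {
  int pid;
  int64_t start_time_ms;  // When the worker began executing its task.
};

// Returns the pid of the most recently started worker, or std::nullopt if
// there is nothing to kill.
std::optional<int> SelectWorkerToKill(const std::vector<WorkerInfo> &workers) {
  if (workers.empty()) {
    return std::nullopt;
  }
  auto newest = std::max_element(
      workers.begin(), workers.end(),
      [](const WorkerInfo &a, const WorkerInfo &b) {
        return a.start_time_ms < b.start_time_ms;
      });
  return newest->pid;
}
```

Killing the newest worker approximates discarding the least completed work, since a worker that started later has had less time to make progress.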

However, we have observed that the poll based memory monitor alone is
insufficient for enforcing the memory threshold. This is due to the
following issues:
* The poll based model can potentially miss memory burst events between
intervals.
* The killing policy may fail to kill aggressively enough to put us back
under the threshold.

Overall, the existing solution fails to guarantee that the workload
memory usage won't impact the system processes.
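
To make the first issue concrete, here is a small simulation sketch of a poller sampling a usage trace at a fixed interval. The function name, the trace, and the units are all made up for illustration; nothing here is taken from Ray's monitor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simulation: returns true if a poller that samples every
// `poll_every_n_ticks` ticks (starting at tick 0) ever sees usage above
// `threshold`.
bool PollerObservesBreach(const std::vector<int64_t> &usage_per_tick,
                          size_t poll_every_n_ticks,
                          int64_t threshold) {
  if (poll_every_n_ticks == 0) {
    return false;  // A poller that never advances cannot sample the trace.
  }
  for (size_t tick = 0; tick < usage_per_tick.size();
       tick += poll_every_n_ticks) {
    if (usage_per_tick[tick] > threshold) {
      return true;
    }
  }
  return false;
}
```

For a trace such as `{40, 45, 98, 50, 42, 41}` with a threshold of 95, sampling every 4 ticks only looks at ticks 0 and 4 and never sees the burst at tick 2, while sampling every tick does.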

### Solution/What we introduce
Cgroups to the rescue! Unlike our existing memory monitoring system,
which needs to constantly poll the host system in hopes that we don't
miss a memory hungry process, cgroups provide us with tools that
enforce memory usage limits for groups of processes.

<img width="1084" height="848" alt="image"
src="https://github.com/user-attachments/assets/1a83ffae-b1c1-4ef6-8ccc-17000d0c80b6"
/>
With this tool, let's first tackle the problem of protecting critical
system processes from memory hungry workers. Let's return to our
previously described memory model.

The system memory slice will remain relatively consistent as Ray is
responsible for the system processes, so setting aside a fixed amount of
memory for it should be sufficient. The object store and user
application memory usage are both dynamic and dependent on user
workloads, so it is natural to put both under an upper bound memory
constraint that prevents them from eating into system reserved memory.
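
The arithmetic behind this split can be sketched as follows, assuming the upper bound for the user slice is simply the host total minus a fixed system reservation. The function name and parameters are hypothetical, and how Ray actually derives the value written to `memory.high` is not specified here.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr int64_t kGiB = int64_t{1} << 30;

// Hypothetical helper: returns the byte limit for the user slice, or
// std::nullopt if the reservation leaves no memory for user workloads.
std::optional<int64_t> UserSliceMemoryHigh(int64_t total_memory_bytes,
                                           int64_t system_reserved_bytes) {
  if (total_memory_bytes <= 0 || system_reserved_bytes < 0 ||
      system_reserved_bytes >= total_memory_bytes) {
    return std::nullopt;
  }
  return total_memory_bytes - system_reserved_bytes;
}
```

For example, a 64 GiB host with 4 GiB set aside for system processes would leave 60 GiB as the shared bound for the object store and user application memory.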

This all seems great, perhaps a little too good to be true. And
unfortunately, cgroup's memory model decided to throw us a [curve
ball](https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-ownership).
Since object store memory is shared between the raylet in the system
slice and user applications, it can belong to either the system or user
slice.

So, we address this issue with two separate memory monitors.
* In the top diagram, we consider the case where both object store
memory and user application memory remain within the user cgroup. In
this scenario, we set a `memory.high` upper bound constraint that
prevents the two from exceeding the memory limit, and kick in the event
memory monitor to select workers to kill when the limit is reached.
* In the bottom diagram, we consider the case where a portion of the
object store memory may have escaped the user cgroup. Since this usage
is no longer visible to the user cgroup enforcing the `memory.high`, we
introduce the threshold memory monitor to catch this by monitoring the
system wide object store usage and the user application usage. This way,
we can still catch the scenario where object store memory has escaped
our user cgroup protection.

So, the issue of protecting the system processes is resolved. What about
ensuring workloads can continue to make progress even under resource
contention? Our design accomplishes this as well, since it ensures
that the Ray OOM killer will trigger before the kernel OOM killer. This
is particularly useful as the Ray OOM killer is workload aware and
selects workers to kill based on time since start of execution to
approximate killing the worker with the least amount of work done.

### What will change
At the completion of this project, when resource isolation is enabled,
the above discussed memory monitoring system will be enabled. When
resource isolation is disabled, we will maintain the same behavior as
before. However, the new killing policy will be applied to both the
existing memory monitoring system with resource isolation disabled, and
our new memory monitoring system.

### Performance
So how well do the changes actually protect Ray from kernel OOMs which
are detrimental for performance? Here we show our experiments across
simulated (first 4) and real world (last 3) memory heavy workloads.
<img width="1053" height="606" alt="image"
src="https://github.com/user-attachments/assets/d89971fe-5a84-4f03-abef-3413ad1e0c1b"
/>


Additionally, we have also observed while running the workloads above
that memory throttling mode successfully eliminates node failures
(caused by memory starvation of critical ray system processes) compared
to the existing monitoring system without resource isolation, where we
typically observe node failures throughout the video object detection
workload.

## Additional information
* PR which introduced the new worker killing policy:
#61323
* PR which introduced the pressure memory monitor:
#61361
* PR which introduced the event memory monitor on memory throttling
mode: #62060

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
@Kunchd Kunchd deleted the event_monitor branch May 13, 2026 00:32
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…project#62060)
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ry to create new memory monitoring system (ray-project#62705)