Skip to content

[core] Run GCS health check on io_service#62374

Merged
edoakes merged 30 commits into
ray-project:masterfrom
edoakes:eoakes/gcs-hc
Apr 8, 2026
Merged

[core] Run GCS health check on io_service#62374
edoakes merged 30 commits into
ray-project:masterfrom
edoakes:eoakes/gcs-hc

Conversation

@edoakes

@edoakes edoakes commented Apr 6, 2026

Copy link
Copy Markdown
Collaborator

Previously, the GCS health check was implemented using the builtin implementation from gRPC. This runs on the internal gRPC server threads rather than on our boost::asio event loop(s). This is a poor indicator of system health, as if the io_service is stuck, the GCS will make no progress but the health check will be passing.

This PR overrides the health check implementation to post a callback to the io_service instead. In a future PR I will extend this to also check the health of other control plane threads.

edoakes added 18 commits April 3, 2026 09:58
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Apr 6, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom gRPC health check service for the GCS server that operates on the io_context event loop, ensuring that health checks time out if the event loop becomes unresponsive. The review feedback identifies a significant thread-safety issue regarding the global toggling of the default gRPC health check service and recommends better adherence to the gRPC health checking protocol, specifically by handling the service field in requests and acknowledging the missing Watch method.

Comment thread src/ray/rpc/grpc_server.cc Outdated
Comment thread src/ray/gcs/grpc_services.h
Comment thread src/ray/gcs/grpc_services.h
edoakes added 6 commits April 7, 2026 08:05
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Comment thread src/ray/rpc/grpc_server.h Outdated
edoakes added 3 commits April 8, 2026 07:16
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@edoakes edoakes changed the title [WIP][core] Run GCS health check on io_service Apr 8, 2026
@edoakes edoakes marked this pull request as ready for review April 8, 2026 15:42
@edoakes edoakes requested a review from a team as a code owner April 8, 2026 15:42
Comment thread src/ray/gcs/tests/gcs_health_check_service_test.cc Outdated
edoakes added 3 commits April 8, 2026 09:05
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Comment on lines +560 to +564
// The health check should time out because the handler is posted to the
// main io_context which is no longer processing events.
status = CheckHealth(std::chrono::milliseconds(100));
ASSERT_FALSE(status.ok());
EXPECT_EQ(status.error_code(), grpc::StatusCode::DEADLINE_EXCEEDED);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this check fails if I revert the change to gcs_server.cc, meaning that previously the health check was passing if the io_service was blocked and the test is guarding against this behavior.

@rueian rueian left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just left a few thoughts:

  1. This drops the support for /grpc.health.v1.Health/Watch, which may be okay because it is seldom used.
  2. A healthcheck client will never see a NOT_SERVING response if the io_context is blocked. It will need to wait until the timeout specified by itself is reached. I wonder if it would be better if we monitor the lag of the io_context (we did) and let the health check server return NOT_SERVING once we found the lag is too large.
@edoakes

edoakes commented Apr 8, 2026

Copy link
Copy Markdown
Collaborator Author

(2) is a very good point. I agree this would be likely preferable, and would also allow us to return a more detailed explanation of why it is returning NOT_SERVING. I think we should probably merge this PR as a simple fix, and I will prototype your suggestion.

@rueian

rueian commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

(2) is a very good point. I agree this would be likely preferable, and would also allow us to return a more detailed explanation of why it is returning NOT_SERVING. I think we should probably merge this PR as a simple fix, and I will prototype your suggestion.

Oh, then, I think we can still use the gRPC healthcheck implementation. We just need to toggle the SetServingStatus(...) method on the gRPC healthcheck implementation in our lag monitor.

@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Apr 8, 2026
@Yicheng-Lu-llll

Yicheng-Lu-llll commented Apr 8, 2026

Copy link
Copy Markdown
Member

I wonder if it would be better if we monitor the lag of the io_context (we did) and let the health check server return NOT_SERVING once we found the lag is too large.

Things might become complex here. if the io_context is completely stuck, the lag probe's callback itself is also posted on the io_context, so it won't be able to execute either or needs a long time, which means it can't call SetServingStatus(false) during this window.

@edoakes

edoakes commented Apr 8, 2026

Copy link
Copy Markdown
Collaborator Author

Things might become complex here. if the io_context is completely stuck, the lag probe's callback itself is also posted on the io_context, so it won't be able to execute either or needs a long time, which means it can't call SetServingStatus(false) during this window.

One option is to actively probe the io_context inside the health check but respond internally with NOT_SERVING if the callback doesn't run within a timeout.

@edoakes edoakes merged commit 43dcc92 into ray-project:master Apr 8, 2026
5 of 6 checks passed
@Yicheng-Lu-llll

Copy link
Copy Markdown
Member

One option is to actively probe the io_context inside the health check but respond internally with NOT_SERVING if the callback doesn't run within a timeout.

This sounds great!

@rueian

rueian commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

I wonder if it would be better if we monitor the lag of the io_context (we did) and let the health check server return NOT_SERVING once we found the lag is too large.

Things might become complex here. if the io_context is completely stuck, the lag probe's callback itself is also posted on the io_context, so it won't be able to execute either or needs a long time, which means it can't call SetServingStatus(false) during this window.

Oh, that means we need to fix the lag monitor as well; otherwise, we won't see the correct lag metric when the io_context is lagged.

@edoakes

edoakes commented Apr 8, 2026

Copy link
Copy Markdown
Collaborator Author

@rueian and I discussed offline and the plan is to move the lag probe and this health checking to a standalone thread. I will work on this.

Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
Previously, the GCS health check was implemented using the builtin
implementation from gRPC. This runs on the internal gRPC server threads
rather than on our `boost::asio` event loop(s). This is a poor indicator
of system health, as if the `io_service` is stuck, the GCS will make no
progress but the health check will be passing.

This PR overrides the health check implementation to post a callback to
the `io_service` instead. In a future PR I will extend this to also
check the health of other control plane threads.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes added a commit that referenced this pull request May 28, 2026
Adds implementations for `IOContextMonitor` (core logic, unit testable)
and `IOContextMonitorThread` (wraps the monitor to run periodically).

These will be used for two purposes:

- Replacing our existing "lag probe" implementation. We currently post
the lag probe from within the IO context itself. This means if the loop
itself is blocked/unhealthy, we cannot trust the metrics.
- Improving our health checks in the GCS & Raylet.
#62374 modified the GCS health
check to post to the io_context, but this still has limitations: it only
checks the io_service, not others, and if the io_service is blocked then
we will stop responding to the health check (ideally we would return
NOT_SERVING with a useful status message).

In addition, I am adding a new gauge to indicate if each IO context is
currently healthy vs. unhealthy.

In follow-up PRs, I will integrate this class into the GCS, Raylet, and
Core Worker. Then we can remove our existing lag probe implementation.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
Adds implementations for `IOContextMonitor` (core logic, unit testable)
and `IOContextMonitorThread` (wraps the monitor to run periodically).

These will be used for two purposes:

- Replacing our existing "lag probe" implementation. We currently post
the lag probe from within the IO context itself. This means if the loop
itself is blocked/unhealthy, we cannot trust the metrics.
- Improving our health checks in the GCS & Raylet.
ray-project#62374 modified the GCS health
check to post to the io_context, but this still has limitations: it only
checks the io_service, not others, and if the io_service is blocked then
we will stop responding to the health check (ideally we would return
NOT_SERVING with a useful status message).

In addition, I am adding a new gauge to indicate if each IO context is
currently healthy vs. unhealthy.

In follow-up PRs, I will integrate this class into the GCS, Raylet, and
Core Worker. Then we can remove our existing lag probe implementation.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
Adds implementations for `IOContextMonitor` (core logic, unit testable)
and `IOContextMonitorThread` (wraps the monitor to run periodically).

These will be used for two purposes:

- Replacing our existing "lag probe" implementation. We currently post
the lag probe from within the IO context itself. This means if the loop
itself is blocked/unhealthy, we cannot trust the metrics.
- Improving our health checks in the GCS & Raylet.
ray-project#62374 modified the GCS health
check to post to the io_context, but this still has limitations: it only
checks the io_service, not others, and if the io_service is blocked then
we will stop responding to the health check (ideally we would return
NOT_SERVING with a useful status message).

In addition, I am adding a new gauge to indicate if each IO context is
currently healthy vs. unhealthy.

In follow-up PRs, I will integrate this class into the GCS, Raylet, and
Core Worker. Then we can remove our existing lag probe implementation.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

3 participants