[core] Run GCS health check on `io_service` by edoakes · Pull Request #62374 · ray-project/ray

edoakes · 2026-04-06T21:18:26Z

Previously, the GCS health check used the built-in implementation from gRPC, which runs on the internal gRPC server threads rather than on our boost::asio event loop(s). This is a poor indicator of system health: if the io_service is stuck, the GCS makes no progress, yet the health check keeps passing.

This PR overrides the health check implementation to post a callback to the io_service instead. In a future PR I will extend this to also check the health of other control plane threads.
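To illustrate the idea, here is a minimal sketch of a health check that succeeds only when the event loop actually runs a posted callback. It is not the code from this PR: `EventLoop` is a hypothetical single-threaded stand-in for a `boost::asio` io_context, and `CheckHealth` returns a plain bool rather than a gRPC status.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for an io_context that is run by a single thread.
class EventLoop {
 public:
  EventLoop() : worker_([this] { Run(); }) {}

  ~EventLoop() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopped_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Queue a task to run on the loop thread.
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopped and fully drained
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();  // a task that blocks here stalls every task queued behind it
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool stopped_ = false;
  std::thread worker_;  // declared last so the other members exist before Run() starts
};

// Returns true only if a callback posted to the loop runs before the timeout.
bool CheckHealth(EventLoop& loop, std::chrono::milliseconds timeout) {
  // The promise is shared so a late callback can still complete safely
  // after this function has already returned on timeout.
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> ready = done->get_future();
  loop.Post([done] { done->set_value(); });
  return ready.wait_for(timeout) == std::future_status::ready;
}
```

With a responsive loop, `CheckHealth` returns true almost immediately. If a task blocks the loop thread, the probe sits in the queue and the call returns false once the timeout elapses, which is the situation a check running purely on the gRPC server threads cannot detect.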

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a custom gRPC health check service for the GCS server that operates on the io_context event loop, ensuring that health checks time out if the event loop becomes unresponsive. The review feedback identifies a significant thread-safety issue regarding the global toggling of the default gRPC health check service and recommends better adherence to the gRPC health checking protocol, specifically by handling the service field in requests and acknowledging the missing Watch method.
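For context on the `service` field mentioned in the review: in the gRPC health checking protocol, a `Check` request may name a service, an empty name conventionally refers to the server as a whole, and a name the server does not know should yield a `NOT_FOUND` status. The sketch below shows one way a handler could honor the field. `HealthRegistry`, `CheckResult`, and `ServingStatus` are hypothetical names used for illustration; they are not types from Ray or gRPC, and this is not the code under review.

```cpp
#include <map>
#include <mutex>
#include <string>

enum class ServingStatus { kUnknown, kServing, kNotServing };

struct CheckResult {
  bool found;            // false would map to a NOT_FOUND RPC status
  ServingStatus status;  // meaningful only when found is true
};

// Hypothetical per-server registry keyed by service name.
// The empty string is used for the overall server status.
class HealthRegistry {
 public:
  void Set(const std::string& service, ServingStatus status) {
    std::lock_guard<std::mutex> lock(mu_);
    statuses_[service] = status;
  }

  CheckResult Check(const std::string& service) const {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = statuses_.find(service);
    if (it == statuses_.end()) {
      return {false, ServingStatus::kUnknown};
    }
    return {true, it->second};
  }

 private:
  mutable std::mutex mu_;
  std::map<std::string, ServingStatus> statuses_;
};
```

Because the state lives in an instance guarded by its own mutex, lookups and updates from different threads do not race with each other.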

edoakes · 2026-04-08T16:17:28Z

+  // The health check should time out because the handler is posted to the
+  // main io_context which is no longer processing events.
+  status = CheckHealth(std::chrono::milliseconds(100));
+  ASSERT_FALSE(status.ok());
+  EXPECT_EQ(status.error_code(), grpc::StatusCode::DEADLINE_EXCEEDED);


I verified that this check fails if I revert the change to gcs_server.cc. In other words, the health check previously passed even when the io_service was blocked, and the test now guards against that behavior.

rueian

LGTM. Just left a few thoughts:

1. This drops support for /grpc.health.v1.Health/Watch, which may be okay because it is seldom used.
2. A health check client will never see a NOT_SERVING response if the io_context is blocked; it has to wait until its own timeout is reached. I wonder if it would be better to monitor the lag of the io_context (which we already do) and have the health check server return NOT_SERVING once the lag is too large.

edoakes · 2026-04-08T18:49:06Z

(2) is a very good point. I agree this would likely be preferable, and it would also allow us to return a more detailed explanation of why the check is returning NOT_SERVING. I think we should probably merge this PR as a simple fix, and I will prototype your suggestion.

rueian · 2026-04-08T19:14:04Z

> (2) is a very good point. I agree this would likely be preferable, and it would also allow us to return a more detailed explanation of why the check is returning NOT_SERVING. I think we should probably merge this PR as a simple fix, and I will prototype your suggestion.

Oh, in that case I think we can still use the gRPC health check implementation. We just need to call SetServingStatus(...) on it from our lag monitor.
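A rough sketch of this idea, assuming the lag monitor produces periodic lag samples: a small gate compares each sample to a threshold and flips the serving flag only on transitions. `LagHealthGate` and its threshold are hypothetical. In a real server, the injected `set_serving` callback might forward to `grpc::HealthCheckServiceInterface::SetServingStatus(bool)` on the object returned by `grpc::Server::GetHealthCheckService()`.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical gate between a lag monitor and a health check service.
class LagHealthGate {
 public:
  LagHealthGate(std::chrono::milliseconds max_lag,
                std::function<void(bool)> set_serving)
      : max_lag_(max_lag), set_serving_(std::move(set_serving)) {}

  // Feed one lag measurement. The callback fires only when the
  // serving state changes, not on every sample.
  void OnLagSample(std::chrono::milliseconds lag) {
    const bool serving = lag <= max_lag_;
    if (serving != serving_) {
      serving_ = serving;
      set_serving_(serving);
    }
  }

  bool serving() const { return serving_; }

 private:
  std::chrono::milliseconds max_lag_;
  std::function<void(bool)> set_serving_;
  bool serving_ = true;  // assume healthy until a sample says otherwise
};
```

The gate is not thread-safe as written, and it only reacts when a sample arrives, so on its own it cannot report a loop that is too stuck to produce samples.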

Yicheng-Lu-llll · 2026-04-08T21:43:26Z

> I wonder if it would be better to monitor the lag of the io_context (which we already do) and have the health check server return NOT_SERVING once the lag is too large.

Things might become complex here. If the io_context is completely stuck, the lag probe's callback, which is itself posted on the io_context, won't be able to execute either (or will take a long time to run), which means it can't call SetServingStatus(false) during this window.

edoakes · 2026-04-08T22:17:47Z

> Things might become complex here. If the io_context is completely stuck, the lag probe's callback, which is itself posted on the io_context, won't be able to execute either (or will take a long time to run), which means it can't call SetServingStatus(false) during this window.

One option is to actively probe the io_context inside the health check but respond internally with NOT_SERVING if the callback doesn't run within a timeout.
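This option can be sketched as follows, under the assumption that the health check handler is allowed to block for a short internal timeout. `ProbeEventLoop`, `HealthReply`, `ServingStatus`, and `PostFn` are hypothetical names; `post` stands in for posting a callback to the io_context.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <string>

enum class ServingStatus { kServing, kNotServing };

struct HealthReply {
  ServingStatus status;
  std::string detail;  // human-readable reason, empty when serving
};

// Hypothetical hook that hands a callback to the event loop.
using PostFn = std::function<void(std::function<void()>)>;

// Posts a probe and answers NOT_SERVING itself if the probe does not run
// in time, instead of leaving the client to hit its own deadline.
HealthReply ProbeEventLoop(const PostFn& post,
                           std::chrono::milliseconds internal_timeout) {
  // Shared so the callback stays valid even if it runs after the timeout.
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> ready = done->get_future();
  post([done] { done->set_value(); });

  if (ready.wait_for(internal_timeout) == std::future_status::ready) {
    return {ServingStatus::kServing, ""};
  }
  return {ServingStatus::kNotServing,
          "event loop did not run probe within " +
              std::to_string(internal_timeout.count()) + " ms"};
}
```

The internal timeout would need to be shorter than the deadlines clients typically use, otherwise they would still observe a timeout before the NOT_SERVING reply arrives. The calling thread is blocked for up to that timeout on each check.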

Yicheng-Lu-llll · 2026-04-08T22:31:08Z

> One option is to actively probe the io_context inside the health check but respond internally with NOT_SERVING if the callback doesn't run within a timeout.

This sounds great!

rueian · 2026-04-08T22:33:35Z

> I wonder if it would be better to monitor the lag of the io_context (which we already do) and have the health check server return NOT_SERVING once the lag is too large.

> Things might become complex here. If the io_context is completely stuck, the lag probe's callback, which is itself posted on the io_context, won't be able to execute either (or will take a long time to run), which means it can't call SetServingStatus(false) during this window.

Oh, that means we need to fix the lag monitor as well; otherwise, we won't see the correct lag metric when the io_context is lagging.

edoakes · 2026-04-08T23:11:47Z

@rueian and I discussed offline and the plan is to move the lag probe and this health checking to a standalone thread. I will work on this.

Adds implementations for `IOContextMonitor` (core logic, unit testable) and `IOContextMonitorThread` (wraps the monitor to run periodically). These will be used for two purposes:

- Replacing our existing "lag probe" implementation. We currently post the lag probe from within the IO context itself. This means if the loop itself is blocked/unhealthy, we cannot trust the metrics.
- Improving our health checks in the GCS & Raylet. #62374 modified the GCS health check to post to the io_context, but this still has limitations: it only checks the io_service, not others, and if the io_service is blocked then we will stop responding to the health check (ideally we would return NOT_SERVING with a useful status message).

In addition, I am adding a new gauge to indicate if each IO context is currently healthy vs. unhealthy.

In follow-up PRs, I will integrate this class into the GCS, Raylet, and Core Worker. Then we can remove our existing lag probe implementation.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
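To make the "monitor from outside the loop" idea concrete, here is a minimal sketch. It is not the `IOContextMonitor` from the referenced PR: `LoopMonitor`, `Tick`, and `PostFn` are hypothetical names, and the health rule is deliberately simplified to "no probe has been outstanding for longer than `max_lag`". The clock is passed in so the logic can be unit tested without threads or sleeps.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <memory>
#include <utility>

using Clock = std::chrono::steady_clock;
// Hypothetical hook that hands a callback to the monitored event loop.
using PostFn = std::function<void(std::function<void()>)>;

// Tick() is meant to be called periodically from a thread other than the
// loop's own, so a stuck loop cannot prevent the monitor from running.
class LoopMonitor {
 public:
  LoopMonitor(PostFn post, std::chrono::milliseconds max_lag)
      : post_(std::move(post)), max_lag_(max_lag) {}

  // Returns whether the loop looks healthy at time `now`.
  bool Tick(Clock::time_point now) {
    if (pending_->load()) {
      // The previous probe has not run yet; its lag is at least now - posted_at_.
      healthy_ = (now - posted_at_) <= max_lag_;
    } else {
      // The previous probe completed (or none was sent): send a new one.
      posted_at_ = now;
      pending_->store(true);
      auto pending = pending_;  // keep the flag alive for a late callback
      post_([pending] { pending->store(false); });
      healthy_ = true;
    }
    return healthy_;
  }

  // Could back a healthy/unhealthy gauge.
  bool healthy() const { return healthy_; }

 private:
  PostFn post_;
  std::chrono::milliseconds max_lag_;
  std::shared_ptr<std::atomic<bool>> pending_ =
      std::make_shared<std::atomic<bool>>(false);
  Clock::time_point posted_at_{};
  bool healthy_ = true;
};
```

A thin wrapper thread could call `Tick(Clock::now())` at a fixed interval and publish `healthy()` both as a metric and as the input to the health check, so that neither depends on the monitored loop making progress.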

edoakes added 18 commits April 3, 2026 09:58

- 4755d51 WIP
- 0688c62 WIP
- 49e0af9 fixes
- 564ac62 fix
- 7533448 fix overflow
- 11df7ad fix overflow
- 6c3fae6 Merge branch 'master' of https://github.com/ray-project/ray into eoakes/unavailable-backoff
- 0cb7555 fix
- 9cb897a fix
- fcfa76c fix
- 1f166ba fix
- 011e1ec fix
- b138499 fix
- 8787ba2 fix
- e4128a5 fix
- 7873254 Merge branch 'eoakes/fix-overflow' into eoakes/unavailable-backoff
- 0ff2f62 fix
- 73808ab Merge branch 'master' of https://github.com/ray-project/ray into eoakes/gcs-hc

edoakes added the `go` label (add ONLY when ready to merge, run all tests) Apr 6, 2026

gemini-code-assist Bot reviewed Apr 6, 2026

Comment thread src/ray/rpc/grpc_server.cc Outdated

Comment thread src/ray/gcs/grpc_services.h

Comment thread src/ray/gcs/grpc_services.h

edoakes added 6 commits April 7, 2026 08:05

- c13eafa Merge branch 'master' of https://github.com/ray-project/ray into eoakes/gcs-hc
- a0e3be0 fix
- 0c445c9 fix
- 4823892 WIP
- a6d1e53 WIP
- c415d54 fix

edoakes commented Apr 8, 2026

Comment thread src/ray/rpc/grpc_server.h Outdated

edoakes added 3 commits April 8, 2026 07:16

- 31846c2 Merge branch 'master' of https://github.com/ray-project/ray into eoakes/gcs-hc
- a839867 cleanup
- 80f7d96 cleanup

edoakes changed the title from ~~[WIP][core] Run GCS health check on io_service~~ to [core] Run GCS health check on `io_service` Apr 8, 2026

edoakes marked this pull request as ready for review April 8, 2026 15:42

edoakes requested a review from a team as a code owner April 8, 2026 15:42

edoakes commented Apr 8, 2026

Comment thread src/ray/gcs/tests/gcs_health_check_service_test.cc Outdated

edoakes added 3 commits April 8, 2026 09:05

- 0a48698 proper tests
- 755a5c6 proper tests
- f345145 WIP

edoakes commented Apr 8, 2026

edoakes assigned rueian and Yicheng-Lu-llll Apr 8, 2026

rueian approved these changes Apr 8, 2026

ray-gardener Bot added the `core` (Issues that should be addressed in Ray Core) and `observability` (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Apr 8, 2026

edoakes merged commit 43dcc92 into ray-project:master Apr 8, 2026
5 of 6 checks passed

This was referenced Apr 10, 2026

- [core] Add IOContextMonitor implementation #62501 (Closed)
- [core] Add IOContextMonitor implementation #62608 (Merged)

edoakes merged 30 commits into ray-project:master from edoakes:eoakes/gcs-hc
