[core] Run GCS health check on io_service#62374
Conversation
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…es/unavailable-backoff
There was a problem hiding this comment.
Code Review
This pull request introduces a custom gRPC health check service for the GCS server that operates on the io_context event loop, ensuring that health checks time out if the event loop becomes unresponsive. The review feedback identifies a significant thread-safety issue regarding the global toggling of the default gRPC health check service and recommends better adherence to the gRPC health checking protocol, specifically by handling the service field in requests and acknowledging the missing Watch method.
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
| // The health check should time out because the handler is posted to the | ||
| // main io_context which is no longer processing events. | ||
| status = CheckHealth(std::chrono::milliseconds(100)); | ||
| ASSERT_FALSE(status.ok()); | ||
| EXPECT_EQ(status.error_code(), grpc::StatusCode::DEADLINE_EXCEEDED); |
There was a problem hiding this comment.
I verified that this check fails if I revert the change to gcs_server.cc, meaning that previously the health check was passing if the io_service was blocked and the test is guarding against this behavior.
There was a problem hiding this comment.
LGTM. Just left a few thoughts:
- This drops the support for
/grpc.health.v1.Health/Watch, which may be okay because it is seldom used. - A healthcheck client will never see a NOT_SERVING response if the io_context is blocked. It will need to wait until the timeout specified by itself is reached. I wonder if it would be better if we monitor the lag of the io_context (we did) and let the health check server return NOT_SERVING once we found the lag is too large.
|
(2) is a very good point. I agree this would be likely preferable, and would also allow us to return a more detailed explanation of why it is returning |
Oh, then, I think we can still use the gRPC healthcheck implementation. We just need to toggle the |
Things might become complex here. if the io_context is completely stuck, the lag probe's callback itself is also posted on the io_context, so it won't be able to execute either or needs a long time, which means it can't call SetServingStatus(false) during this window. |
One option is to actively probe the |
This sounds great! |
Oh, that means we need to fix the lag monitor as well; otherwise, we won't see the correct lag metric when the io_context is lagged. |
|
@rueian and I discussed offline and the plan is to move the lag probe and this health checking to a standalone thread. I will work on this. |
Previously, the GCS health check was implemented using the builtin implementation from gRPC. This runs on the internal gRPC server threads rather than on our `boost::asio` event loop(s). This is a poor indicator of system health, as if the `io_service` is stuck, the GCS will make no progress but the health check will be passing. This PR overrides the health check implementation to post a callback to the `io_service` instead. In a future PR I will extend this to also check the health of other control plane threads. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Adds implementations for `IOContextMonitor` (core logic, unit testable) and `IOContextMonitorThread` (wraps the monitor to run periodically). These will be used for two purposes: - Replacing our existing "lag probe" implementation. We currently post the lag probe from within the IO context itself. This means if the loop itself is blocked/unhealthy, we cannot trust the metrics. - Improving our health checks in the GCS & Raylet. #62374 modified the GCS health check to post to the io_context, but this still has limitations: it only checks the io_service, not others, and if the io_service is blocked then we will stop responding to the health check (ideally we would return NOT_SERVING with a useful status message). In addition, I am adding a new gauge to indicate if each IO context is currently healthy vs. unhealthy. In follow-up PRs, I will integrate this class into the GCS, Raylet, and Core Worker. Then we can remove our existing lag probe implementation. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Adds implementations for `IOContextMonitor` (core logic, unit testable) and `IOContextMonitorThread` (wraps the monitor to run periodically). These will be used for two purposes: - Replacing our existing "lag probe" implementation. We currently post the lag probe from within the IO context itself. This means if the loop itself is blocked/unhealthy, we cannot trust the metrics. - Improving our health checks in the GCS & Raylet. ray-project#62374 modified the GCS health check to post to the io_context, but this still has limitations: it only checks the io_service, not others, and if the io_service is blocked then we will stop responding to the health check (ideally we would return NOT_SERVING with a useful status message). In addition, I am adding a new gauge to indicate if each IO context is currently healthy vs. unhealthy. In follow-up PRs, I will integrate this class into the GCS, Raylet, and Core Worker. Then we can remove our existing lag probe implementation. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Adds implementations for `IOContextMonitor` (core logic, unit testable) and `IOContextMonitorThread` (wraps the monitor to run periodically). These will be used for two purposes: - Replacing our existing "lag probe" implementation. We currently post the lag probe from within the IO context itself. This means if the loop itself is blocked/unhealthy, we cannot trust the metrics. - Improving our health checks in the GCS & Raylet. ray-project#62374 modified the GCS health check to post to the io_context, but this still has limitations: it only checks the io_service, not others, and if the io_service is blocked then we will stop responding to the health check (ideally we would return NOT_SERVING with a useful status message). In addition, I am adding a new gauge to indicate if each IO context is currently healthy vs. unhealthy. In follow-up PRs, I will integrate this class into the GCS, Raylet, and Core Worker. Then we can remove our existing lag probe implementation. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Previously, the GCS health check was implemented using the builtin implementation from gRPC. This runs on the internal gRPC server threads rather than on our
boost::asioevent loop(s). This is a poor indicator of system health, as if theio_serviceis stuck, the GCS will make no progress but the health check will be passing.This PR overrides the health check implementation to post a callback to the
io_serviceinstead. In a future PR I will extend this to also check the health of other control plane threads.