[core] fix HandleIsLocalWorkerDead for drivers#62688
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request updates the NodeManager::HandleIsLocalWorkerDead logic to correctly identify registered drivers as alive, preventing premature disconnections. It includes a regression test for the core worker client pool and new unit tests for the node manager. The review feedback identifies a logic error in the new tests where mock expectations would fail due to boolean short-circuiting in the implementation, and suggests verifying callback execution for better test robustness.
965fcb2 to
bd78c01
Compare
dayshah
approved these changes
Apr 16, 2026
dayshah
left a comment
Contributor
There was a problem hiding this comment.
Oh wow, I'm surprised we haven't found more bugs because of this
Collaborator
|
cpp tests failing @rueian |
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
bd78c01 to
eb7bd98
Compare
HLDKNotFound
pushed a commit
to chichic21039/ray
that referenced
this pull request
Apr 22, 2026
This is found during debugging a hanging streaming generator. We have
these logs:
```
{"asctime":"2026-04-09 00:33:06,719","levelname":"W","message":"Core worker 10.0.191.167 has been unavailable for more than 1 seconds","filename":"retryable_grpc_client.cc","lineno":85}
{"asctime":"2026-04-09 00:33:06,724","levelname":"I","message":"Disconnecting core worker client because the worker is dead","filename":"core_worker_client_pool.cc","lineno":55,"worker_id":"02000000ffffffffffffffffffffffffffffffffffffffffffffffff","node_id":"576a2709c6d48e4fdfabe9d993b315652ee8e53d0096bd93bc368379"}
{"asctime":"2026-04-09 00:33:06,724","levelname":"W","message":"Failed to report streaming generator return to the caller. The yield'ed ObjectRef may not be usable. Disconnected: GRPC client is shut down.","filename":"core_worker.cc","lineno":3206,"object_id":"d337091b4a9d35daffffffffffffffffffffffff0200000003000000"}
```
The streaming generator above was not able to send the yield back to the
owner (the driver) because the `RetryableGrpcClient` had been shut down
after the owner was unavailable for more than 1 second, probably due to
being overloaded.
The `RetryableGrpcClient` was shut down because it got `the worker is
dead` message from the raylet on the head node via the
`raylet_client->IsLocalWorkerDead()`, and that message was wrong.
The raylet on the head said the worker was dead wrongly because it
didn't check the drivers it had.
So the fix is simply to let it check the drivers as well, so that
`RetryableGrpcClient` won't be shut down incorrectly.
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Lucas61000
pushed a commit
to Lucas61000/ray
that referenced
this pull request
May 15, 2026
This is found during debugging a hanging streaming generator. We have
these logs:
```
{"asctime":"2026-04-09 00:33:06,719","levelname":"W","message":"Core worker 10.0.191.167 has been unavailable for more than 1 seconds","filename":"retryable_grpc_client.cc","lineno":85}
{"asctime":"2026-04-09 00:33:06,724","levelname":"I","message":"Disconnecting core worker client because the worker is dead","filename":"core_worker_client_pool.cc","lineno":55,"worker_id":"02000000ffffffffffffffffffffffffffffffffffffffffffffffff","node_id":"576a2709c6d48e4fdfabe9d993b315652ee8e53d0096bd93bc368379"}
{"asctime":"2026-04-09 00:33:06,724","levelname":"W","message":"Failed to report streaming generator return to the caller. The yield'ed ObjectRef may not be usable. Disconnected: GRPC client is shut down.","filename":"core_worker.cc","lineno":3206,"object_id":"d337091b4a9d35daffffffffffffffffffffffff0200000003000000"}
```
The streaming generator above was not able to send the yield back to the
owner (the driver) because the `RetryableGrpcClient` had been shut down
after the owner was unavailable for more than 1 second, probably due to
being overloaded.
The `RetryableGrpcClient` was shut down because it got `the worker is
dead` message from the raylet on the head node via the
`raylet_client->IsLocalWorkerDead()`, and that message was wrong.
The raylet on the head said the worker was dead wrongly because it
didn't check the drivers it had.
So the fix is simply to let it check the drivers as well, so that
`RetryableGrpcClient` won't be shut down incorrectly.
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is found during debugging a hanging streaming generator. We have these logs:
The streaming generator above was not able to send the yield back to the owner (the driver) because the
RetryableGrpcClienthad been shut down after the owner was unavailable for more than 1 second, probably due to being overloaded.However, the
RetryableGrpcClientwas shut down because it gotthe worker is deadmessage from the raylet on the head node via theraylet_client->IsLocalWorkerDead(), and that message was wrong.The problem is that the
HandleIsLocalWorkerDeadon the raylet didn't check if the worker was one of the drivers it had, so it said the worker was dead wrongly.So the fix is simply to let it check the drivers as well, so that
RetryableGrpcClientwon't be shut down incorrectly.Description
Related issues
Additional information