Skip to content

[core] fix HandleIsLocalWorkerDead for drivers#62688

Merged
edoakes merged 1 commit into
ray-project:masterfrom
rueian:fix-islocalworkerdead
Apr 17, 2026
Merged

[core] fix HandleIsLocalWorkerDead for drivers#62688
edoakes merged 1 commit into
ray-project:masterfrom
rueian:fix-islocalworkerdead

Conversation

@rueian

@rueian rueian commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

This is found during debugging a hanging streaming generator. We have these logs:

{"asctime":"2026-04-09 00:33:06,719","levelname":"W","message":"Core worker 10.0.191.167 has been unavailable for more than 1 seconds","filename":"retryable_grpc_client.cc","lineno":85}
{"asctime":"2026-04-09 00:33:06,724","levelname":"I","message":"Disconnecting core worker client because the worker is dead","filename":"core_worker_client_pool.cc","lineno":55,"worker_id":"02000000ffffffffffffffffffffffffffffffffffffffffffffffff","node_id":"576a2709c6d48e4fdfabe9d993b315652ee8e53d0096bd93bc368379"}
{"asctime":"2026-04-09 00:33:06,724","levelname":"W","message":"Failed to report streaming generator return to the caller. The yield'ed ObjectRef may not be usable. Disconnected: GRPC client is shut down.","filename":"core_worker.cc","lineno":3206,"object_id":"d337091b4a9d35daffffffffffffffffffffffff0200000003000000"}

The streaming generator above was not able to send the yield back to the owner (the driver) because the RetryableGrpcClient had been shut down after the owner was unavailable for more than 1 second, probably due to being overloaded.

However, the RetryableGrpcClient was shut down because it got the worker is dead message from the raylet on the head node via the raylet_client->IsLocalWorkerDead(), and that message was wrong.

The problem is that the HandleIsLocalWorkerDead on the raylet didn't check if the worker was one of the drivers it had, so it said the worker was dead wrongly.

So the fix is simply to let it check the drivers as well, so that RetryableGrpcClient won't be shut down incorrectly.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the NodeManager::HandleIsLocalWorkerDead logic to correctly identify registered drivers as alive, preventing premature disconnections. It includes a regression test for the core worker client pool and new unit tests for the node manager. The review feedback identifies a logic error in the new tests where mock expectations would fail due to boolean short-circuiting in the implementation, and suggests verifying callback execution for better test robustness.

Comment thread src/ray/raylet/tests/node_manager_test.cc
Comment thread src/ray/raylet/tests/node_manager_test.cc
@rueian rueian added go add ONLY when ready to merge, run all tests core Issues that should be addressed in Ray Core labels Apr 16, 2026
@rueian rueian force-pushed the fix-islocalworkerdead branch 2 times, most recently from 965fcb2 to bd78c01 Compare April 16, 2026 23:40
@rueian rueian marked this pull request as ready for review April 16, 2026 23:44
@rueian rueian requested a review from a team as a code owner April 16, 2026 23:44

@dayshah dayshah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, I'm surprised we haven't found more bugs because of this

@dayshah dayshah enabled auto-merge (squash) April 16, 2026 23:50

@edoakes edoakes left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

@edoakes

edoakes commented Apr 17, 2026

Copy link
Copy Markdown
Collaborator

cpp tests failing @rueian

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
@rueian rueian force-pushed the fix-islocalworkerdead branch from bd78c01 to eb7bd98 Compare April 17, 2026 01:05
@github-actions github-actions Bot disabled auto-merge April 17, 2026 01:05
@dayshah dayshah enabled auto-merge (squash) April 17, 2026 01:25
@edoakes edoakes disabled auto-merge April 17, 2026 16:12
@edoakes edoakes merged commit 17fad30 into ray-project:master Apr 17, 2026
6 of 7 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
This is found during debugging a hanging streaming generator. We have
these logs:

```
{"asctime":"2026-04-09 00:33:06,719","levelname":"W","message":"Core worker 10.0.191.167 has been unavailable for more than 1 seconds","filename":"retryable_grpc_client.cc","lineno":85}
{"asctime":"2026-04-09 00:33:06,724","levelname":"I","message":"Disconnecting core worker client because the worker is dead","filename":"core_worker_client_pool.cc","lineno":55,"worker_id":"02000000ffffffffffffffffffffffffffffffffffffffffffffffff","node_id":"576a2709c6d48e4fdfabe9d993b315652ee8e53d0096bd93bc368379"}
{"asctime":"2026-04-09 00:33:06,724","levelname":"W","message":"Failed to report streaming generator return to the caller. The yield'ed ObjectRef may not be usable. Disconnected: GRPC client is shut down.","filename":"core_worker.cc","lineno":3206,"object_id":"d337091b4a9d35daffffffffffffffffffffffff0200000003000000"}
```

The streaming generator above was not able to send the yield back to the
owner (the driver) because the `RetryableGrpcClient` had been shut down
after the owner was unavailable for more than 1 second, probably due to
being overloaded.

The `RetryableGrpcClient` was shut down because it got `the worker is
dead` message from the raylet on the head node via the
`raylet_client->IsLocalWorkerDead()`, and that message was wrong.

The raylet on the head said the worker was dead wrongly because it
didn't check the drivers it had.

So the fix is simply to let it check the drivers as well, so that
`RetryableGrpcClient` won't be shut down incorrectly.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
This is found during debugging a hanging streaming generator. We have
these logs:

```
{"asctime":"2026-04-09 00:33:06,719","levelname":"W","message":"Core worker 10.0.191.167 has been unavailable for more than 1 seconds","filename":"retryable_grpc_client.cc","lineno":85}
{"asctime":"2026-04-09 00:33:06,724","levelname":"I","message":"Disconnecting core worker client because the worker is dead","filename":"core_worker_client_pool.cc","lineno":55,"worker_id":"02000000ffffffffffffffffffffffffffffffffffffffffffffffff","node_id":"576a2709c6d48e4fdfabe9d993b315652ee8e53d0096bd93bc368379"}
{"asctime":"2026-04-09 00:33:06,724","levelname":"W","message":"Failed to report streaming generator return to the caller. The yield'ed ObjectRef may not be usable. Disconnected: GRPC client is shut down.","filename":"core_worker.cc","lineno":3206,"object_id":"d337091b4a9d35daffffffffffffffffffffffff0200000003000000"}
```

The streaming generator above was not able to send the yield back to the
owner (the driver) because the `RetryableGrpcClient` had been shut down
after the owner was unavailable for more than 1 second, probably due to
being overloaded.

The `RetryableGrpcClient` was shut down because it got `the worker is
dead` message from the raylet on the head node via the
`raylet_client->IsLocalWorkerDead()`, and that message was wrong.

The raylet on the head said the worker was dead wrongly because it
didn't check the drivers it had.

So the fix is simply to let it check the drivers as well, so that
`RetryableGrpcClient` won't be shut down incorrectly.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

3 participants